January 6, 2020, 17:30:35

Python BeautifulSoup can't identify div tag

Question

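I'm doing web scraping with Python and Beautiful Soup (bs4). I want to select the div tag whose class attribute is a-column a-span6 a-span-last. The div does exist in the page (as the screenshot shows), but BeautifulSoup does not find it. Why?

Here is the screenshot.
link

EDIT:
Code attached: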

from bs4 import BeautifulSoup
import urllib.request
import ssl

# SSL context with certificate verification turned off
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=zg_bs_electronics_35?_encoding=UTF8&psc=1&refRID=YGK101A649HEC8NXXM1T'
# Request the page with a browser-like User-Agent header
req = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(url=req, context=ctx)
html = response.read().decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')
soup.find_all('div', 'a-column a-span6 a-span-last', recursive=True)

Answer 1

Score: 0

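The issue is with html.parser. Different parsers can build different trees from the same document, especially when the markup is invalid, so the results you get back depend on which parser you use. The Beautiful Soup documentation covers this in its section on the differences between parsers.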
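To illustrate how parsers can disagree, here is a minimal sketch that feeds one deliberately malformed snippet to each parser that happens to be installed. The snippet and the helper name parse_with are made up for illustration; this is not the Amazon page, and it does not prove which part of that page trips up html.parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup (a stray closing </p>), made up for illustration.
BROKEN_HTML = "<a></p>"


def parse_with(parser_name):
    """Return the parsed tree as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN_HTML, parser_name))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser library is missing.
        return None


# html.parser ships with Python; lxml and html5lib are optional third-party packages.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}

for name, tree in results.items():
    print(f"{name:12} -> {tree if tree is not None else 'not installed'}")
```

If the printed trees differ, a search that succeeds against one tree may come back empty against another.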

Changing the parser to 'lxml' gave me output:

from bs4 import BeautifulSoup
import urllib.request
import ssl

# SSL context with certificate verification turned off (as in the question)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=zg_bs_electronics_35?_encoding=UTF8&psc=1&refRID=YGK101A649HEC8NXXM1T'
# Request the page with a browser-like User-Agent header
req = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(url=req, context=ctx)
html = response.read().decode('utf-8')

# The lxml package must be installed for this parser to be available
soup = BeautifulSoup(html, 'lxml')   # <----- CHANGE MADE HERE
divs = soup.find_all('div', {'class': 'a-column a-span6 a-span-last'})
print(divs)
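One related detail that may matter regardless of the parser: when find_all is given a multi-word class string, Beautiful Soup compares it against the exact value of the class attribute, so the order of the classes matters. A CSS selector via select() requires the same classes in any order. The sketch below uses made-up sample HTML (the names SAMPLE_HTML, exact_matches, and css_matches are illustrative), not the real Amazon page.

```python
from bs4 import BeautifulSoup

# Made-up sample: the same three classes written in two different orders,
# plus one div that is missing a class.
SAMPLE_HTML = """
<div class="a-column a-span6 a-span-last">first</div>
<div class="a-span-last a-column a-span6">second</div>
<div class="a-column a-span6">third</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-word string is matched against the exact class attribute value,
# so only the div whose classes appear in this exact order is returned.
exact_matches = soup.find_all("div", class_="a-column a-span6 a-span-last")

# A CSS selector requires all three classes but ignores their order.
css_matches = soup.select("div.a-column.a-span6.a-span-last")

print([div.get_text() for div in exact_matches])  # ['first']
print([div.get_text() for div in css_matches])    # ['first', 'second']
```

If the target page ever emits the classes in a different order, the selector form is the more forgiving of the two.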
