Python BeautifulSoup can't identify div tag

Question

I'm doing web scraping with Python's Beautiful Soup library. I want to filter out the div tag whose class attribute value is a-column a-span6 a-span-last. This div tag does exist (as shown in the screenshot), but BeautifulSoup can't find it. Why?

Here is the screenshot:
link

EDIT:
Code attached:

from bs4 import BeautifulSoup
import urllib.request
import ssl

# Disable certificate verification so the request doesn't fail on SSL errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=zg_bs_electronics_35?_encoding=UTF8&psc=1&refRID=YGK101A649HEC8NXXM1T'
req = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(url=req, context=ctx)
html = response.read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('div', 'a-column a-span6 a-span-last', recursive=True)
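As an aside, BeautifulSoup supports several ways to match an element by class, and they behave differently with multi-valued class attributes. A minimal standalone sketch on a made-up snippet (the markup below is illustrative, not Amazon's actual HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical minimal snippet to illustrate class matching
html = '<div class="a-column a-span6 a-span-last"><p>price</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# A plain string matches when it equals the full class attribute value
exact = soup.find_all('div', 'a-column a-span6 a-span-last')

# class_ with a single class name matches any element carrying that class
partial = soup.find_all('div', class_='a-span-last')

# A CSS selector requires all listed classes to be present, in any order
selected = soup.select('div.a-column.a-span6.a-span-last')

print(len(exact), len(partial), len(selected))
```

The CSS-selector form is often the most robust choice for multi-class elements, since it does not depend on the exact order of the class names in the attribute.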

Answer 1

Score: 0

The issue is with html.parser. Depending on the parser used, you can get back different results. Read more about it here.

Changing the parser to 'lxml' gave me an output:

from bs4 import BeautifulSoup
import urllib.request
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url='https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=zg_bs_electronics_35?_encoding=UTF8&psc=1&refRID=YGK101A649HEC8NXXM1T'
req=urllib.request.Request(url=url, headers={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response=urllib.request.urlopen(url=req,context=ctx)
html=response.read().decode('utf-8')
soup=BeautifulSoup(html,'lxml')   # <----- CHANGE MADE HERE
soup.find_all('div',{'class':'a-column a-span6 a-span-last'})
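To see why the parser choice matters, here is a small standalone comparison (it requires lxml to be installed; the invalid fragment is adapted from the behavior described in the bs4 documentation):

```python
from bs4 import BeautifulSoup

# The same invalid fragment is repaired differently by each parser
fragment = '<a></p>'

builtin = str(BeautifulSoup(fragment, 'html.parser'))
lxml_out = str(BeautifulSoup(fragment, 'lxml'))

print('html.parser:', builtin)   # the stray </p> is simply dropped
print('lxml:', lxml_out)         # lxml also wraps the result in <html>/<body>

# The two parsers produce different trees from identical input
print(builtin != lxml_out)
```

On messy real-world pages like Amazon's, these differing repair strategies can place a tag in different parts of the tree, which is why a find_all that fails under one parser can succeed under another.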

huangapple
  • Posted on January 6, 2020 at 17:30:35
  • When republishing, please keep this link: https://go.coder-hub.com/59609603.html