Python BeautifulSoup can't identify div tag
Question
Here I'm doing web scraping with Python's BeautifulSoup. I want to filter out the div tag whose class attribute's value is a-column a-span6 a-span-last. This div tag does exist (as shown in the screenshot), but BeautifulSoup can't identify it. Why?
Here is the screenshot.
link
EDIT:
Code attached:
from bs4 import BeautifulSoup
import urllib.request
import ssl

# Disable certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=zg_bs_electronics_35?_encoding=UTF8&psc=1&refRID=YGK101A649HEC8NXXM1T'
req = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(url=req, context=ctx)
html = response.read().decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')
soup.find_all('div', 'a-column a-span6 a-span-last', recursive=True)
Answer 1
Score: 0
The issue is with html.parser. Depending on the parser used, you can get back different results. Read more about it here.
Changing the parser to 'lxml' gave me output:
from bs4 import BeautifulSoup
import urllib.request
import ssl

# Disable certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=zg_bs_electronics_35?_encoding=UTF8&psc=1&refRID=YGK101A649HEC8NXXM1T'
req = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(url=req, context=ctx)
html = response.read().decode('utf-8')

soup = BeautifulSoup(html, 'lxml')  # <----- CHANGE MADE HERE
soup.find_all('div', {'class': 'a-column a-span6 a-span-last'})
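A note on the class match itself: when you pass a string with spaces, BeautifulSoup matches it against the full class attribute verbatim, so it is order-sensitive. A CSS selector is more robust. A minimal sketch, using a made-up HTML snippet (not the actual Amazon page), contrasting the two:

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the Amazon page
html = '<div class="a-column a-span6 a-span-last"><span>$89.99</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Exact match: the string must equal the class attribute verbatim,
# so a reordered or extended class list would not match.
exact = soup.find_all('div', class_='a-column a-span6 a-span-last')

# CSS selector: matches as long as all three classes are present,
# regardless of order or extra classes.
selected = soup.select('div.a-column.a-span6.a-span-last')

print(len(exact), len(selected))  # 1 1
```

The select() form is usually the safer choice when scraping pages whose class lists may change order between renders.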