User-agent error with web scraping in Python 3

Question

This is my first time web scraping. With page = requests.get(URL) everything works fine, but when I add a User-Agent header:

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}

page = requests.get(URL, headers=headers)

I get this error:

    title = soup.find(id="productTitle").get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

What is going wrong? Should I give up on using headers?

Answer 1

Score: 0

I think the page contains invalid HTML, so BeautifulSoup is not able to find your element.

Try to prettify the HTML first:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/dp/B07JP9QJ15/ref=dp_cerb_1'
# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}
page = requests.get(URL, headers=headers)

# Parse once, re-serialize with prettify(), then parse the prettified markup again
pretty = BeautifulSoup(page.text, 'html.parser').prettify()
soup = BeautifulSoup(pretty, 'html.parser')
print(soup.find(id='productTitle').get_text())

This prints:

Dell UltraSharp U2719D - LED Monitor - 27"
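
The AttributeError in the question means that soup.find(id="productTitle") returned None: the parsed document had no element with that id, and find() returns None rather than raising. A minimal sketch of a guard around the lookup, run here on made-up inline HTML instead of a live request. The helper name extract_title and both HTML strings are illustrative assumptions, not part of the original code:

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_title(html: str, element_id: str = "productTitle") -> Optional[str]:
    """Return the stripped text of the element with the given id, or None.

    Hypothetical helper for illustration; it only wraps the find() call
    so that a missing element does not raise AttributeError.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        # Nothing matched: report that instead of calling get_text() on None
        return None
    return element.get_text().strip()


# Made-up HTML snippets standing in for two different server responses
html_with_title = '<html><body><span id="productTitle">\n  Example product\n</span></body></html>'
html_without_title = '<html><body><p>Some other page</p></body></html>'

print(extract_title(html_with_title))     # text of the matching element
print(extract_title(html_without_title))  # None, no AttributeError
```

With the code above, this would be called as extract_title(page.text). A None result is a cue to look at page.status_code and the beginning of page.text to see what the server actually sent for that request.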

huangapple
  • Posted on January 6, 2020 at 23:44:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59614989.html