User-Agent error when web scraping with Python 3

Question

This is my first time web scraping. Calling page = requests.get(URL) works perfectly fine, but when I add

    headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}
    page = requests.get(URL, headers=headers)

I get an error:

    title = soup.find(id="productTitle").get_text()
    AttributeError: 'NoneType' object has no attribute 'get_text'

What is going wrong? Should I give up on using headers?

Answer 1

Score: 0

I think the page contains invalid HTML, so BeautifulSoup is not able to find your element.
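
For context on the error itself: find() returns None when no element matches, and calling get_text() on None is what raises the AttributeError from the question. Here is a minimal sketch of a guard, assuming a hypothetical helper named extract_title and two made-up inline HTML strings in place of a live request:

```python
from bs4 import BeautifulSoup


def extract_title(html, element_id="productTitle"):
    """Return the stripped text of the element with the given id, or None if absent.

    extract_title is a hypothetical helper name, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)  # find() returns None when nothing matches
    if element is None:
        return None
    return element.get_text(strip=True)


# Made-up inline pages, used here instead of a live request
with_title = '<html><body><span id="productTitle"> Example Monitor </span></body></html>'
without_title = "<html><body><p>No product markup here</p></body></html>"

print(extract_title(with_title))     # Example Monitor
print(extract_title(without_title))  # None
```

Checking for None first turns the crash into a value you can test for, which makes it easier to see whether the element is really missing from the HTML you received.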

Try to prettify the HTML first:

    import requests
    from bs4 import BeautifulSoup

    URL = 'https://www.amazon.com/dp/B07JP9QJ15/ref=dp_cerb_1'
    headers = {
        "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'
    }
    page = requests.get(URL, headers=headers)
    # Parse once, re-serialize with prettify(), then parse the normalized markup again
    pretty = BeautifulSoup(page.text, 'html.parser').prettify()
    soup = BeautifulSoup(pretty, 'html.parser')
    print(soup.find(id='productTitle').get_text())

This prints:

Dell UltraSharp U2719D - LED Monitor - 27"
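
If the title is still missing, it may help to look at what the server actually sent back before parsing it. The sketch below is only a rough diagnostic under stated assumptions: summarize_response is a hypothetical helper, the sample strings are made up, and the substring check is much cruder than real HTML parsing.

```python
def summarize_response(status_code, html, element_id="productTitle"):
    """Collect a few quick facts about a fetched page.

    summarize_response is a hypothetical helper. The id check is a plain
    substring match on the raw text, so it is only a hint, not a parse.
    """
    markers = ('id="{}"'.format(element_id), "id='{}'".format(element_id))
    return {
        "status_code": status_code,
        "length": len(html),
        "has_element_id": any(marker in html for marker in markers),
    }


# Made-up sample inputs standing in for page.status_code and page.text
print(summarize_response(200, '<span id="productTitle">Example</span>'))
print(summarize_response(503, "<html><body>Try again later</body></html>"))
```

With the answer's code you could call summarize_response(page.status_code, page.text) right after requests.get(URL, headers=headers) and compare the result with and without the headers argument.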

huangapple
  • Posted on January 6, 2020 at 23:44:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59614989.html