January 6, 2020, 23:44:00

User-agent error with web scraping python3

Question

This is my first time web scraping. When I use page = requests.get(URL), it works perfectly fine, but when I add headers:

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}
page = requests.get(URL, headers=headers)

I get an error:

    title = soup.find(id="productTitle").get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

What's wrong with that? Should I give up on using headers?

Answer 1

Score: 0

I think the page contains invalid HTML, and therefore BeautifulSoup is not able to find your element.

Try to prettify the HTML first, then parse the prettified output:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/dp/B07JP9QJ15/ref=dp_cerb_1'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}
page = requests.get(URL, headers=headers)
# First pass: parse the raw response and re-serialize it as normalized markup
pretty = BeautifulSoup(page.text, 'html.parser').prettify()
# Second pass: parse the normalized markup and look up the element by id
soup = BeautifulSoup(pretty, 'html.parser')
print(soup.find(id='productTitle').get_text())

Which returns:

Dell UltraSharp U2719D - LED Monitor - 27"
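
Separately from the prettify approach, it may help to see what the traceback is saying: soup.find() returns None when no element matches, and calling get_text() on None raises the AttributeError from the question. The sketch below guards against that case. The helper name extract_title and the two sample HTML strings are hypothetical, made up for illustration; they stand in for whatever the server actually sent back, which the original post does not show.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_title(html: str, element_id: str = "productTitle") -> Optional[str]:
    """Return the stripped text of the element with the given id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id=element_id)  # find() returns None when nothing matches
    if node is None:
        return None
    return node.get_text(strip=True)


# Hypothetical sample documents standing in for real responses.
with_title = '<html><body><span id="productTitle"> Example Monitor </span></body></html>'
without_title = "<html><body><p>Please verify you are human</p></body></html>"

print(extract_title(with_title))     # Example Monitor
print(extract_title(without_title))  # None
```

If a helper like this returns None for a live page, printing page.status_code and the first few hundred characters of page.text is a reasonable next step to check whether the response is the product page at all.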
