June 18, 2023, 22:14:22

Why is BeautifulSoup returning None when scraping google search results?

Question

I'm trying to use BeautifulSoup to find the birth years of different authors. I'm working in VS Code, if that's relevant. This is my first attempt at web scraping, so please explain things as clearly as possible.

For authors with Wikipedia pages, I can successfully find birth years using the following code:

import requests
from bs4 import BeautifulSoup
# (excerpt from inside a function, which is why it ends with a return)
source_code = requests.get("a_wikipedia_url")
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")
finder = soup.find("span", {"class": "bday"})
if finder is not None:
    birth_year = finder.string[0:4]
    return birth_year

However, when I try the same thing with a Google search for authors with no (English) Wikipedia page, I just get None.

After reading this question https://stackoverflow.com/questions/62466340/cant-scrape-google-search-results-with-beautifulsoup I added a User-Agent request header to requests.get (I'm using Chrome Version 114.0.5735.134 (Official Build) (64-bit) on Windows 11 Home). All that changed is that the code now prints None instead of raising AttributeError: 'NoneType' object has no attribute 'string', which is what I was getting before adding the header.

This is my code:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.134 Safari/537.36"}
source_code = requests.get("https://www.google.com/search?q=Guillermo+Saccomanno", headers=headers)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")
google_finder = soup.find("span", {"class": "LrzXr kno-fv wHYlTd z8gr9e"})
print(google_finder.string)

The result is just None - no error message, but no text.

I also tried setting the Chrome version in the header to Chrome/114.0.0.0, which is what I found online. It still gives None.

I'm not sure where I'm going wrong, as the syntax is identical and I copied the class name from the page source. For this particular author, I would expect google_finder.string to be "9 June 1948 (age 75 years)".

Answer 1

Score: 1

You can use Selenium to render the web page first and then search for the element.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
options = Options()
# Run Chrome without a visible window; comment this line out to show the window
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
search_query = "Guillermo Saccomanno"
search_url = f"https://www.google.com/search?q={search_query}"
driver.get(search_url)
# Any one of the classes (LrzXr, kno-fv, wHYlTd, z8gr9e) works, as long as it is unique on the page
google_finder = driver.find_element(By.CLASS_NAME, "LrzXr")
result_text = google_finder.text
print(result_text)
driver.quit()
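
The search URL in this answer is built by dropping the raw query, space included, into an f-string. A browser tolerates that, but other HTTP clients may not, so it can be safer to percent-encode the query first. Below is a minimal sketch using only the standard library. The helper name build_search_url and its lang parameter are hypothetical, introduced here for illustration; hl is Google's interface-language query parameter.

```python
from urllib.parse import urlencode


def build_search_url(query: str, lang: str = "en") -> str:
    """Return a Google search URL with the query string percent-encoded."""
    # urlencode turns spaces into "+" and escapes reserved characters
    # such as "&" and "+" so they cannot break the query string.
    params = {"q": query, "hl": lang}
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("Guillermo Saccomanno"))
# https://www.google.com/search?q=Guillermo+Saccomanno&hl=en
```

The returned string can be passed to driver.get or requests.get unchanged.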

Answer 2

Score: 1

If you want to parse the birth date, I'd choose a different strategy: find a <span> tag containing the text "Born:" and then take its next sibling. Also add the hl=en parameter to the URL to get English results:

import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?q=Guillermo+Saccomanno&hl=en'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
born = soup.select_one('span:-soup-contains("Born:") + span')
print(born.text)

Prints:

June 9, 1948 (age 75 years), Buenos Aires, Argentina
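
The change described in the question, from an AttributeError to a printed None, is consistent with how BeautifulSoup's .string behaves: it returns None when a tag has more than one child, even though find did locate the tag. Reading .text or get_text() avoids this, and checking for None before using the result avoids the AttributeError when nothing matches. The sketch below illustrates both points on made-up markup. SAMPLE_HTML and birth_year are hypothetical names, and the markup is an assumption for illustration, not Google's real page structure.

```python
import re
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search result panel (an assumption).
SAMPLE_HTML = """
<div>
  <span>Born:</span>
  <span class="LrzXr"><span>June 9, 1948</span> (age 75 years)</span>
</div>
"""


def birth_year(html: str) -> Optional[str]:
    """Return the first four-digit year in the matched span, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", {"class": "LrzXr"})
    if tag is None:
        # Nothing matched: return None instead of raising AttributeError.
        return None
    match = re.search(r"\b(\d{4})\b", tag.get_text())
    return match.group(1) if match else None


tag = BeautifulSoup(SAMPLE_HTML, "html.parser").find("span", {"class": "LrzXr"})
print(tag.string)      # None, because the tag has two children
print(tag.get_text())  # June 9, 1948 (age 75 years)
print(birth_year(SAMPLE_HTML))             # 1948
print(birth_year("<p>no panel here</p>"))  # None
```

Whether this is the whole story for a live Google page depends on the HTML Google actually returns, which can differ by User-Agent and locale.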

Answer 3

Score: 0

Try the googlesearch module.

Install the library: pip install google
Then use the following code:

from googlesearch import search
r = search("something", stop=20, num=5, pause=3)  # search for "something"
for i in r:
    print(i)  # each result is a URL
