Why is BeautifulSoup returning None when scraping google search results?

Question

I'm trying to use BeautifulSoup to find the birth years of different authors. I'm working in VS Code, if that's relevant. This is my first attempt at web scraping, so please explain things as clearly as possible.

For authors with Wikipedia pages, I can successfully find birth years using the following code:

source_code = requests.get("a_wikipedia_url")
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")
finder = soup.find("span", {"class": "bday"})
if finder is not None:
    birth_year = finder.string[0:4]
    return birth_year

However, when I try the same thing with a Google search for authors with no (English) Wikipedia page, I just get None.

After reading this question https://stackoverflow.com/questions/62466340/cant-scrape-google-search-results-with-beautifulsoup I added a User-Agent request header to requests.get (I'm using Chrome Version 114.0.5735.134 (Official Build) (64-bit) and Windows 11 Home), but all it did was print None instead of raising AttributeError: 'NoneType' object has no attribute 'string', which is what I was getting before adding the header.

This is my code:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.134 Safari/537.36"}
source_code = requests.get("https://www.google.com/search?q=Guillermo+Saccomanno", headers=headers)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")
google_finder = soup.find("span", {"class": "LrzXr kno-fv wHYlTd z8gr9e"})
print(google_finder.string)

The result is just None - no error message, but no text.

I also tried setting the Chrome version in the header to Chrome/114.0.0.0, which is what I found online. It still gives None.

I'm not sure where I'm going wrong, as the syntax is identical and I copied the class name from the page source. For this particular author, I would expect google_finder.string to be "9 June 1948 (age 75 years)".

Answer 1

Score: 1

You can use selenium to render the web page first and then search for the element.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
# Comment out the line above to show the browser window
driver = webdriver.Chrome(options=options)

search_query = "Guillermo Saccomanno"
search_url = f"https://www.google.com/search?q={search_query}"
driver.get(search_url)

google_finder = driver.find_element(By.CLASS_NAME, "LrzXr")
# You can use any of the classes as long as you make sure that the class is unique: LrzXr kno-fv wHYlTd z8gr9e

result_text = google_finder.text
print(result_text)

driver.quit()
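
A note on the None in the question: once the User-Agent header was added, the code printed None instead of raising AttributeError, which suggests the span was found but its .string was empty. In BeautifulSoup, .string is None whenever a tag has more than one child, whereas .get_text() (like Selenium's .text above) joins all nested text. A minimal sketch, assuming hypothetical markup in which the date is split across nested tags (the real Google markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from Google):
# the span holds a link plus a trailing text node, i.e. two children.
html = '<span class="LrzXr kno-fv"><a href="#">9 June 1948</a> (age 75 years)</span>'

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"class": "LrzXr"})

string_value = span.string    # None, because the span has more than one child
text_value = span.get_text()  # concatenates the text of all descendants

print(string_value)
print(text_value)
```

If this is what happens on the live page, replacing google_finder.string with google_finder.get_text() in the question's code may be enough, provided the class name is present in the HTML that requests receives.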

Answer 2

Score: 1

If you want to parse the birth date, I'd choose a different strategy: find a <span> tag containing the text "Born:" and then take its next sibling. Also add the hl=en parameter to the URL to get English results:

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.com/search?q=Guillermo+Saccomanno&hl=en'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

born = soup.select_one('span:-soup-contains("Born:") + span')
print(born.text)

Prints:

June 9, 1948 (age 75 years), Buenos Aires, Argentina
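
Since the question is ultimately after the birth year, here is a rough sketch of the two plain-Python steps around this approach: building the search URL with the q and hl parameters properly encoded, and pulling the four-digit year out of the printed text. The helper names build_search_url and extract_year are made up for illustration, and the regular expression simply assumes the first four-digit number in the string is the birth year.

```python
import re
from urllib.parse import urlencode


def build_search_url(query, lang="en"):
    """Return a Google search URL with the query and hl parameters URL-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query, "hl": lang})


def extract_year(text):
    """Return the first standalone four-digit number in text, or None if absent."""
    match = re.search(r"\b(\d{4})\b", text)
    return match.group(1) if match else None


url = build_search_url("Guillermo Saccomanno")
year = extract_year("June 9, 1948 (age 75 years), Buenos Aires, Argentina")

print(url)
print(year)
```

Using urlencode means names containing spaces or accented characters do not have to be escaped by hand, and returning None from extract_year lets the caller skip authors for whom no date was found.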

Answer 3

Score: 0

Try using googlesearch:

  1. Install the library: pip install google

  2. Use the code:

from googlesearch import search

r = search("something", stop=20, num=5, pause=3)  # search for: something

for i in r:
    print(i)
