Google search results are not the same as scraped Google results; how can I get them to match?

Question

I cannot get the same results when scraping Google with bs4 as when I run the same search in a browser.

I have written a web crawler that takes a start date, end date, company name, and keyword as input. These four parameters are put into a Google search. I also send a User-Agent header. How can I get the same results through scraping that I get when searching Google in a browser?

Here is the code:

from bs4 import BeautifulSoup
import requests
from ssl import SSLCertVerificationError
from urllib3.exceptions import MaxRetryError
import urllib.parse
import urllib3
from urllib.error import HTTPError, URLError

# Silence the warning that requests emits because of verify=False below
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/113.0.0.0"
}

# Substrings that mark a link as irrelevant
blocked = (
    'google.com', 'proxyDocument', '.pdf', 'linkedin.com', 'cyberghost',
    'cryptojacking', 'nasdaq.com', 'thecsr', 'nokia.com', 'youtube.com',
    'co.uk', 'getting-started', 'scholar.google', 'quora', 'isitdown',
    'tapinto', 'knownews', 'makeuseof', 'fordmuscle', 'marketbeat',
    'statetime', 'dig-in',
)

# Dates are passed to Google's custom date range as typed (commonly M/D/YYYY)
start_date = input("Enter start date: ")
end_date = input("Enter end date: ")
company_name = input("Enter company name: ")
keyword = input("Enter keyword: ")

url_params = {
    'q': f'{company_name} company {keyword} technology ',
    'tbs': f'cdr:1,cd_min:{start_date},cd_max:{end_date}',
    'source': 'lnms',
    'tbm': 'nws'
}
url = 'https://www.google.com/search?' + urllib.parse.urlencode(url_params)

try:
    req = requests.get(url, headers=headers, verify=False)
    soup = BeautifulSoup(req.content, 'lxml')
except (requests.exceptions.SSLError, SSLCertVerificationError, MaxRetryError,
        requests.exceptions.InvalidSchema, requests.exceptions.MissingSchema,
        requests.exceptions.ConnectionError, requests.exceptions.TooManyRedirects,
        requests.exceptions.ChunkedEncodingError, TypeError, HTTPError, URLError,
        ValueError):
    print('Connection failed with this website.')
    # Stop here: without a response there is no soup to parse
    raise SystemExit(1)

# Only consider anchors that actually carry an href attribute
for a_tag in soup.find_all("a", href=True):
    link = a_tag['href']
    # Keep Google redirect links to https pages that match no blocked substring
    if link.startswith('/url') and 'https' in link and not any(s in link for s in blocked):
        # Drop everything from '&sa' onwards, then the leading '/url?q='
        final_link = link.split('&sa')[0][7:]
        print(final_link)
print(url)
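
The slicing step in the code above (splitting on '&sa' and dropping the first seven characters) could also be done with urllib.parse, which additionally decodes percent-encoded characters in the target. This is only a minimal sketch, assuming the redirect links have the form /url?q=<target>&sa=...; extract_target is a hypothetical helper name, not part of the original code.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target(href):
    """Return the destination of a Google '/url?q=...' redirect link, or None."""
    parts = urlsplit(href)
    if parts.path != "/url":
        return None
    # parse_qs splits the query string and percent-decodes each value
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else None


# Hypothetical sample links, for illustration only
print(extract_target("/url?q=https://example.com/news%3Fid%3D1&sa=U&ved=abc"))
print(extract_target("/search?q=something"))
```

Links that are not redirect links simply yield None, so the caller can skip them without relying on fixed character offsets.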

Answer 1

Score: 1

The request parameters, such as headers and cookies, are probably different. You can open the developer tools in your browser to see the details. In addition, Google customizes search results based on personal account information such as your location, so it is difficult to get exactly the same results.
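
One way to act on this is to make the scripted request resemble the browser's request as closely as possible. The sketch below is not a guaranteed fix. It assumes you copy the User-Agent and cookies by hand from the Network tab of the developer tools, and that pinning the language and country with Google's hl and gl URL parameters reduces location-based differences. The names build_news_url, BROWSER_HEADERS and BROWSER_COOKIES are hypothetical, and the angle-bracket strings are placeholders.

```python
from urllib.parse import urlencode

# Values copied by hand from the browser's developer tools (Network tab).
# The angle-bracket strings are placeholders, not real values.
BROWSER_HEADERS = {
    "User-Agent": "<paste the exact User-Agent string your browser sends>",
    "Accept-Language": "en-US,en;q=0.9",
}
BROWSER_COOKIES = {"<cookie name>": "<cookie value>"}


def build_news_url(query, start_date, end_date, hl="en", gl="us"):
    """Build a Google News search URL with an explicit language and country."""
    params = {
        "q": query,
        "tbm": "nws",
        "tbs": f"cdr:1,cd_min:{start_date},cd_max:{end_date}",
        "hl": hl,  # interface language (assumed to match the browser's)
        "gl": gl,  # country used for results (assumed to match the browser's)
    }
    return "https://www.google.com/search?" + urlencode(params)


# Hypothetical inputs, for illustration only
url = build_news_url("Acme company AI technology", "1/1/2023", "5/31/2023")
print(url)
```

With requests, these values would be passed as requests.get(url, headers=BROWSER_HEADERS, cookies=BROWSER_COOKIES). Comparing the printed URL with the one in the browser's address bar is a quick check for parameters that still differ. Even then the results may not match exactly, since Google can serve different markup to clients that do not run JavaScript.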

huangapple
  • Posted on June 1, 2023 at 14:26:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76379182.html