Question

Google search results are not the same as the results I get by scraping Google. How do I make them match?
I am not able to get the same results when scraping Google with bs4 as when I search the same thing in a browser.

I have written a web crawler that takes a start date, an end date, a company name, and a keyword as input and puts these four parameters into a Google search. I also set a User-Agent header. How can I get the same results through scraping that I get while browsing Google?

Here is the code:

from bs4 import BeautifulSoup
import requests
from ssl import SSLCertVerificationError
from urllib3.exceptions import MaxRetryError
import urllib.parse
import urllib3
from urllib.error import HTTPError, URLError
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/113.0.0.0"
}

start_date = input("Enter start date: ")
end_date = input("Enter end date: ")
company_name = input("Enter company name: ")
keyword = input("Enter keyword: ")

try:
    url_params = {
        'q': f'{company_name} company {keyword} technology ',
        'tbs': f'cdr:1,cd_min:{start_date},cd_max:{end_date}',
        'source': 'lnms',
        'tbm': 'nws'
    }
    url = 'https://www.google.com/search?' + urllib.parse.urlencode(url_params)
    req = requests.get(url, headers=headers, verify=False)
    soup = BeautifulSoup(req.content, 'lxml')
except (requests.exceptions.SSLError, SSLCertVerificationError, MaxRetryError,
        requests.exceptions.InvalidSchema, requests.exceptions.MissingSchema,
        requests.exceptions.ConnectionError, requests.exceptions.TooManyRedirects,
        requests.exceptions.ChunkedEncodingError, TypeError, HTTPError, URLError,
        ValueError):
    print('Connection failed with this website.')
    raise SystemExit(1)  # soup is undefined here, so the loop below cannot run

# Substrings that mark a link as irrelevant for this crawl
excluded = ('google.com', 'proxyDocument', '.pdf', 'linkedin.com', 'cyberghost',
            'cryptojacking', 'nasdaq.com', 'thecsr', 'nokia.com', 'youtube.com',
            'co.uk', 'getting-started', 'scholar.google', 'quora', 'isitdown',
            'tapinto', 'knownews', 'makeuseof', 'fordmuscle', 'marketbeat',
            'statetime', 'dig-in')

for anchor in soup.findAll("a"):
    link = anchor.get('href', '')
    # Keep only Google's /url redirect links that point to an external https page
    if link[0:4] == '/url' and 'https' in link and not any(s in link for s in excluded):
        # Drop Google's tracking parameters (everything from '&sa' on),
        # then strip the leading '/url?q=' (7 characters) to get the real URL
        final_link = link.split('&sa')[0][7:]
        print(final_link)

print(url)
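As an aside, slicing off the first seven characters and splitting on '&sa' only works while Google keeps that exact /url?q=...&sa=... layout. A more robust sketch (the example href below is made up, not taken from a real results page) is to parse the redirect link with urllib.parse and read the q query parameter directly:

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the destination URL from a Google '/url?q=...' redirect link,
    or None if the link has no 'q' parameter."""
    query = parse_qs(urlparse(href).query)
    return query.get('q', [None])[0]

# Hypothetical redirect link of the shape Google emits on result pages
href = '/url?q=https://example.com/article&sa=U&ved=abc123'
print(extract_target(href))  # https://example.com/article
```

This way a change in parameter order, or a target URL that happens to contain the substring '&sa', does not silently corrupt the extracted link.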
Answer 1 (score: 1)

It is likely that the request parameters are different, such as headers, cookies, etc. You can open the developer tools in your browser to see the details. Additionally, Google customizes search results based on some of your personal account information, like your location and so on. Therefore, it is difficult for the results to be exactly the same.
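One way to narrow the gap is to copy the full header set your browser sends (visible in the devtools Network tab) into a requests.Session. The header values below are illustrative placeholders, not the exact ones your browser sends, and cookies from a logged-in browser session would still differ:

```python
import requests

# Illustrative browser-like headers; copy the real values from your own
# browser's devtools (Network tab -> the search request -> Request Headers)
browser_headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

session = requests.Session()
session.headers.update(browser_headers)

# Build (without sending) the same kind of news search the crawler issues,
# to confirm the merged headers and the final URL
prepared = session.prepare_request(
    requests.Request("GET", "https://www.google.com/search",
                     params={"q": "example company cloud technology", "tbm": "nws"})
)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Sending the request through the same Session also keeps any cookies Google sets across requests, which a bare requests.get call discards.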