June 1, 2023 14:26:21

Google search results differ from scraped Google results: how can I make them match?

Question

The code is as follows:

from bs4 import BeautifulSoup
import requests
import urllib.parse

# Headers sent with the search request; a real browser sends more than these.
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/113.0.0.0",
}

start_date = input("Enter start date: ")
end_date = input("Enter end date: ")
company_name = input("Enter company name: ")
keyword = input("Enter keyword: ")

# Build the Google News search URL with a custom date range.
url_params = {
    'q': f'{company_name} company {keyword} technology',
    'tbs': f'cdr:1,cd_min:{start_date},cd_max:{end_date}',
    'source': 'lnms',
    'tbm': 'nws',
}
url = 'https://www.google.com/search?' + urllib.parse.urlencode(url_params)

try:
    req = requests.get(url, headers=headers, timeout=10)
    req.raise_for_status()
except requests.exceptions.RequestException:
    # Stop here: without a response there is nothing to parse below.
    raise SystemExit('Connection failed with this website.')

soup = BeautifulSoup(req.content, 'lxml')

# Substrings that mark a link as unwanted (sites and file types to skip).
blocked = (
    'google.com', 'proxyDocument', '.pdf', 'linkedin.com', 'cyberghost',
    'cryptojacking', 'nasdaq.com', 'thecsr', 'nokia.com', 'youtube.com',
    'co.uk', 'getting-started', 'scholar.google', 'quora', 'isitdown',
    'tapinto', 'knownews', 'makeuseof', 'fordmuscle', 'marketbeat',
    'statetime', 'dig-in',
)

# Only consider anchors that actually carry an href attribute.
for anchor in soup.find_all("a", href=True):
    link = anchor['href']
    # Result links look like /url?q=<target>&sa=...; keep https targets only.
    if link.startswith('/url') and 'https' in link and not any(s in link for s in blocked):
        # Drop everything from '&sa' onward, then the leading '/url?q='.
        final_link = link.split('&sa')[0][7:]
        print(final_link)
print(url)

I cannot get the same results when scraping Google with bs4 as when I run the same search in a browser.

I have written a web crawler that takes four inputs: start date, end date, company name, and keyword. These four parameters are put into a Google search, and I also send a User-Agent header. How can I get the same results through scraping that I get when searching Google in a browser?


Answer 1

Score: 1

The request your script sends most likely differs from the one your browser sends, for example in its headers and cookies. You can open the developer tools in your browser and inspect the request on the Network tab to see the details. In addition, Google customizes search results based on information such as your account and your location, so it is difficult to get exactly the same results.
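As a rough illustration of the comparison described above, the sketch below lists the headers that differ between a browser request and the script's request, and builds a search URL with the language and country pinned. The names `diff_headers` and `build_news_url` are hypothetical helpers, the browser header values are placeholders to be replaced with what the developer tools show, and the `M/D/YYYY` date format is an assumption. `hl` and `gl` are Google's interface-language and country parameters; setting them may reduce location-based differences but is not expected to remove them.

```python
from urllib.parse import urlencode


def diff_headers(browser_headers, script_headers):
    """Return {name: (browser_value, script_value)} for headers that differ.

    Header names are compared case-insensitively; a missing header is None.
    """
    browser = {k.lower(): v for k, v in browser_headers.items()}
    script = {k.lower(): v for k, v in script_headers.items()}
    return {
        name: (browser.get(name), script.get(name))
        for name in sorted(browser.keys() | script.keys())
        if browser.get(name) != script.get(name)
    }


def build_news_url(query, start_date, end_date, language="en", country="us"):
    """Build a Google News search URL with language (hl) and country (gl) set."""
    params = {
        "q": query,
        "tbs": f"cdr:1,cd_min:{start_date},cd_max:{end_date}",
        "source": "lnms",
        "tbm": "nws",
        "hl": language,  # interface language
        "gl": country,   # country used for the results
    }
    return "https://www.google.com/search?" + urlencode(params)


# Placeholder values: copy the real ones from the browser's developer tools.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Cookie": "<copied from the browser>",
}
# The headers used by the script in the question.
script_headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/113.0.0.0",
}

# Print every header whose value differs or that only one side sends.
for name, (in_browser, in_script) in diff_headers(browser_headers, script_headers).items():
    print(f"{name}: browser={in_browser!r} script={in_script!r}")

# Hypothetical query and dates, for illustration only.
print(build_news_url("Acme company cloud technology", "1/1/2023", "5/31/2023"))
```

Any header reported by `diff_headers` is a candidate to add to the script's `headers` dictionary. Even with identical headers and cookies, the results may still differ because of personalization.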
