Gathering data from clutch.co: some issues with BS4 while working on Colab

Question


<details>
<summary>English:</summary>

**Update:** What about Selenium support in Colab? I have checked this; see below.

**Update 2:** Thanks to badduker and his reply with the Colab workaround and its results, I have tried to add some more code in order to parse some of the results:


    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup
    import pandas as pd

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(options=options)
    driver.get("https://clutch.co/it-services/msp")
    page_source = driver.page_source
    driver.quit()

    soup = BeautifulSoup(page_source, "html.parser")

    # Extract the data using some BeautifulSoup selectors
    # For example, let's extract the names and locations of the companies
    company_names = [name.text for name in soup.select(".company-name")]
    company_locations = [location.text for location in soup.select(".locality")]

    # Store the data in a Pandas DataFrame
    data = {
        "Company Name": company_names,
        "Location": company_locations
    }

    df = pd.DataFrame(data)

    # Save the DataFrame to a CSV file
    df.to_csv("clutch_data.csv", index=False)


But this leads to no results.

I will try to dig deeper into that, but probably in a new thread. Thank you, dear badduker.

End of the last update (the second update), written on June 22nd, Malaga.



Good day, dear experts. At the moment I am trying to figure out a simple way to obtain data from clutch.co.

Note: I work with Google Colab, and sometimes I think that some approaches are not supported on my Colab account, some of them due to Cloudflare-related issues.

But see this one:

    import requests
    from bs4 import BeautifulSoup
    url = 'https://clutch.co/it-services/msp'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    links = []
    for l in soup.find_all('li', class_='website-link website-link-a'):
        results = (l.a.get('href'))
        links.append(results)

    print(links)

This also does not work: it gives back an empty result. Do you have any idea how to solve the issue?


Update: Hello, dear user510170. Many thanks for the answer and the Selenium solution. I tried it out in Google Colab and got the following result:

    --------------------------------------------------------------------------
    WebDriverException                        Traceback (most recent call last)
    <ipython-input-2-4f37092106f4> in <cell line: 4>()
          2 from selenium import webdriver
          3 
    ----> 4 driver = webdriver.Chrome()
          5 
          6 url = 'https://clutch.co/it-services/msp'

    5 frames
    /usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
        243                 alert_text = value["alert"].get("text")
        244             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
    --> 245         raise exception_class(message, screen, stacktrace)

    WebDriverException: Message: unknown error: cannot find Chrome binary
    Stacktrace:
    #0 0x56199267a4e3 <unknown>
    #1 0x5619923a9c76 <unknown>
    #2 0x5619923d0757 <unknown>
    #3 0x5619923cf029 <unknown>
    #4 0x56199240dccc <unknown>
    #5 0x56199240d47f <unknown>
    #6 0x561992404de3 <unknown>
    #7 0x5619923da2dd <unknown>
    #8 0x5619923db34e <unknown>
    #9 0x56199263a3e4 <unknown>
    #10 0x56199263e3d7 <unknown>
    #11 0x561992648b20 <unknown>
    #12 0x56199263f023 <unknown>
    #13 0x56199260d1aa <unknown>
    #14 0x5619926636b8 <unknown>
    #15 0x561992663847 <unknown>
    #16 0x561992673243 <unknown>
    #17 0x7efc5583e609 start_thread


To me it seems to have to do with line 4:

    ----> 4 driver = webdriver.Chrome()

Is it this line that needs a minor correction?
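
A note on the traceback: the message `cannot find Chrome binary` suggests that the browser itself is missing from the machine, rather than a mistake in that line. Below is a minimal sketch for checking this before starting the driver. It assumes the browser, if installed, sits on `PATH` under one of a few common executable names; the tuple `CANDIDATE_BROWSERS` and the helper `find_browser_binary` are illustrative names, not part of Selenium.

```python
import shutil

# Executable names that are commonly used for Chrome/Chromium on Linux.
# Which of them exist on a given image is an assumption to verify.
CANDIDATE_BROWSERS = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_browser_binary(candidates=CANDIDATE_BROWSERS):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


binary = find_browser_binary()
if binary is None:
    print("No Chrome/Chromium binary found on PATH; the driver will likely fail to start.")
else:
    print(f"Found a browser binary at {binary}")
```

If nothing is found, installing a browser is the step to take before touching the `webdriver.Chrome()` call.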


Update: Thanks to tarun I became aware of this workaround:

https://medium.com/cubemail88/automatically-download-chromedriver-for-selenium-aaf2e3fd9d81

I applied it in Google Colab and tried to run the following:


    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # Selenium 4 takes the driver path through a Service object
    # instead of as a positional argument.
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    browser.get("https://www.reddit.com/")
    browser.quit()



Well, in the end it should be possible to run this code in Colab:

    import requests
    from bs4 import BeautifulSoup
    url = 'https://clutch.co/it-services/msp'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    links = []
    for l in soup.find_all('li', class_='website-link website-link-a'):
        results = (l.a.get('href'))
        links.append(results)

    print(links)


Update: See below for the check in Colab, and the question: is Colab generally Selenium-capable and Selenium-ready?

[![enter image description here][1]][1]


  [1]: https://i.stack.imgur.com/HVpUZ.png

I look forward to hearing from you.



Thanks to @user510170, who pointed me to another approach: https://stackoverflow.com/questions/51046454/

Recently Google Colab was upgraded, and since Ubuntu 20.04+ no longer distributes chromium-browser outside of a snap package, you can install a compatible version from the Debian buster repository:


Then you can run Selenium like this:

    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')    # run without a display
    chrome_options.add_argument('--no-sandbox')  # Chrome refuses to run as root otherwise
    # chromedriver is expected on PATH; recent Selenium 4 releases
    # no longer accept its path as a positional argument
    wd = webdriver.Chrome(options=chrome_options)
    wd.get("https://www.webite-url.com")

See this thread: https://stackoverflow.com/questions/51046454/

I still need to try this out on Colab.



</details>


# Answer 1
**Score**: 2

If you run `print(response.content)`, you will see the following: `Enable JavaScript and cookies to continue`. Without JavaScript you do not get access to the full content. Here is a working solution based on Selenium:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

url = 'https://clutch.co/it-services/msp'

# Let a real browser load the page so its JavaScript runs,
# then hand the rendered HTML to BeautifulSoup
driver.get(url=url)
soup = BeautifulSoup(driver.page_source, "lxml")

links = []
for l in soup.find_all('li', class_='website-link website-link-a'):
    results = (l.a.get('href'))
    links.append(results)

print(links, "\n", "Count links - ", len(links))
```

Result:

    ...ch.co&utm_medium=referral&utm_campaign=directory', 'https://www.turrito.com/?utm_source=clutch.co&utm_medium=referral&utm_campaign=directory']
    Count links -  50
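
The links in that result carry `utm_source`, `utm_medium` and `utm_campaign` tracking parameters. If only the bare company URLs are wanted, something along these lines could strip them. This is a sketch using only the standard library; the helper name `strip_utm` is made up for illustration, and it assumes that every parameter starting with `utm_` is safe to drop.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url):
    """Drop query parameters whose names start with 'utm_'; keep the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


example = (
    "https://www.turrito.com/"
    "?utm_source=clutch.co&utm_medium=referral&utm_campaign=directory"
)
print(strip_utm(example))  # prints https://www.turrito.com/
```

Applied to the list from the answer, this would be `[strip_utm(link) for link in links]`.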

# Answer 2
**Score**: 2

TL;DR

The big guns of selenium can't shoot the Cloudflare sheriff.

The Colab link with what's below.


All right, here's a working Selenium setup on Google Colab that proves my point from the comments: even if you get it running, you still have to deal with a Cloudflare challenge.

Do the following:

  • Open a new Colab notebook
  • Run the code below:

    %%shell
    # Ubuntu no longer distributes chromium-browser outside of snap
    #
    # Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

    # Add debian buster
    cat > /etc/apt/sources.list.d/debian.list <<'EOF'
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
    EOF

    # Add keys
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

    apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
    apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
    apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

    # Prefer debian repo for chromium* packages only
    # Note the double-blank lines between entries
    cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
    Package: *
    Pin: release a=eoan
    Pin-Priority: 500


    Package: *
    Pin: origin "deb.debian.org"
    Pin-Priority: 300


    Package: chromium*
    Pin: origin "deb.debian.org"
    Pin-Priority: 700
    EOF

    # Install chromium and chromium-driver
    apt-get update
    apt-get install chromium chromium-driver

    # Install selenium
    pip install selenium

  • Then run this code:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(options=options)

    driver.get("https://clutch.co/it-services/msp")
    print(driver.page_source)
    driver.quit()

You should see something like this (excerpt; `[...]` marks markup, inline scripts and long one-off challenge tokens left out here):

    <html lang="en-US" class="lang-en"><head>
        <title>Just a moment...</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <meta name="robots" content="noindex,nofollow">
        <meta name="viewport" content="width=device-width,initial-scale=1">
        <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">

    <script src="/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d"></script>[...]</head>
    <body class="no-js">
        <div class="main-wrapper" role="main">
        <div class="main-content">
            <h1 class="zone-name-title h1">[...]clutch.co</h1><h2 id="challenge-running" class="h2">Checking if the site connection is secure</h2>[...]
            <div id="challenge-body-text" class="core-msg spacer">clutch.co needs to review the security of your connection before proceeding.</div>[...]
            <noscript>
                <div id="challenge-error-title">
                    <div class="h2">
                        [...]
                        <span id="challenge-error-text">
                            Enable JavaScript and cookies to continue
                        </span>
                    </div>
                </div>
            </noscript>
            [...]
            <form id="challenge-form" action="/it-services/msp?__cf_chl_f_tk=[...]" method="POST" enctype="application/x-www-form-urlencoded">
                <input type="hidden" name="md" value="[...]">
            <span style="display: none;"><span class="text-gray-600" data-translate="error">error code: 1020</span></span></form>
        </div>
    </div>
    <script>
        (function(){
            window._cf_chl_opt={
                cvId: '2',
                cZone: 'clutch.co',
                cType: 'managed',
                [...]
            };
            [...]
        }());
    </script>[...]

    <div class="footer" role="contentinfo">[...]<div class="ray-id">Ray ID: <code>7daf435aeecd112d</code></div>[...]Performance &amp; security by <a rel="noopener noreferrer" href="https://www.cloudflare.com?utm_source=challenge&amp;utm_campaign=m" target="_blank">Cloudflare</a>[...]</div>[...]</body></html>

As you can see, running selenium doesn't change much.
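
To check for this in code instead of reading the dump by eye, a rough heuristic could look for a few strings that appear in the challenge page printed above. This is only a sketch: the marker list is taken from that one response and may not cover other challenge variants, and `looks_like_cloudflare_challenge` is a made-up helper name.

```python
# Strings that appear in the challenge page shown above.
CHALLENGE_MARKERS = (
    "<title>Just a moment...</title>",
    "Enable JavaScript and cookies to continue",
    "/cdn-cgi/challenge-platform/",
)


def looks_like_cloudflare_challenge(html):
    """Heuristic: True if the HTML contains any of the known challenge markers."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


challenge_sample = "<html><head><title>Just a moment...</title></head><body></body></html>"
normal_sample = "<html><head><title>Top MSPs</title></head><body><ul></ul></body></html>"

print(looks_like_cloudflare_challenge(challenge_sample))  # True
print(looks_like_cloudflare_challenge(normal_sample))     # False
```

With Selenium this would be called as `looks_like_cloudflare_challenge(driver.page_source)` before any parsing, so an empty result is not mistaken for a selector problem.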


So, my question to you is:

Why do you want to stick to Colab so badly?

Because running slightly modified code locally:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.EdgeOptions()
    options.add_argument("--window-size=1920,1080")  # width,height separated by a comma
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-extensions")
    browser = webdriver.Edge(options=options)

    browser.get("https://clutch.co/it-services/msp")

    # Give the page up to 5 seconds to get past the challenge and render the listing
    print("Waiting for download links to appear...")
    WebDriverWait(browser, 5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".infobar__counter"))
    )

    css_selector = ".directory-list div.provider-info--header .company_info a"
    download_links = [
        link.get_attribute("href") for link
        in browser.find_elements(By.CSS_SELECTOR, css_selector)
    ]

    print(download_links)

This should open a browser window (Edge this time) and return this:

    ['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor',
     'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions',
     'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions',
     'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision',
     'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft',
     'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp',
     'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis',
     'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology',
     'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies',
     'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts',
     'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd',
     'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech',
     'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants',
     'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet',
     'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo',
     'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity',
     'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0',
     'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications',
     'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich',
     'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions',
     'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium',
     'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare',
     'https://clutch.co/profile/innowise-group', 'https://clutch.co/profile/tech-superpowers-0',
     'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology',
     'https://clutch.co/profile/turrito-networks']
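
The explicit wait in that snippet is what gives the page time to get past the challenge before the links are read. Conceptually, such a wait polls a condition until it returns something truthy or a timeout is hit. The following is a standard-library sketch of that idea only, not Selenium's implementation; `wait_until`, its parameters and the simulated `page_ready` condition are hypothetical.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulated page that becomes "ready" on the third check.
attempts = {"count": 0}


def page_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print("ready after", attempts["count"], "checks")  # ready after 3 checks
```

If the element never shows up, for instance because the challenge is not passed, the wait ends in a timeout error rather than an empty list, which makes the failure easier to spot.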

On the other hand, locally, you don't even need selenium if you have cloudscraper.

For example, this:

    import cloudscraper
    from bs4 import BeautifulSoup

    scraper = cloudscraper.create_scraper()
    source = scraper.get("https://clutch.co/it-services/msp")

    css_selector = ".directory-list div.provider-info--header .company_info a"

    links = [
        f'https://clutch.co{anchor["href"]}' for anchor in
        BeautifulSoup(source.text, "html.parser").select(css_selector)
    ]
    print(links)

This should return the same list of 50 `https://clutch.co/profile/...` links as the Selenium run above, from `https://clutch.co/profile/empist` through `https://clutch.co/profile/turrito-networks`.
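
One small caveat about the f-string in the cloudscraper snippet: it assumes every `href` is site-relative (starting with `/`). If the markup ever mixes relative and absolute links, `urllib.parse.urljoin` handles both. A minimal sketch, with `absolutize` as an illustrative helper name and sample hrefs chosen for the demonstration:

```python
from urllib.parse import urljoin

BASE_URL = "https://clutch.co"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the base; absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


sample_hrefs = ["/profile/empist", "https://clutch.co/profile/sugarshot"]
print(absolutize(sample_hrefs))
# ['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot']
```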

PS. The source for the Debian magic on colab is here.

huangapple
  • Posted on June 5, 2023, 00:47:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76401453.html