Using Python's Selenium to scrape some websites gives errors

Question


<details>
<summary>English:</summary>

I have a slightly unusual setup: two scripts, one in R and one in Python.

I was originally using R's `RSelenium`, which worked and then stopped working, so my original code was all in R. I have now had to switch to Python and use `undetected_chromedriver`, along with a few other options that Selenium offers in Python. So I have two scripts:

- The R script uses `rvest` to process the scraped data and passes the information via the command line to the Python script, which runs the Selenium part.
- The Python Selenium script does the web scraping. I want to run both scripts from the terminal.
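
The hand-off described in the list above, where the R script passes information to the Python script on the command line, could look roughly like this on the Python side. This is only a sketch: the flag names `--url` and `--headless` are hypothetical and do not come from the original scripts.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the R script passes on the command line."""
    parser = argparse.ArgumentParser(description="Scrape one listing page.")
    # Hypothetical flag: the page the R script wants scraped.
    parser.add_argument("--url", required=True, help="Page to scrape")
    # Hypothetical flag: run Chrome without a visible window.
    parser.add_argument("--headless", action="store_true",
                        help="Run Chrome without a visible window")
    return parser.parse_args(argv)
```

On the R side, a call such as `system2("python3", c("scrape.py", "--url", link, "--headless"))` would then start the scraper, where `scrape.py` is a placeholder file name. Keeping the headless switch as a command-line flag means the R script can toggle it without editing the Python code.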

Problem:

When I set `headless = True` in Python (`RUN_HEADLESS` in the code below), I get the errors shown further down. Here is my code:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    import random
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import ElementClickInterceptedException
    import undetected_chromedriver as uc
    import sys
    import random
    import pandas as pd
    
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.options import Options
    
    from selenium.webdriver.chrome.service import Service as ChromeService
    from seleniumwire import webdriver
    import re
    import uuid
    import datetime
    
    from selenium.common.exceptions import TimeoutException
    
  
    
    RUN_HEADLESS = True
    
    PATH = '/home/matt/bin/chromedriver/chromedriver'
    
    options = uc.ChromeOptions()
    options.add_argument('--user-data-directory=/home/matt/.config/google-chrome-beta/Default')
    

    if RUN_HEADLESS:
        options.add_argument("--headless=new") # ('--headless')
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # #options.add_argument("--disable-gpu")  # If running on Linux/Unix system
        # options.add_argument("--disable-extensions")
        # options.add_argument("--start-maximized") 
        # # options.add_argument('--disable-javascript')
    else:
        pass

How can I make the headless version work? When I set `headless = False`, everything works, but of course the browser window is open while scraping.

Error:

    [7238:7238:0621/230217.106355:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: El fichero ya existe (17)
    MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
    
    [0621/230217.125157:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
    Most likely you need to configure your SUID sandbox correctly
    Traceback (most recent call last):
      File "/run/media/matt/A34E-C6B8/inmobiliarioProject2/rscripts/production/collectingTheData/v2/collectIndividualPages_Comprar.py", line 151, in <module>
        driver = uc.Chrome(
                 ^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
        super(Chrome, self).__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 84, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 104, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 286, in __init__
        self.start_session(capabilities, browser_profile)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 729, in start_session
        super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 378, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
        self.error_handler.check_response(response)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:35111
    from chrome not reachable
    Stacktrace:
    #0 0x561e4749c4e3 <unknown>
    #1 0x561e471cbb00 <unknown>
    #2 0x561e471b9436 <unknown>
    #3 0x561e471f89be <unknown>
    #4 0x561e471f0884 <unknown>
    #5 0x561e4722fccc <unknown>
    #6 0x561e4722f47f <unknown>
    #7 0x561e47226de3 <unknown>
    #8 0x561e471fc2dd <unknown>
    #9 0x561e471fd34e <unknown>
    #10 0x561e4745c3e4 <unknown>
    #11 0x561e474603d7 <unknown>
    #12 0x561e4746ab20 <unknown>
    #13 0x561e47461023 <unknown>
    #14 0x561e4742f1aa <unknown>
    #15 0x561e474856b8 <unknown>
    #16 0x561e47485847 <unknown>
    #17 0x561e47495243 <unknown>
    #18 0x7fa7187a16ea start_thread
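
One possible reading of the first log line (an assumption, not something the thread confirms) is that Chrome fell back to the default `~/.config/google-chrome` profile and found it locked by another running Chrome instance. The Chrome switch for choosing a profile directory is `--user-data-dir`; the `--user-data-directory` spelling used in the question may simply be ignored. The sketch below uses only the standard library: it checks for the `SingletonLock` entry and builds the argument list with the corrected switch and a throwaway profile. The helper names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def profile_is_locked(profile_dir):
    """Return True if a SingletonLock entry exists in profile_dir.

    On Linux the lock is a symlink that may dangle, so lexists() is used
    instead of exists().
    """
    return os.path.lexists(Path(profile_dir) / "SingletonLock")


def build_chrome_args(profile_dir, headless):
    """Build the Chrome command-line switches as a plain list of strings."""
    args = [f"--user-data-dir={profile_dir}"]  # "-dir", not "-directory"
    if headless:
        args += ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]
    return args


# Assumption: a fresh scratch profile avoids clashing with a running Chrome.
scratch_profile = tempfile.mkdtemp(prefix="chrome-profile-")
chrome_args = build_chrome_args(scratch_profile, headless=True)
```

Each entry of `chrome_args` would then be passed to `options.add_argument(...)` exactly as in the question's code. A scratch profile loses any cookies or logins stored in the real one, so this is a diagnostic step rather than a drop-in fix.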



EDIT:

Using `--headless` instead of `--headless=new`, I get the following:



    [1] "NA link. Skipping...."
    [6782:6782:0622/000945.392409:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: El fichero ya existe (17)
    [0622/000945.407709:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
    Most likely you need to configure your SUID sandbox correctly
    MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
    
    Traceback (most recent call last):
      File "/run/media/matt/A34E-C6B8/inmobiliarioProject2/rscripts/production/collectingTheData/v2/collectIndividualPages_Comprar.py", line 151, in <module>
        driver = uc.Chrome(
                 ^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
        super(Chrome, self).__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 84, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 104, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 286, in __init__
        self.start_session(capabilities, browser_profile)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 729, in start_session
        super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 378, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
        self.error_handler.check_response(response)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:33765
    from chrome not reachable
    Stacktrace:
    #0 0x56307a71f4e3 <unknown>
    #1 0x56307a44eb00 <unknown>
    #2 0x56307a43c436 <unknown>
    #3 0x56307a47b9be <unknown>
    #4 0x56307a473884 <unknown>
    #5 0x56307a4b2ccc <unknown>
    #6 0x56307a4b247f <unknown>
    #7 0x56307a4a9de3 <unknown>
    #8 0x56307a47f2dd <unknown>
    #9 0x56307a48034e <unknown>
    #10 0x56307a6df3e4 <unknown>
    #11 0x56307a6e33d7 <unknown>
    #12 0x56307a6edb20 <unknown>
    #13 0x56307a6e4023 <unknown>
    #14 0x56307a6b21aa <unknown>
    #15 0x56307a7086b8 <unknown>
    #16 0x56307a708847 <unknown>
    #17 0x56307a718243 <unknown>
    #18 0x7ff1875e96ea start_thread




EDIT: Code:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    import random
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import ElementClickInterceptedException
    import undetected_chromedriver as uc
    import sys
    import random
    import pandas as pd
    
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.options import Options
    
    from selenium.webdriver.chrome.service import Service as ChromeService
    from seleniumwire import webdriver
    import re
    import uuid
    import datetime
    
    from selenium.common.exceptions import TimeoutException
    
    import time
    from seleniumbase import Driver
    
    
    
    
    RUN_HEADLESS = False
    
    PATH = '/home/matt/bin/chromedriver/chromedriver'
    
    options = uc.ChromeOptions()
    options.add_argument('--user-data-directory=/home/matt/.config/google-chrome-beta/Default')
    
    ################ New code added to fix chromedriver and chrome version mismatch ################
    #options = Options()
    if RUN_HEADLESS:
        options.add_argument("--headless") # ('--headless')
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # #options.add_argument("--disable-gpu")  # If running on Linux/Unix system
        # options.add_argument("--disable-extensions")
        # options.add_argument("--start-maximized") 
        # # options.add_argument('--disable-javascript')
    else:
        pass
      
      
    driver = uc.Chrome(options=options)
      #seleniumwire_options = seleniumwire_options # here is where I pass random proxies to the options
      
      
    link = "https://www.fotocasa.es/es/comprar/vivienda/madrid-capital/calefaccion-terraza-trastero-ascensor-no-amueblado/176698848/d?from=list"
    driver.get(link)


This gives me:

    TypeError: WebDriver.__init__() got an unexpected keyword argument 'executable_path'
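
A `TypeError` about `executable_path` often points to a version mismatch: to the best of my knowledge, Selenium dropped that keyword argument in 4.10, and older `undetected_chromedriver` releases still passed it. The sketch below checks the installed versions for that combination. The thresholds (Selenium 4.10, `undetected-chromedriver` 3.5) are assumptions to verify against the packages' changelogs, and the helper names are hypothetical.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '4.10.0' into a tuple like (4, 10, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def likely_executable_path_clash(selenium_version, uc_version):
    """Guess whether this pair of versions triggers the executable_path error.

    Assumed thresholds: Selenium >= 4.10 with undetected-chromedriver < 3.5.
    """
    return (parse_version(selenium_version) >= (4, 10)
            and parse_version(uc_version) < (3, 5))
```

Calling `installed_version("selenium")` and `installed_version("undetected-chromedriver")` gives the two strings to feed into `likely_executable_path_clash`. If it returns `True`, upgrading `undetected-chromedriver` or pinning Selenium below 4.10 are the two directions worth trying.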

</details>


# Answer 1
**Score**: 2

[SeleniumBase](https://github.com/seleniumbase/SeleniumBase) has an undetected-chromedriver mode that works with headless mode.

After `pip install seleniumbase`, run the following with `python`:

```python
import time
from seleniumbase import Driver

# uc_cdp enables the undetected-chromedriver mode; headless hides the window.
driver = Driver(uc_cdp=True, incognito=True, headless=True)
driver.get("https://nowsecure.nl/#relax")
time.sleep(7)
driver.get_screenshot_as_file("screenshot.png")
driver.quit()
```

(Remove the `headless` part to see what's going on.)

Here's another format that uses `pytest` command-line options to enable uc mode with headless mode:

```python
from seleniumbase import BaseCase

if __name__ == "__main__":
    from pytest import main
    main([__file__, "--uc", "--uc-cdp", "--incognito", "-s", "--headless"])

class UndetectedTest(BaseCase):
    def test_browser_is_undetected(self):
        self.open("https://nowsecure.nl/#relax")
        self.assert_text("OH YEAH, you passed!", "h1", timeout=7.25)
        self.post_message("Selenium wasn't detected!", duration=2.8)
        self._print("\n Success! Website did not detect Selenium! ")
```

(Again, remove the `--headless` part to see what's going on.)

# Answer 2
**Score**: 0

Have you tried using `--headless` instead of `--headless=new`?
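
For context on the two spellings: as far as I know, `--headless=new` selects Chrome's newer headless implementation and is understood from Chrome 109 onward, while older builds only accept plain `--headless`. A minimal sketch of choosing between them follows; the function name is hypothetical and the version cut-off is an assumption. The EDIT in the question reports the same failure with both spellings, so this is unlikely to be the root cause there.

```python
def headless_switch(chrome_major_version):
    """Pick a headless switch for the given Chrome major version."""
    # Assumption: --headless=new is recognised from Chrome 109 onward;
    # older builds only know the plain --headless switch.
    if chrome_major_version >= 109:
        return "--headless=new"
    return "--headless"
```

The returned string would be passed to `options.add_argument(...)` in place of the hard-coded switch.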

huangapple
  • Published on June 22, 2023, 05:16:18
  • When reposting, please keep this link: https://go.coder-hub.com/76527192.html