Using Python's Selenium to scrape a website gives errors
Question
<details>
<summary>English:</summary>
I have a slightly unusual setup with two scripts, one in R and one in Python.
I originally used R's `RSelenium`, which worked for a while and then stopped working, so all of my original code is in R. I have now had to switch to Python and use `undetected_chromedriver` together with a few other options that Selenium offers in Python. So I have two scripts:
- The R script uses `rvest` to process the scraped data and passes information via the command line to the Python script, which runs the Selenium part.
- The Python Selenium script does the web scraping. I want to run both scripts from the terminal.

Problem:
When I enable headless mode (`RUN_HEADLESS = True`) in Python, I get the errors shown further down. Here is my code:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import random
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
import undetected_chromedriver as uc
import sys
import random
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from seleniumwire import webdriver
import re
import uuid
import datetime
from selenium.common.exceptions import TimeoutException

RUN_HEADLESS = True
PATH = '/home/matt/bin/chromedriver/chromedriver'

options = uc.ChromeOptions()
options.add_argument('--user-data-directory=/home/matt/.config/google-chrome-beta/Default')

if RUN_HEADLESS:
    options.add_argument("--headless=new")  # ('--headless')
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    # #options.add_argument("--disable-gpu")  # If running on Linux/Unix system
    # options.add_argument("--disable-extensions")
    # options.add_argument("--start-maximized")
    # # options.add_argument('--disable-javascript')
else:
    pass
```

How can I make the headless version work? When I set `RUN_HEADLESS = False` everything works, but of course the browser window is open while scraping.
Error:

```
[7238:7238:0621/230217.106355:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: El fichero ya existe (17)
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
[0621/230217.125157:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
Traceback (most recent call last):
  File "/run/media/matt/A34E-C6B8/inmobiliarioProject2/rscripts/production/collectingTheData/v2/collectIndividualPages_Comprar.py", line 151, in <module>
    driver = uc.Chrome(
             ^^^^^^^^^^
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
    super(Chrome, self).__init__(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 84, in __init__
    super().__init__(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 104, in __init__
    super().__init__(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 286, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 729, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 378, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
    self.error_handler.check_response(response)
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:35111
from chrome not reachable
Stacktrace:
#0 0x561e4749c4e3 <unknown>
#1 0x561e471cbb00 <unknown>
#2 0x561e471b9436 <unknown>
#3 0x561e471f89be <unknown>
#4 0x561e471f0884 <unknown>
#5 0x561e4722fccc <unknown>
#6 0x561e4722f47f <unknown>
#7 0x561e47226de3 <unknown>
#8 0x561e471fc2dd <unknown>
#9 0x561e471fd34e <unknown>
#10 0x561e4745c3e4 <unknown>
#11 0x561e474603d7 <unknown>
#12 0x561e4746ab20 <unknown>
#13 0x561e47461023 <unknown>
#14 0x561e4742f1aa <unknown>
#15 0x561e474856b8 <unknown>
#16 0x561e47485847 <unknown>
#17 0x561e47495243 <unknown>
#18 0x7fa7187a16ea start_thread
```

EDIT:
Using `--headless` instead of `--headless=new`, I get the following:

```
[1] "NA link. Skipping...."
[6782:6782:0622/000945.392409:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: El fichero ya existe (17)
[0622/000945.407709:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
Traceback (most recent call last):
  File "/run/media/matt/A34E-C6B8/inmobiliarioProject2/rscripts/production/collectingTheData/v2/collectIndividualPages_Comprar.py", line 151, in <module>
    driver = uc.Chrome(
             ^^^^^^^^^^
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
    super(Chrome, self).__init__(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 84, in __init__
    super().__init__(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 104, in __init__
    super().__init__(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 286, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 729, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 378, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
    self.error_handler.check_response(response)
  File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:33765
from chrome not reachable
Stacktrace:
#0 0x56307a71f4e3 <unknown>
#1 0x56307a44eb00 <unknown>
#2 0x56307a43c436 <unknown>
#3 0x56307a47b9be <unknown>
#4 0x56307a473884 <unknown>
#5 0x56307a4b2ccc <unknown>
#6 0x56307a4b247f <unknown>
#7 0x56307a4a9de3 <unknown>
#8 0x56307a47f2dd <unknown>
#9 0x56307a48034e <unknown>
#10 0x56307a6df3e4 <unknown>
#11 0x56307a6e33d7 <unknown>
#12 0x56307a6edb20 <unknown>
#13 0x56307a6e4023 <unknown>
#14 0x56307a6b21aa <unknown>
#15 0x56307a7086b8 <unknown>
#16 0x56307a708847 <unknown>
#17 0x56307a718243 <unknown>
#18 0x7ff1875e96ea start_thread
```

EDIT: Code:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import random
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
import undetected_chromedriver as uc
import sys
import random
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from seleniumwire import webdriver
import re
import uuid
import datetime
from selenium.common.exceptions import TimeoutException
import time
from seleniumbase import Driver

RUN_HEADLESS = False
PATH = '/home/matt/bin/chromedriver/chromedriver'

options = uc.ChromeOptions()
options.add_argument('--user-data-directory=/home/matt/.config/google-chrome-beta/Default')

################ New code added to fix chromedriver and chrome version mismatch ################
#options = Options()
if RUN_HEADLESS:
    options.add_argument("--headless")  # ('--headless')
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    # #options.add_argument("--disable-gpu")  # If running on Linux/Unix system
    # options.add_argument("--disable-extensions")
    # options.add_argument("--start-maximized")
    # # options.add_argument('--disable-javascript')
else:
    pass

driver = uc.Chrome(options=options)
#seleniumwire_options = seleniumwire_options  # here is where I pass random proxies to the options

link = "https://www.fotocasa.es/es/comprar/vivienda/madrid-capital/calefaccion-terraza-trastero-ascensor-no-amueblado/176698848/d?from=list"
driver.get(link)
```

This gives me:

```
TypeError: WebDriver.__init__() got an unexpected keyword argument 'executable_path'
```

</details>
# Answer 1
**Score**: 2
[SeleniumBase](https://github.com/seleniumbase/SeleniumBase) has an undetected-chromedriver mode that works with headless mode.
After `pip install seleniumbase`, run the following with `python`:

```python
import time
from seleniumbase import Driver

driver = Driver(uc_cdp=True, incognito=True, headless=True)
driver.get("https://nowsecure.nl/#relax")
time.sleep(7)
driver.get_screenshot_as_file("screenshot.png")
driver.quit()
```

(Remove the `headless` part to see what's going on.)
Here's another format that uses `pytest` command-line options to enable `uc` mode together with `headless` mode:

```python
from seleniumbase import BaseCase

if __name__ == "__main__":
    from pytest import main
    main([__file__, "--uc", "--uc-cdp", "--incognito", "-s", "--headless"])


class UndetectedTest(BaseCase):
    def test_browser_is_undetected(self):
        self.open("https://nowsecure.nl/#relax")
        self.assert_text("OH YEAH, you passed!", "h1", timeout=7.25)
        self.post_message("Selenium wasn't detected!", duration=2.8)
        self._print("\n Success! Website did not detect Selenium! ")
```

(Again, remove the `--headless` part to see what's going on.)
# Answer 2
**Score**: 0
Have you tried using `--headless` instead of `--headless=new`?
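
A further thing that may be worth checking, offered as a guess rather than a confirmed fix: in both logs the `SingletonLock` message points at `/home/matt/.config/google-chrome`, the default profile, not the `google-chrome-beta` path passed in the code. The switch Chrome recognizes is `--user-data-dir` (not `--user-data-directory`), and it expects the profile's parent directory rather than the `Default` subfolder, so the argument may simply have been ignored and Chrome may have collided with a profile that was already in use. The sketch below builds the flag list with a fresh temporary profile directory. `build_chrome_args` and `make_driver` are hypothetical helper names, and the approach is untested against the setup in the question.

```python
import tempfile


def build_chrome_args(headless, user_data_dir=None):
    """Return Chrome command-line flags for one scraping session.

    If no profile directory is given, a fresh temporary one is created so
    the session does not compete with a running browser for the profile lock.
    """
    if user_data_dir is None:
        user_data_dir = tempfile.mkdtemp(prefix="chrome-profile-")
    # Chrome's switch is --user-data-dir; it takes the directory that
    # contains the "Default" profile folder, not the folder itself.
    args = [f"--user-data-dir={user_data_dir}"]
    if headless:
        # Same flags as in the question, grouped in one place.
        args += ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]
    return args


def make_driver(headless=True):
    """Start undetected_chromedriver with the flags above (assumed usage)."""
    # Imported lazily so the helper above can be used without a browser.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for arg in build_chrome_args(headless):
        options.add_argument(arg)
    return uc.Chrome(options=options)


if __name__ == "__main__":
    # Only prints the flags; call make_driver() to actually launch Chrome.
    print(build_chrome_args(headless=True, user_data_dir="/tmp/scraper-profile"))
```

Using a throwaway profile means cookies and logins from the regular browser profile are not available. If those are needed, closing every running Chrome window before starting the script is the alternative.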