Using Python's Selenium to scrape a website gives errors

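# Question

I have a slightly unusual setup: two scripts, one in R and one in Python.

I originally used R's `RSelenium`, which worked and then stopped working, so all of my original code is in R. I have now had to switch to Python and use `undetected_chromedriver` plus a few other options that Selenium offers in Python. So I have two scripts:

- The R script uses `rvest` to process the scraped data and passes information to the Python script via the command line; the Python script runs the Selenium part.
- The Python Selenium script does the web scraping. I want to run both scripts from the terminal.

Problem: when I set `headless = True` in Python I get the errors listed here (the full code and output are in the original post that follows). How can I make headless mode work? With `headless = False` everything works, but the browser opens while scraping.

Error messages (excerpt):

- [7238:7238:0621/230217.106355:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: file already exists (17)
- MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
- [0621/230217.125157:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
- Most likely you need to configure your SUID sandbox correctly
- selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:35111
- chrome not reachable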
<details>
<summary>English original:</summary>

I have a slightly strange setup: two scripts, one in Python and one in R.

I was originally using R's `RSelenium`, which worked but then stopped working, so my original code was all in R. Now I have had to switch to Python and use `undetected_chromedriver` and a few other options that Selenium has available in Python. So I have 2 scripts:

- The R script uses `rvest` to process the web-scraped data and sends the info via the command line to the Python script, which runs the Selenium part.
- The Python Selenium script does the web scraping. I want to run the scripts via the terminal.

Problem:

When I set `headless = True` in Python I get all the errors. Here is my code:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    import random
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import ElementClickInterceptedException
    import undetected_chromedriver as uc
    import sys
    import random
    import pandas as pd
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service as ChromeService
    from seleniumwire import webdriver
    import re
    import uuid
    import datetime
    from selenium.common.exceptions import TimeoutException

    RUN_HEADLESS = True
    PATH = '/home/matt/bin/chromedriver/chromedriver'

    options = uc.ChromeOptions()
    options.add_argument('--user-data-directory=/home/matt/.config/google-chrome-beta/Default')

    if RUN_HEADLESS:
        options.add_argument("--headless=new")  # ('--headless')
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # #options.add_argument("--disable-gpu")  # If running on Linux/Unix system
        # options.add_argument("--disable-extensions")
        # options.add_argument("--start-maximized")
        # # options.add_argument('--disable-javascript')
    else:
        pass

How can I make the headless version work? When I set `headless = False` everything works, but of course the browser is open while scraping.

Error:

    [7238:7238:0621/230217.106355:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: El fichero ya existe (17)
    MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
    [0621/230217.125157:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
    Most likely you need to configure your SUID sandbox correctly
    Traceback (most recent call last):
      File "/run/media/matt/A34E-C6B8/inmobiliarioProject2/rscripts/production/collectingTheData/v2/collectIndividualPages_Comprar.py", line 151, in <module>
        driver = uc.Chrome(
                 ^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
        super(Chrome, self).__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 84, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 104, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 286, in __init__
        self.start_session(capabilities, browser_profile)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 729, in start_session
        super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 378, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
        self.error_handler.check_response(response)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:35111
    from chrome not reachable
    Stacktrace:
    #0 0x561e4749c4e3 <unknown>
    #1 0x561e471cbb00 <unknown>
    #2 0x561e471b9436 <unknown>
    #3 0x561e471f89be <unknown>
    #4 0x561e471f0884 <unknown>
    #5 0x561e4722fccc <unknown>
    #6 0x561e4722f47f <unknown>
    #7 0x561e47226de3 <unknown>
    #8 0x561e471fc2dd <unknown>
    #9 0x561e471fd34e <unknown>
    #10 0x561e4745c3e4 <unknown>
    #11 0x561e474603d7 <unknown>
    #12 0x561e4746ab20 <unknown>
    #13 0x561e47461023 <unknown>
    #14 0x561e4742f1aa <unknown>
    #15 0x561e474856b8 <unknown>
    #16 0x561e47485847 <unknown>
    #17 0x561e47495243 <unknown>
    #18 0x7fa7187a16ea start_thread

EDIT:

Using `--headless` instead of `--headless=new`, I get the following:

    [1] "NA link. Skipping...."
    [6782:6782:0622/000945.392409:ERROR:process_singleton_posix.cc(334)] Failed to create /home/matt/.config/google-chrome/SingletonLock: El fichero ya existe (17)
    [0622/000945.407709:ERROR:nacl_helper_linux.cc(355)] NaCl helper process running without a sandbox!
    Most likely you need to configure your SUID sandbox correctly
    MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
    Traceback (most recent call last):
      File "/run/media/matt/A34E-C6B8/inmobiliarioProject2/rscripts/production/collectingTheData/v2/collectIndividualPages_Comprar.py", line 151, in <module>
        driver = uc.Chrome(
                 ^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
        super(Chrome, self).__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 84, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 104, in __init__
        super().__init__(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 286, in __init__
        self.start_session(capabilities, browser_profile)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 729, in start_session
        super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 378, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
        self.error_handler.check_response(response)
      File "/home/matt/.asdf/installs/python/3.11.3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:33765
    from chrome not reachable
    Stacktrace:
    #0 0x56307a71f4e3 <unknown>
    #1 0x56307a44eb00 <unknown>
    #2 0x56307a43c436 <unknown>
    #3 0x56307a47b9be <unknown>
    #4 0x56307a473884 <unknown>
    #5 0x56307a4b2ccc <unknown>
    #6 0x56307a4b247f <unknown>
    #7 0x56307a4a9de3 <unknown>
    #8 0x56307a47f2dd <unknown>
    #9 0x56307a48034e <unknown>
    #10 0x56307a6df3e4 <unknown>
    #11 0x56307a6e33d7 <unknown>
    #12 0x56307a6edb20 <unknown>
    #13 0x56307a6e4023 <unknown>
    #14 0x56307a6b21aa <unknown>
    #15 0x56307a7086b8 <unknown>
    #16 0x56307a708847 <unknown>
    #17 0x56307a718243 <unknown>
    #18 0x7ff1875e96ea start_thread

EDIT: Code:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    import random
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import ElementClickInterceptedException
    import undetected_chromedriver as uc
    import sys
    import random
    import pandas as pd
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service as ChromeService
    from seleniumwire import webdriver
    import re
    import uuid
    import datetime
    from selenium.common.exceptions import TimeoutException
    import time
    from seleniumbase import Driver

    RUN_HEADLESS = False
    PATH = '/home/matt/bin/chromedriver/chromedriver'

    options = uc.ChromeOptions()
    options.add_argument('--user-data-directory=/home/matt/.config/google-chrome-beta/Default')

    ################ New code added to fix chromedriver and chrome version mismatch ################
    #options = Options()
    if RUN_HEADLESS:
        options.add_argument("--headless")  # ('--headless')
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # #options.add_argument("--disable-gpu")  # If running on Linux/Unix system
        # options.add_argument("--disable-extensions")
        # options.add_argument("--start-maximized")
        # # options.add_argument('--disable-javascript')
    else:
        pass

    driver = uc.Chrome(options=options)
    #seleniumwire_options = seleniumwire_options # here is where I pass random proxies to the options

    link = "https://www.fotocasa.es/es/comprar/vivienda/madrid-capital/calefaccion-terraza-trastero-ascensor-no-amueblado/176698848/d?from=list"
    driver.get(link)

This gives me:

    TypeError: WebDriver.__init__() got an unexpected keyword argument 'executable_path'

</details>
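
The question describes an R script handing information to the Python script over the command line. A minimal sketch of the receiving side is shown here, assuming the R script passes the page URL as a positional argument and an optional `--headless` switch. The function name `parse_cli` and both argument names are hypothetical and do not come from the original post.

```python
import argparse


def parse_cli(argv=None):
    """Parse the arguments an external caller (for example an R script) passes in."""
    parser = argparse.ArgumentParser(description="Scrape one listing page.")
    # Hypothetical positional argument: the page the caller wants scraped.
    parser.add_argument("link", help="URL of the page to scrape")
    # Hypothetical switch standing in for the hard-coded RUN_HEADLESS constant.
    parser.add_argument(
        "--headless",
        action="store_true",
        help="run the browser without a visible window",
    )
    # argv=None makes argparse read sys.argv[1:], which is what a terminal call needs.
    return parser.parse_args(argv)
```

Called from a terminal as, say, `python scrape.py --headless "<url>"` (the script name is also an assumption), `parse_cli()` returns an object whose `link` and `headless` attributes could replace the hard-coded `link` and `RUN_HEADLESS` values in the question's script.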
# Answer 1

**Score**: 2

[SeleniumBase](https://github.com/seleniumbase/SeleniumBase) has an undetected-chromedriver mode that works with headless mode.

After `pip install seleniumbase`, run the following with `python`:

```python
import time
from seleniumbase import Driver

driver = Driver(uc_cdp=True, incognito=True, headless=True)
driver.get("https://nowsecure.nl/#relax")
time.sleep(7)
driver.get_screenshot_as_file("screenshot.png")
driver.quit()
```

(Remove the headless part to see what's going on.)

Here's another format that uses `pytest` command-line options to enable uc mode with headless mode:

```python
from seleniumbase import BaseCase

if __name__ == "__main__":
    from pytest import main
    main([__file__, "--uc", "--uc-cdp", "--incognito", "-s", "--headless"])


class UndetectedTest(BaseCase):
    def test_browser_is_undetected(self):
        self.open("https://nowsecure.nl/#relax")
        self.assert_text("OH YEAH, you passed!", "h1", timeout=7.25)
        self.post_message("Selenium wasn't detected!", duration=2.8)
        self._print("\n Success! Website did not detect Selenium! ")
```

(Again, remove the `--headless` part to see what's going on.)

# Answer 2

**Score**: 0

Have you tried using `--headless` instead of `--headless=new`?
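
To make it easy to try both headless variants, the switches can be assembled in one place. The sketch below is only an illustration: the helper name `build_chrome_args` and its parameters are hypothetical. It also reflects an assumption that is not confirmed anywhere in the thread: the `SingletonLock` line in the log points at `/home/matt/.config/google-chrome`, which may mean Chrome fell back to its default profile directory while another Chrome instance was using it. Chrome's switch for selecting a profile directory is `--user-data-dir`, whereas the question passes `--user-data-directory`.

```python
import tempfile


def build_chrome_args(headless_mode=None, user_data_dir=None):
    """Return a list of Chrome command-line switches.

    headless_mode: None (visible browser), "old" (--headless) or "new" (--headless=new).
    user_data_dir: profile directory; a fresh temporary one is created if omitted.
    """
    if headless_mode not in (None, "old", "new"):
        raise ValueError(f"unknown headless mode: {headless_mode!r}")
    if user_data_dir is None:
        # Assumption: a throwaway profile avoids clashing with a running Chrome.
        user_data_dir = tempfile.mkdtemp(prefix="chrome-profile-")
    args = [f"--user-data-dir={user_data_dir}"]
    if headless_mode is not None:
        args.append("--headless=new" if headless_mode == "new" else "--headless")
        # These two switches are the ones used in the question's headless branch.
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    return args
```

Each returned string could then be passed to `options.add_argument(...)` on the `uc.ChromeOptions()` object from the question, so switching between `"old"` and `"new"` becomes a one-argument change. Whether either mode resolves the connection error is something only a test on the affected machine can show.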

huangapple
  • Posted on 2023-06-22 05:16:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76527192.html