Selenium Python Chrome: Extremely slow. Are cookies the problem?

Question

I read that Selenium Chrome can run faster if you use implicit waits, headless mode, ID and CSS selectors, etc. Before implementing those changes, I want to know whether cookies or caching could be slowing me down.

Does Selenium store cookies and cache like a normal browser, or does it reload all assets every time it navigates to a new page on a website?

If yes, then this would slow down the process of scraping websites with millions of identical profile pages, where the scripts and images are similar for each profile.

If yes, is there a way to avoid this problem? I'm interested in using cookies and cache during a session and then destroying them after the browser is closed.

Edit, more details:

import os
import seleniumwire.undetected_chromedriver as uc  # selenium-wire flavor of undetected-chromedriver (accepts seleniumwire_options)
from selenium_stealth import stealth

# pString, dFolder, s1, s2, headless, agent and navigate() are defined elsewhere
sel_options = {'proxy': {'https': pString}}
options = uc.ChromeOptions()
prefs = {'download.default_directory': dFolder}
options.add_experimental_option('prefs', prefs)
blocker = os.path.join(os.getcwd(), "extension_iijehicfndmapfeoplkdpinnaicikehn")
options.add_argument(f"--load-extension={blocker}")  # content-blocker extension
options.add_argument(f"--window-size={s1},{s2}")
if headless == "yes":
    options.add_argument("--headless")
driver = uc.Chrome(seleniumwire_options=sel_options, options=options, use_subprocess=True, version_main=109)
stealth(driver, languages=["en-US", "en"], vendor="Google Inc.", platform="Win32",
        webgl_vendor="Intel Inc.", renderer="Intel Iris OpenGL Engine", fix_hairline=True)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": agent})
navigate("https://linkedin.com")

I don't think my proxy or extension is the culprit, because I have a similar automation app running with no speed issues.

Answer 1

Score: 1

By default, Selenium WebDriver may not enable caching or store cookies the way a normal browser does, and yes, I do think it is worthwhile to use session caching instead of re-downloading the same files on every scraping cron-job run.

Cache (Selenium/Chrome) with a static browser profile:

You can create a Chrome profile with caching enabled and then use that profile with Selenium. This way, the browser will use the same profile directory to store all of your files across scraping runs, so you won't lose cached files between runs.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--user-data-dir=/path/to/profile/folder")  # persistent Chrome profile (cookies, local storage)
options.add_argument("--disk-cache-dir=/path/to/cache/folder")   # persistent disk cache for page assets
driver = webdriver.Chrome(options=options)

Replace /path/to/profile/folder with the directory where you want to store the profile data, and /path/to/cache/folder with the directory where you want to store the cache. This way, the browser will use the cache during subsequent page loads within the same session.
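
Since the question asks for cache and cookies that live only for the duration of one session, here is a minimal sketch (my illustration, not part of the answer above) that points --user-data-dir at a temporary directory which is deleted when the run finishes:

import tempfile

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Throwaway profile: cookies and cache persist for this session only; the
# directory (and everything in it) is removed when the context manager exits.
with tempfile.TemporaryDirectory() as profile_dir:
    options = Options()
    options.add_argument(f"--user-data-dir={profile_dir}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.linkedin.com")
        # ... scrape profile pages; repeated scripts/images come from the warm cache ...
    finally:
        driver.quit()  # quit Chrome before the temp directory is deleted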

It is also good to know that the Python Selenium library provides methods to manage cookies. You can use driver.get_cookies() to retrieve cookies and driver.add_cookie(cookie) to add a cookie to the current session.
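
For illustration, a minimal sketch (my own; the cookies.pkl filename is arbitrary) of saving cookies at the end of one session and restoring them in the next:

import pickle

# End of session A: persist the session cookies to disk.
with open("cookies.pkl", "wb") as f:
    pickle.dump(driver.get_cookies(), f)

# Start of session B: cookies can only be set for the current domain,
# so load the site first, then restore the cookies and refresh.
driver.get("https://www.linkedin.com")
with open("cookies.pkl", "rb") as f:
    for cookie in pickle.load(f):
        driver.add_cookie(cookie)
driver.refresh()  # reload so the restored cookies take effect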

Answer 2

Score: 1

By default, the chromedriver used by Selenium interacts with the cookies provided by the website linkedIn.com which you are scraping. These cookies are only maintained by chromedriver until you close the driver session. When you launch a new instance of chromedriver, the old cookies no longer exist, and linkedIn.com will provide a fresh batch of cookies to the new chromedriver session.

In my opinion the cookies and caching from an active session aren’t causing your performance issues when scraping linkedIn.com.

You stated in your question that you are scraping millions of identical profile pages on linkedIn.com. Scraping this many profiles is very time-consuming even for an automated Selenium session. The most likely culprit of your performance issue is rate-limiting imposed by linkedIn.com.

I would assume that you are receiving an HTTP 429 Too Many Requests response status code when your scraping activities become slow. You would need to do some debugging to determine this.

Here is some code that you can modify to get the HTTP status code:

import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chrome_options = Options()
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")
chrome_options.add_argument('--ignore-ssl-errors')
chrome_options.add_argument('--ignore-certificate-errors')

# disable the banner "Chrome is being controlled by automated test software"
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])

# enable the Chrome performance log so network responses can be inspected
capabilities = DesiredCapabilities().CHROME
capabilities['goog:loggingPrefs'] = {'performance': 'ALL'}

# note: executable_path/desired_capabilities are the pre-Selenium-4.10 API;
# newer Selenium versions use options.set_capability() and locate the driver themselves
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options, desired_capabilities=capabilities)

driver.get('https://www.linkedin.com')

perfLog = driver.get_log('performance')
for entry in perfLog:  # parse the Chrome performance log
    logMessage = json.loads(entry["message"])["message"]
    if logMessage["method"] == "Network.responseReceived":  # keep only HTTP responses
        print(logMessage["params"]["response"])  # the "status" field holds the HTTP status code

While checking the status code for my connection, I noted that linkedIn.com is also using Cloudflare for protection. This is going to create a major issue when scraping millions of identical profile pages on linkedIn.com: you will likely get blocked, or your proxy addresses will get blacklisted.

Your current code will require redesigning to handle rate-limiting, 429 status codes, and any Cloudflare hurdles.
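
As a starting point for that redesign (my illustration, not part of the answer above), detection could be built on the same performance-log parsing; the got_rate_limited helper is hypothetical, and the fixed sleep is a placeholder for real exponential backoff:

import json
import time

def got_rate_limited(driver):
    # Hypothetical helper: scan the performance log for any HTTP 429 response.
    # Assumes the 'goog:loggingPrefs' performance capability from the snippet
    # above has been enabled on this driver.
    for entry in driver.get_log('performance'):
        message = json.loads(entry["message"])["message"]
        if message["method"] == "Network.responseReceived":
            if message["params"]["response"]["status"] == 429:
                return True
    return False

driver.get('https://www.linkedin.com')
if got_rate_limited(driver):
    time.sleep(60)  # placeholder: back off before retrying; prefer exponential backoff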
