Scraping a webpage with Selenium and BeautifulSoup is taking too long

Question

I want to scrape a website and its sub-pages, but the process is taking too long. How can I optimize the requests, or is there a better alternative?

Below is the code I am using. It takes about 10 seconds just to load the Google home page, so it clearly won't scale if I feed it 280 links.

from selenium import webdriver
import time

# prepare the options for the Chrome driver
options = webdriver.ChromeOptions()
options.add_argument('headless')

# start a headless Chrome browser
browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)

start = time.time()
browser.get('http://www.google.com/xhtml')
print(time.time() - start)
browser.quit()

Answer 1

Score: 2

Use the Python requests and BeautifulSoup modules instead of a full browser:

import requests
from bs4 import BeautifulSoup

url = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"

# fetch the main page and list the links for letter A
req = requests.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])

letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

# fetch each remaining letter page and list its links
for letter in letters:
    req = requests.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    print('Letters : ' + letter)
    print([item['href'] for item in soup.select('.columns-list a[href]')])
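
Note that verify=False disables TLS certificate verification, so every request will also emit an InsecureRequestWarning. If you accept that trade-off and want quieter output, you can silence the warning (a minimal sketch using urllib3, which requests already depends on):

import urllib3

# verify=False skips certificate checks, so urllib3 warns on every request;
# this silences InsecureRequestWarning for the whole process
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)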

Answer 2

Score: 2

You can use the script from this article to speed things up; a multi-threaded crawler is much faster than fetching the pages one at a time:

https://edmundmartin.com/multi-threaded-crawler-in-python/
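
The linked article builds a small crawler class around a URL queue and a thread pool. The following is only a rough, self-contained sketch of that pattern, not the article's exact code; the attribute and method names (to_crawl, pool, scraped_pages, scrape_page, post_scrape_callback) are taken from the run_scraper snippet below, and everything else (class name, worker count, timeout) is an assumption:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty  # Empty is raised by Queue.get and handled in run_scraper below
from urllib.parse import urljoin

class MultiThreadScraper:
    def __init__(self, base_url, workers=10):
        self.base_url = base_url
        self.pool = ThreadPoolExecutor(max_workers=workers)  # worker threads doing the downloads
        self.scraped_pages = set()   # URLs already submitted for scraping
        self.to_crawl = Queue()      # frontier of URLs still to fetch
        self.to_crawl.put(base_url)

    def scrape_page(self, url):
        # download one page; return None on network errors
        try:
            return requests.get(url, timeout=30)
        except requests.RequestException:
            return None

    def post_scrape_callback(self, res):
        # called when a worker finishes: parse the page and queue any new links
        result = res.result()
        if result and result.status_code == 200:
            soup = BeautifulSoup(result.text, 'html.parser')
            for link in soup.find_all('a', href=True):
                self.to_crawl.put(urljoin(self.base_url, link['href']))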

After that, replace the run_scraper method with the code below (it assumes pandas is imported as pd and Empty is imported from queue):

def run_scraper(self):
    with open("francais-arabe-marocain.csv", 'a') as file:
        file.write("url")
        file.writelines("\n")
        for i in range(50000):
            try:
                # wait up to 10 minutes for the next URL from the queue
                target_url = self.to_crawl.get(timeout=600)
                # only follow unvisited pages that belong to the dictionary
                if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
                    self.scraped_pages.add(target_url)
                    # fetch the page on a worker thread; the callback queues new links
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
                    # append the URL to the CSV file
                    df = pd.DataFrame([{'url': target_url}])
                    df.to_csv(file, index=False, header=False)
                    print(target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

If a URL contains "francais-arabe-marocain", it is saved to the CSV file.

After that, you can read the CSV line by line and scrape each URL in a for loop in the same way, as sketched below.
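
A minimal sketch of that second pass, assuming the francais-arabe-marocain.csv file produced above and reusing the requests/BeautifulSoup approach from Answer 1 (the h1 extraction is only a placeholder for whatever data you actually want):

import csv

import requests
from bs4 import BeautifulSoup

with open("francais-arabe-marocain.csv", newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the "url" header row
    for row in reader:
        url = row[0]
        try:
            req = requests.get(url, timeout=30)
        except requests.RequestException as e:
            print(url, e)
            continue
        soup = BeautifulSoup(req.text, 'html.parser')
        # placeholder extraction: print each page's main heading
        title = soup.select_one('h1')
        print(url, title.get_text(strip=True) if title else '')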


Answer 3

Score: 0

Try using urllib, like this:

import time
import urllib.request

start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time() - start)

It took only about 2 seconds, though that also depends on the quality of your connection.
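
Since the question involves roughly 280 links, you could combine this with a thread pool so the downloads overlap instead of running one after another. A minimal sketch, where the urls list and the worker count of 20 are just placeholders:

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# placeholder list; in practice this would hold the ~280 links to scrape
urls = ["https://google.com/xhtml"] * 5

def fetch(url):
    # download one page and return its body as text
    with urllib.request.urlopen(url, timeout=30) as page:
        return page.read().decode('utf-8', errors='replace')

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    pages = list(pool.map(fetch, urls))
print(len(pages), "pages in", round(time.time() - start, 2), "seconds")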

