Scraping a webpage taking too long using Selenium, BeautifulSoup
Question
I want to scrape a website and its sub-pages, but it is taking too long. How can I optimize the requests, or is there an alternative solution?
Below is the code I am using. It takes 10 s just to load the Google home page, so it is clearly not scalable if I were to give it 280 links.
from selenium import webdriver
import time
# prepare the option for the chrome driver
options = webdriver.ChromeOptions()
options.add_argument('headless')
# start chrome browser
browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver" ,chrome_options=options)
start=time.time()
browser.get('http://www.google.com/xhtml')
print(time.time()-start)
browser.quit()
Answer 1
Score: 2
Use the python requests and Beautiful Soup modules.
import requests
from bs4 import BeautifulSoup

url = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"

req = requests.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])

letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
for letter in letters:
    req = requests.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    print('Letters : ' + letter)
    print([item['href'] for item in soup.select('.columns-list a[href]')])
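Since the question mentions around 280 links, it may also help to reuse a single requests Session so the connection to the site is kept alive between requests instead of being re-established every time. A minimal sketch, assuming the same URL pattern and selector as the code above (the shortened letter list is only for illustration):

import requests
from bs4 import BeautifulSoup

url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"
letters = ['B', 'C', 'D']  # shortened list for illustration

# a Session keeps the underlying HTTP connection alive between requests,
# so repeated requests to the same host skip the connection setup each time
with requests.Session() as session:
    for letter in letters:
        req = session.get(url1.format(letter), verify=False)
        soup = BeautifulSoup(req.text, 'html.parser')
        print('Letters : ' + letter)
        print([item['href'] for item in soup.select('.columns-list a[href]')])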
Answer 2
Score: 2
You can use the script below for speed; a multithreaded crawler beats everything else:
https://edmundmartin.com/multi-threaded-crawler-in-python/
After that, you need to change this code:
def run_scraper(self):
    # requires pandas imported as pd and Empty from the queue module;
    # the rest of the class comes from the linked article
    with open("francais-arabe-marocain.csv", 'a') as file:
        file.write("url")
        file.writelines("\n")
        for i in range(50000):
            try:
                target_url = self.to_crawl.get(timeout=600)
                if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
                    df = pd.DataFrame([{'url': target_url}])
                    df.to_csv(file, index=False, header=False)
                    print(target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue
If a URL contains "francais-arabe-marocain", it is saved to a CSV file. After that, you can scrape those URLs in a for loop, reading the CSV line by line in the same way.
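For illustration, here is a minimal sketch of that last step; it is not taken from the linked article, and the fetch helper and worker count are placeholders. It reads the URLs back from the francais-arabe-marocain.csv file produced above and fetches several of them in parallel with a thread pool:

import csv
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch(url):
    # download one page and return its title text (or the error message)
    try:
        resp = requests.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        return url, soup.title.string if soup.title else ''
    except Exception as e:
        return url, str(e)

# read back the URLs collected by run_scraper, skipping the "url" header line
with open("francais-arabe-marocain.csv") as f:
    urls = [row[0] for row in csv.reader(f) if row and row[0] != "url"]

# fetch several pages at a time instead of one after another
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, title in pool.map(fetch, urls):
        print(url, title)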
Answer 3
Score: 0
Try using urllib, like this:
import time
import urllib.request

start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time() - start)
It took only 2 s. However, it also depends on the quality of your connection.
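The response object returned by urlopen can be passed straight to BeautifulSoup, so the extraction from the other answers still works on top of it. A small sketch under that assumption, using a generic link selector rather than anything specific to the question's site:

import time
import urllib.request

from bs4 import BeautifulSoup

start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
# BeautifulSoup accepts the file-like response object directly
soup = BeautifulSoup(page, 'html.parser')
print([a['href'] for a in soup.select('a[href]')])
print(time.time() - start)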