How to implement multithreading for web scraping?

Question

My problem seems simple, but I don't know how to solve it.

I have a program that looks like this:

def webscrape(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    data = json.loads(soup.text)  # assumes the fetched page body is JSON
    df = pd.DataFrame(data['dataProvider'])
    return df

I would like to apply this function to multiple sites:

df = pd.DataFrame()
for url in urls:
    df = pd.concat([df, webscrape(url)], axis=1)

I would like to speed up my program with multithreading, but I don't know where to start.

Answer 1

Score: 1

To speed up your program with multithreading, you can use the concurrent.futures module from Python's standard library. It provides a high-level interface for asynchronously executing callables (functions or methods) using threads or processes. Web scraping is network-bound, so threads are a good fit here: while one thread waits for a response, the others can keep working. In your case, you can use ThreadPoolExecutor from concurrent.futures to run the webscrape calls in parallel.
Here's how you can modify your code to use multithreading:

import concurrent.futures
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json

def webscrape(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    data = json.loads(soup.text)  # assumes the fetched page body is JSON
    df = pd.DataFrame(data['dataProvider'])
    return df

def scrape_all_sites(urls):
    dfs = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit the tasks to the executor and store the future objects
        futures = [executor.submit(webscrape, url) for url in urls]

        # Gather the results from the completed futures
        for future in concurrent.futures.as_completed(futures):
            df = future.result()
            dfs.append(df)

    # Concatenate all dataframes into one
    result_df = pd.concat(dfs, axis=1)
    return result_df

# Assuming you have a list of URLs called 'urls'
result_df = scrape_all_sites(urls)
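
One thing to keep in mind: as_completed yields results in the order the requests finish, not in the order of urls, so the columns of result_df may not match the order of your URL list. If the input order matters, executor.map preserves it. A minimal sketch of that variant, reusing the same imports and webscrape function as above (the name scrape_all_sites_ordered is just illustrative):

def scrape_all_sites_ordered(urls, max_workers=None):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns the results in the same order as the input urls
        dfs = list(executor.map(webscrape, urls))
    return pd.concat(dfs, axis=1)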

If you want to control the number of worker threads, here's how you can add a max_workers option to the scrape_all_sites function:

def scrape_all_sites(urls, max_workers=None):
    dfs = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Rest of the code remains unchanged
        # ...

When calling the scrape_all_sites function, you can specify the max_workers argument to determine the number of threads:

# Assuming you have a list of URLs called 'urls'
result_df = scrape_all_sites(urls, max_workers=4)  # Set max_workers as desired
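
Also note that future.result() re-raises any exception that was raised inside webscrape (for example a failed request or invalid JSON), which would abort the whole batch. If you would rather skip the URLs that fail and keep the rest, here is a sketch of the same submit/as_completed pattern with per-URL error handling (the scrape_all_sites_safe name and the print-based reporting are just illustrative):

def scrape_all_sites_safe(urls, max_workers=None):
    dfs = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so failures can be reported
        future_to_url = {executor.submit(webscrape, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                dfs.append(future.result())
            except Exception as exc:
                # Skip this URL but keep scraping the others
                print(f"{url} failed: {exc}")
    return pd.concat(dfs, axis=1) if dfs else pd.DataFrame()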
