June 19, 2023 01:17:08

Parallelize a function that fills missing values from duplicates in a pandas DataFrame

Question

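I have a product DataFrame with 1,838,379 rows containing a description, an image URL, an EAN, and a product name. The product names in this dataset contain duplicates, and I am trying to fill the NaN values in description, image_url, and ean with the values from other rows that share the same product name, so I implemented this function: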

def fill_descriptions_images_ean_from_duplicates(row, train):
    import pandas as pd
    duplicated_rows = train.loc[train['product_name'] == row["product_name"]]
    if not duplicated_rows.empty:
        descriptions = duplicated_rows["description"].dropna()
        if not descriptions.empty:
            description = list(descriptions)[0]
            train.loc[train['product_name'] == row["product_name"], 'description',] = train.loc[train['product_name'] ==  row["product_name"], 'description'].fillna(description)

        images = duplicated_rows["image_url"].dropna()
        if not images.empty:
            image = list(images)[0]
            train.loc[train['product_name'] == row["product_name"], 'image_url',] = train.loc[train['product_name'] ==  row["product_name"], 'image_url'].fillna(image)

        eans = duplicated_rows["ean"].dropna()
        if not eans.empty:
            ean = list(eans)[0]
            train.loc[train['product_name'] == row["product_name"], 'ean',] = train.loc[train['product_name'] ==  row["product_name"], 'ean'].fillna(ean)

When I use apply, it takes forever to execute, so I tried pandarallel, but pandarallel doesn't support the lambda function and tells me that fill_descriptions_images_ean_from_duplicates is not defined:

from pandarallel import pandarallel
import psutil

psutil.cpu_count(logical=False)

pandarallel.initialize()
train.parallel_apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)

So I tried dask, but nothing happened there either; the progress bar is stuck:

def process_partition(df_partition, train):
    df_partition.apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)
    return df_partition

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
dask_train = dd.from_pandas(train, npartitions=7)
dask_df_applied = dask_train.map_partitions(lambda row: process_partition(row, train), meta=train.dtypes)
with ProgressBar():
    train = dask_df_applied.compute()

Sample data:

import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random data
data = {
    'product_name': ['Product A', 'Product B', 'Product B', 'Product C', 'Product D'] * 20,
    'description': np.random.choice([np.nan, 'Description'], size=100),
    'image_url': np.random.choice([np.nan, 'image_url'], size=100),
    'ean': np.random.choice([np.nan, 'EAN123456'], size=100)
}

# Create the DataFrame
train = pd.DataFrame(data)



Answer 1

Score: 0

This is the best approach I could find; it reduces the run time to 15 minutes.

train['description'] = train.groupby('product_name')['description'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
train['image_url'] = train.groupby('product_name')['image_url'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
train['ean'] = train.groupby('product_name')['ean'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
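
The lambda above runs as Python code once per group, which is likely where most of the 15 minutes goes. A possible variant, sketched below, swaps the lambda for the built-in "first" aggregation, which returns the first non-null value in each group, and then fills the gaps in a single call. The helper name fill_from_duplicates and the five-row sample are illustrative assumptions, not part of the original post, and the speedup is not measured here.

```python
import numpy as np
import pandas as pd


def fill_from_duplicates(df, key="product_name",
                         cols=("description", "image_url", "ean")):
    """Fill NaNs in `cols` with the first non-null value found among
    rows sharing the same `key`. Hypothetical helper for illustration."""
    cols = list(cols)
    # "first" yields the first non-null value per group; transform
    # broadcasts it back to the shape of the original frame.
    first_values = df.groupby(key)[cols].transform("first")
    out = df.copy()
    # fillna with a DataFrame aligns on index and columns, so only the
    # missing cells are replaced; groups with no value stay NaN.
    out[cols] = out[cols].fillna(first_values)
    return out


# Small illustrative sample (assumed data, not the real dataset)
train = pd.DataFrame({
    "product_name": ["A", "A", "B", "B", "C"],
    "description": [np.nan, "desc A", "desc B", np.nan, np.nan],
    "image_url": ["img A", np.nan, np.nan, np.nan, np.nan],
    "ean": [np.nan, np.nan, np.nan, "EAN B", np.nan],
})

filled = fill_from_duplicates(train)
print(filled)
```

Because this returns a new DataFrame instead of mutating train inside a row-wise apply, it also avoids the shared-state problem that a multi-process apply would run into.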


Answer 2

Score: 0

You can try the parallel-pandas library. It has much more functionality than pandarallel and also supports lambda functions.

import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

# create DataFrame
df = pd.DataFrame(np.random.random((1_000, 100))) 

df.head()
          0         1         2         3         4
0  0.525561  0.342411  0.546397  0.016009  0.810697
1  0.206626  0.794180  0.856513  0.492897  0.446797
2  0.795895  0.790188  0.651192  0.196008  0.415761
3  0.214247  0.307092  0.873755  0.518329  0.166529
4  0.059282  0.306833  0.137190  0.206785  0.314207

# parallel analogue of the apply method
# just as an example
df.p_apply(lambda x: x[0], axis=1)

0      0.525561
1      0.206626
2      0.795895
3      0.214247
4      0.059282
         ...   
995    0.490312
996    0.239747
997    0.893300
998    0.395077
999    0.710804
Length: 1000, dtype: float64

