Parallelize a function that fills missing values from duplicates in a pandas DataFrame

Question

I have a product DataFrame with 1,838,379 rows and the columns description, image_url, ean, and product_name. The dataset contains duplicate product names, and I am trying to fill the NaN values in description, image_url, and ean with the values found in other rows that share the same product name. To do that, I implemented this function:

def fill_descriptions_images_ean_from_duplicates(row, train):
    # Boolean mask of every row sharing this row's product name
    # (this scans the whole of train on each call)
    same_name = train['product_name'] == row['product_name']
    duplicated_rows = train.loc[same_name]
    if not duplicated_rows.empty:
        descriptions = duplicated_rows['description'].dropna()
        if not descriptions.empty:
            description = descriptions.iloc[0]
            train.loc[same_name, 'description'] = train.loc[same_name, 'description'].fillna(description)

        images = duplicated_rows['image_url'].dropna()
        if not images.empty:
            image = images.iloc[0]
            train.loc[same_name, 'image_url'] = train.loc[same_name, 'image_url'].fillna(image)

        eans = duplicated_rows['ean'].dropna()
        if not eans.empty:
            ean = eans.iloc[0]
            train.loc[same_name, 'ean'] = train.loc[same_name, 'ean'].fillna(ean)

When I use apply, it takes forever to execute, so I tried pandarallel. However, pandarallel does not seem to support the lambda function, and it tells me that fill_descriptions_images_ean_from_duplicates is not defined:

from pandarallel import pandarallel
import psutil

psutil.cpu_count(logical=False)

pandarallel.initialize()
train.parallel_apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)

So I tried Dask, but nothing happened there either; the progress bar is stuck:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

def process_partition(df_partition, train):
    df_partition.apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)
    return df_partition

dask_train = dd.from_pandas(train, npartitions=7)
dask_df_applied = dask_train.map_partitions(lambda row: process_partition(row, train), meta=train.dtypes)
with ProgressBar():
    train = dask_df_applied.compute()

Sample data:

import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

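# Generate random data.
# dtype=object keeps np.nan as a real missing value; with a plain list,
# NumPy coerces it to the string 'nan' and nothing is actually missing.
data = {
    'product_name': ['Product A', 'Product B', 'Product B', 'Product C', 'Product D'] * 20,
    'description': np.random.choice(np.array([np.nan, 'Description'], dtype=object), size=100),
    'image_url': np.random.choice(np.array([np.nan, 'image_url'], dtype=object), size=100),
    'ean': np.random.choice(np.array([np.nan, 'EAN123456'], dtype=object), size=100)
}

# Create the DataFrame
train = pd.DataFrame(data)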


Answer 1

Score: 0

This is the best approach I could find. It reduces the run time to 15 minutes:

train['description'] = train.groupby('product_name')['description'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
train['image_url'] = train.groupby('product_name')['image_url'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
train['ean'] = train.groupby('product_name')['ean'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
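
A possible refinement of the same idea, offered as a sketch rather than a measured result: the three lambdas run Python code once per group, whereas the built-in "first" aggregation already skips nulls and can be broadcast back to every row with transform in one pass for all three columns. The helper name fill_from_duplicates and the tiny sample frame are made up for illustration; the column names are the ones from the question.

```python
import numpy as np
import pandas as pd


def fill_from_duplicates(df, key="product_name",
                         cols=("description", "image_url", "ean")):
    """Fill NaNs in `cols` with the first non-null value among rows sharing `key`."""
    cols = list(cols)
    # GroupBy "first" ignores nulls, so this holds each group's first non-null
    # value per column, aligned to the original row index.
    firsts = df.groupby(key)[cols].transform("first")
    out = df.copy()
    # fillna with a DataFrame aligns on index and columns, so only the
    # missing cells are replaced; groups with no value at all stay NaN.
    out[cols] = out[cols].fillna(firsts)
    return out


# Hypothetical miniature of the product table
sample = pd.DataFrame({
    "product_name": ["A", "A", "B", "B", "C"],
    "description": [np.nan, "desc A", "desc B", np.nan, np.nan],
    "image_url": ["img A", np.nan, np.nan, np.nan, "img C"],
    "ean": [np.nan, np.nan, "EAN-B", np.nan, np.nan],
})
filled = fill_from_duplicates(sample)
print(filled)
```

On the real data this would be called as train = fill_from_duplicates(train). Whether it beats the 15 minutes above depends on the number of distinct product names and on the pandas version, so it is worth timing on a slice first.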

Answer 2

Score: 0

You can try the parallel-pandas library. It has much more functionality than pandarallel and also supports lambda functions.

import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

# create DataFrame
df = pd.DataFrame(np.random.random((1_000, 100)))

df.head()
          0         1         2         3         4
0  0.525561  0.342411  0.546397  0.016009  0.810697
1  0.206626  0.794180  0.856513  0.492897  0.446797
2  0.795895  0.790188  0.651192  0.196008  0.415761
3  0.214247  0.307092  0.873755  0.518329  0.166529
4  0.059282  0.306833  0.137190  0.206785  0.314207

# parallel analogue of the apply method
# (just as an example)
df.p_apply(lambda x: x[0], axis=1)

0      0.525561
1      0.206626
2      0.795895
3      0.214247
4      0.059282
         ...   
995    0.490312
996    0.239747
997    0.893300
998    0.395077
999    0.710804
Length: 1000, dtype: float64
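
One caveat applies to any row-parallel apply for the task in the question: a worker can only fill a gap if the rows with the same product_name are in its chunk. A minimal sketch of key-based partitioning follows, using only pandas and the standard library. The function names, the bucket count, and the sample frame are assumptions made for illustration, and ThreadPoolExecutor is used only so the snippet runs anywhere; a process pool could be passed instead via executor_cls.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

COLS = ["description", "image_url", "ean"]


def fill_chunk(chunk, key="product_name", cols=COLS):
    """Fill NaNs in one chunk from the first non-null value of each key group."""
    firsts = chunk.groupby(key)[cols].transform("first")
    out = chunk.copy()
    out[cols] = out[cols].fillna(firsts)
    return out


def fill_in_key_partitions(df, key="product_name", n_parts=4,
                           executor_cls=ThreadPoolExecutor):
    # Hash the key column so every row with the same product name gets the
    # same bucket number and therefore lands in the same chunk.
    buckets = (pd.util.hash_pandas_object(df[key], index=False) % n_parts).to_numpy()
    chunks = [part for _, part in df.groupby(buckets)]
    with executor_cls(max_workers=n_parts) as pool:
        results = list(pool.map(fill_chunk, chunks))
    # Restore the original row order (assumes a unique index).
    return pd.concat(results).loc[df.index]


# Hypothetical miniature of the product table
df = pd.DataFrame({
    "product_name": ["A", "B", "B", "C", "A", "D"],
    "description": [np.nan, "desc B", np.nan, np.nan, "desc A", "desc D"],
    "image_url": [np.nan, np.nan, "img B", np.nan, np.nan, np.nan],
    "ean": ["EAN-A", np.nan, np.nan, np.nan, np.nan, np.nan],
})
filled = fill_in_key_partitions(df, n_parts=3)
print(filled)
```

Splitting by a hash of the key rather than by row position is the design choice that matters here: each chunk can then be processed without seeing the rest of the table.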

huangapple
  • Published on June 19, 2023, 01:17:08
  • Source: https://go.coder-hub.com/76501733.html