Parallelize a function that fills missing values from duplicates in a pandas DataFrame

Question
I have a product DataFrame of 1,838,379 rows with description, image_url, ean, and product_name columns. The dataset contains duplicate product names, and I am trying to fill the NaN values in description, image_url, and ean from other rows with the same product name, so I implemented this function:
def fill_descriptions_images_ean_from_duplicates(row, train):
    import pandas as pd
    # all rows that share this row's product name
    duplicated_rows = train.loc[train['product_name'] == row["product_name"]]
    if not duplicated_rows.empty:
        # take the first non-null value among the duplicates and use it to fill the gaps
        descriptions = duplicated_rows["description"].dropna()
        if not descriptions.empty:
            description = list(descriptions)[0]
            train.loc[train['product_name'] == row["product_name"], 'description'] = train.loc[train['product_name'] == row["product_name"], 'description'].fillna(description)
        images = duplicated_rows["image_url"].dropna()
        if not images.empty:
            image = list(images)[0]
            train.loc[train['product_name'] == row["product_name"], 'image_url'] = train.loc[train['product_name'] == row["product_name"], 'image_url'].fillna(image)
        eans = duplicated_rows["ean"].dropna()
        if not eans.empty:
            ean = list(eans)[0]
            train.loc[train['product_name'] == row["product_name"], 'ean'] = train.loc[train['product_name'] == row["product_name"], 'ean'].fillna(ean)
When I use apply, it takes forever to execute, so I tried pandarallel, but pandarallel doesn't support the lambda function and tells me that fill_descriptions_images_ean_from_duplicates is not defined:
from pandarallel import pandarallel
import psutil
psutil.cpu_count(logical=False)
pandarallel.initialize()
train.parallel_apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)
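A sketch of one possible workaround for the lambda limitation (unverified here): bind the extra train argument with functools.partial so that a named callable, rather than a lambda, is handed to parallel_apply.

from functools import partial
from pandarallel import pandarallel

pandarallel.initialize()

# Bind `train` up front so no lambda is needed; whether this also avoids the
# "not defined" error depends on the pandarallel version and how workers are spawned.
fill_with_train = partial(fill_descriptions_images_ean_from_duplicates, train=train)
train.parallel_apply(fill_with_train, axis=1)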
So I tried dask, but nothing happened either; the progress bar is stuck:
def process_partition(df_partition, train):
    df_partition.apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)
    return df_partition

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

dask_train = dd.from_pandas(train, npartitions=7)
dask_df_applied = dask_train.map_partitions(lambda row: process_partition(row, train), meta=train.dtypes)
with ProgressBar():
    train = dask_df_applied.compute()
Sample data:
import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random data
data = {
    'product_name': ['Product A', 'Product B', 'Product B', 'Product C', 'Product D'] * 20,
    'description': np.random.choice([np.nan, 'Description'], size=100),
    'image_url': np.random.choice([np.nan, 'image_url'], size=100),
    'ean': np.random.choice([np.nan, 'EAN123456'], size=100)
}

# Create the DataFrame
train = pd.DataFrame(data)
Answer 1
Score: 0

This is the best thing I could find; it reduces the time to 15 minutes:
train['description'] = train.groupby('product_name')['description'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
train['image_url'] = train.groupby('product_name')['image_url'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
train['ean'] = train.groupby('product_name')['ean'].transform(lambda x: x.fillna(x.dropna().iloc[0]) if x.notnull().any() else x)
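If that is still too slow, a possibly faster variant (a sketch, relying on GroupBy.first skipping NaN so that transform('first') broadcasts each group's first non-null value) avoids the Python-level lambda entirely:

# transform('first') returns each product name's first non-null value aligned to the
# original index; fillna keeps values that are already present.
for col in ['description', 'image_url', 'ean']:
    train[col] = train[col].fillna(train.groupby('product_name')[col].transform('first'))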
Answer 2
Score: 0

You can try the parallel-pandas library. It has much more functionality than pandarallel and also supports lambda functions:
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas
# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)
# create DataFrame
df = pd.DataFrame(np.random.random((1_000, 100)))
df.head()
0 1 2 3 4
0 0.525561 0.342411 0.546397 0.016009 0.810697
1 0.206626 0.794180 0.856513 0.492897 0.446797
2 0.795895 0.790188 0.651192 0.196008 0.415761
3 0.214247 0.307092 0.873755 0.518329 0.166529
4 0.059282 0.306833 0.137190 0.206785 0.314207
# parallel analogue of the apply method
# just as an example
df.p_apply(lambda x: x[0], axis=1)
0 0.525561
1 0.206626
2 0.795895
3 0.214247
4 0.059282
...
995 0.490312
996 0.239747
997 0.893300
998 0.395077
999 0.710804
Length: 1000, dtype: float64
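Applied to the original question, the call could look roughly like this (a sketch; n_cpu=8 is an arbitrary choice here, and the per-row algorithm itself remains just as expensive, only spread over more cores):

from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8, split_factor=4)

# p_apply mirrors DataFrame.apply(axis=1); the lambda closes over `train`,
# which parallel-pandas accepts according to the example above.
train.p_apply(lambda row: fill_descriptions_images_ean_from_duplicates(row, train), axis=1)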