cross merge/cartesian product in dask
Question
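Dask's merge does not, to my knowledge, accept how='cross', so the pandas call cannot be copied over unchanged. A common workaround is to merge on a temporary constant key:
import dask.dataframe as dd
# Create Dask DataFrames from your pandas DataFrame 'df', adding a constant 'key' column
ddf1 = dd.from_pandas(df.assign(key=1), npartitions=1)
ddf2 = dd.from_pandas(df.assign(key=1), npartitions=1)
# Every row matches every row on 'key', which gives the Cartesian product
merged_ddf = dd.merge(ddf1, ddf2, on='key', suffixes=('', '_y')).drop(columns='key')
The result is lazy until you call merged_ddf.compute(). Keep in mind that a full cross merge of n rows produces n² rows, so on a large dataset the output can be far larger than memory.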
How can I perform the equivalent of this cross merge in Dask?
merged_df = pd.merge(df, df, how='cross', suffixes=('', '_y'))
For example, say I have this dataframe, A:
#Niusup Niucust
#1 a
#1 b
#1 c
#2 d
#2 e
and want to obtain this one:
#Niusup Niucust_x Niucust_y
#1 a a
#1 a b
#1 a c
#1 b a
#1 b b
#1 b c
#1 c a
#1 c b
#1 c c
#2 d d
#2 d e
#2 e d
#2 e e
I need Dask because dataframe A contains 5,000,000 observations, so I expect the Cartesian product to be very large.
Thank you.
Answer 1
Score: 1
Example
import pandas as pd

data = {'#Niusup': {0: '#1', 1: '#1', 2: '#1', 3: '#2', 4: '#2'},
        'Niucust': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}}
df = pd.DataFrame(data)
Code
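# Split the frame into one sub-frame per '#Niusup' value
g = df.groupby('#Niusup')
dfs = [g.get_group(x) for x in g.groups]
# Cross-merge each group with itself (the key column is dropped on the right side), then stack the results
pd.concat([df_i.merge(df_i.drop('#Niusup', axis=1), how='cross', suffixes=('', '_y')) for df_i in dfs])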
Output:
#Niusup Niucust Niucust_y
0 #1 a a
1 #1 a b
2 #1 a c
3 #1 b a
4 #1 b b
5 #1 b c
6 #1 c a
7 #1 c b
8 #1 c c
0 #2 d d
1 #2 d e
2 #2 e d
3 #2 e e
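As a side note that is not part of the original answer: the desired output pairs rows only within the same '#Niusup' value, so it can also be read as an ordinary self-merge on that column instead of a true cross merge. The sketch below checks this on the sample data with pandas; the names pairs and expected_rows are only illustrative. dask.dataframe's merge also takes on and suffixes, so assuming the data is loaded as a Dask DataFrame the same call should carry over, though I have not tested it at 5,000,000 rows.

```python
import pandas as pd

df = pd.DataFrame({
    '#Niusup': ['#1', '#1', '#1', '#2', '#2'],
    'Niucust': ['a', 'b', 'c', 'd', 'e'],
})

# Self-merge on the group key: each row is paired with every row that
# shares its '#Niusup' value, i.e. a Cartesian product within each group.
pairs = df.merge(df, on='#Niusup', suffixes=('', '_y'))

# The row count is the sum of squared group sizes (3*3 + 2*2 for this sample).
expected_rows = int((df.groupby('#Niusup').size() ** 2).sum())
print(len(pairs), expected_rows)
```

The squared-group-size sum is also a cheap way to estimate how large the result will be before running the merge on the full dataset.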