Cross merge / Cartesian product in Dask

Question

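Depending on the Dask version, dd.merge may reject how='cross'. A workaround that relies only on an ordinary inner join is to merge on a constant key column:

import dask.dataframe as dd

# Create Dask DataFrames from your pandas DataFrame 'df', adding a constant join key
ddf1 = dd.from_pandas(df.assign(_key=1), npartitions=1)
ddf2 = dd.from_pandas(df.assign(_key=1), npartitions=1)

# An inner join on the constant key pairs every row with every row (a cross merge)
merged_ddf = dd.merge(ddf1, ddf2, on='_key', suffixes=('', '_y')).drop(columns='_key')

Note that this pairs every row with every other row, so the result has len(df) squared rows. The expected output in the question only pairs rows that share the same #Niusup, which corresponds to merging on that column rather than on a constant key.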

Original question:

How can I perform the equivalent of this cross merge in Dask?

merged_df = pd.merge(df, df, how='cross', suffixes=('', '_y'))

For example, say I have this dataframe A:

#Niusup Niucust
#1        a
#1        b 
#1        c
#2        d
#2        e

and want to obtain this one:

#Niusup Niucust_x Niucust_y
#1        a       a
#1        a       b
#1        a       c
#1        b       a 
#1        b       b
#1        b       c
#1        c       a 
#1        c       b
#1        c       c
#2        d       d 
#2        d       e
#2        e       d
#2        e       e

I need Dask because dataframe A contains 5,000,000 rows, so I expect the Cartesian product to be very large.

thank you
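To put a number on "very large", here is a rough way to estimate the output size before running anything. It is only a sketch on the small example frame above, and the variable names are illustrative. For scale, an unrestricted cross merge of 5,000,000 rows with itself would have 5,000,000² = 2.5 × 10^13 rows, while the per-group product shown in the expected output is the sum of the squared group sizes.

```python
import pandas as pd

# Small example frame from the question (column names taken from the post)
df = pd.DataFrame({
    '#Niusup': ['#1', '#1', '#1', '#2', '#2'],
    'Niucust': ['a', 'b', 'c', 'd', 'e'],
})

# Rows in the per-group product: sum over groups of (group size) squared
group_sizes = df.groupby('#Niusup').size()
per_group_rows = int((group_sizes ** 2).sum())

# Rows in an unrestricted cross merge: (total rows) squared
full_cross_rows = len(df) ** 2

print(per_group_rows, full_cross_rows)  # 13 25
```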

Answer 1

Score: 1


Example

import pandas as pd

data = {'#Niusup': {0: '#1', 1: '#1', 2: '#1', 3: '#2', 4: '#2'},
 'Niucust': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}}
df = pd.DataFrame(data)

Code

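# Split the frame into one sub-frame per '#Niusup' value
g = df.groupby('#Niusup')
dfs = [g.get_group(x) for x in g.groups]
# Cross-merge each group with itself, then stack the per-group results
pd.concat([df_i.merge(df_i.drop('#Niusup', axis=1), how='cross', suffixes=('', '_y')) for df_i in dfs])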

output:

  #Niusup Niucust Niucust_y
0      #1       a         a
1      #1       a         b
2      #1       a         c
3      #1       b         a
4      #1       b         b
5      #1       b         c
6      #1       c         a
7      #1       c         b
8      #1       c         c
0      #2       d         d
1      #2       d         e
2      #2       e         d
3      #2       e         e
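Because every pair in the output shares the same #Niusup, the per-group cross merge looks equivalent to an ordinary inner self-merge on that column, which avoids the Python-level loop over groups. The sketch below is not from the original answer: per_group_product is a hypothetical helper name, and it relies only on the .merge(on=..., suffixes=...) method that both pandas and Dask DataFrames provide.

```python
import pandas as pd

def per_group_product(left, right, key):
    """Pair every row of `left` with every row of `right` that has the same `key`.

    Hypothetical helper: it just calls .merge, so it should accept either
    pandas or Dask DataFrames (the Dask case is an assumption, not run here).
    """
    return left.merge(right, on=key, suffixes=('', '_y'))

# Example frame from the answer above
df = pd.DataFrame({
    '#Niusup': ['#1', '#1', '#1', '#2', '#2'],
    'Niucust': ['a', 'b', 'c', 'd', 'e'],
})

result = per_group_product(df, df, '#Niusup')
print(result)
```

With Dask installed, the same helper should be usable as ddf = dd.from_pandas(df, npartitions=4) followed by per_group_product(ddf, ddf, '#Niusup').compute(). Row order may then differ from the pandas output, and the result still grows with the square of each group's size.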

huangapple
  • Posted on May 17, 2023, 23:00:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76273502.html