Cross merge / Cartesian product in Dask

Question

How can I perform the equivalent of this cross merge in Dask?

    merged_df = pd.merge(df, df, how='cross', suffixes=('', '_y'))

For example, say I have this dataframe, A:

    #Niusup  Niucust
    #1       a
    #1       b
    #1       c
    #2       d
    #2       e

and I want to obtain this one:

    #Niusup  Niucust_x  Niucust_y
    #1       a          a
    #1       a          b
    #1       a          c
    #1       b          a
    #1       b          b
    #1       b          c
    #1       c          a
    #1       c          b
    #1       c          c
    #2       d          d
    #2       d          e
    #2       e          d
    #2       e          e

I need Dask because dataframe A contains 5,000,000 rows, so I expect the Cartesian product to be very large.

Thank you.
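
To gauge how large the result would be, note that the desired output pairs rows only within each `#Niusup` group, so a group of n rows contributes n × n rows. A full cross merge of 5,000,000 rows with itself would instead have 5,000,000² = 2.5 × 10^13 rows. The following is a minimal sketch of the per-group count, using only the five example rows from the question; the variable names are illustrative.

```python
from collections import Counter

# Key column from the example dataframe A in the question.
niusup = ["#1", "#1", "#1", "#2", "#2"]

# A group with n rows contributes n * n pairs to the per-group product.
group_sizes = Counter(niusup)
expected_rows = sum(n * n for n in group_sizes.values())

print(expected_rows)  # 13, matching the 13 rows of the desired output
```

On real data the same sum could be taken over the group sizes of the key column before attempting the merge, to check whether the result is feasible at all.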

Answer 1

Score: 1

Example

    import pandas as pd

    data = {'#Niusup': {0: '#1', 1: '#1', 2: '#1', 3: '#2', 4: '#2'},
            'Niucust': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}}
    df = pd.DataFrame(data)

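Code

    # Split the frame into one sub-frame per '#Niusup' value.
    g = df.groupby('#Niusup')
    dfs = [g.get_group(x) for x in g.groups]
    # Cross merge each group with itself (key column dropped on the right side), then stack the results.
    pd.concat([df_i.merge(df_i.drop('#Niusup', axis=1), how='cross', suffixes=('', '_y')) for df_i in dfs])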

Output:

      #Niusup Niucust Niucust_y
    0      #1       a         a
    1      #1       a         b
    2      #1       a         c
    3      #1       b         a
    4      #1       b         b
    5      #1       b         c
    6      #1       c         a
    7      #1       c         b
    8      #1       c         c
    0      #2       d         d
    1      #2       d         e
    2      #2       e         d
    3      #2       e         e
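
The groupwise cross merge above pairs every row with every row that shares its `#Niusup` value, which is what an inner self-merge on that key also produces. The sketch below checks this on the example data in pandas only; the helper `as_sorted` and the variable names are illustrative and not part of the original answer. Whether the keyed merge carries over to Dask at the 5,000,000-row scale is an assumption that is not tested here.

```python
import pandas as pd

df = pd.DataFrame({
    "#Niusup": ["#1", "#1", "#1", "#2", "#2"],
    "Niucust": ["a", "b", "c", "d", "e"],
})

# Groupwise cross merge, following the approach in the answer above.
groupwise = pd.concat([
    group.merge(group.drop(columns="#Niusup"), how="cross", suffixes=("", "_y"))
    for _, group in df.groupby("#Niusup")
])

# Inner self-merge on the key: each row matches every row with the same '#Niusup'.
keyed = df.merge(df, on="#Niusup", suffixes=("", "_y"))

def as_sorted(frame):
    """Return the three result columns in a canonical row order for comparison."""
    cols = ["#Niusup", "Niucust", "Niucust_y"]
    return frame[cols].sort_values(cols).reset_index(drop=True)

same_rows = as_sorted(groupwise).equals(as_sorted(keyed))
print(len(keyed), same_rows)
```

A keyed merge avoids the Python-level loop over groups, which may matter when the key column has many distinct values.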

Posted by huangapple on 2023-05-17 23:00:12