2023-05-17 23:00:12

cross merge/cartesian product in dask

Question

How can I perform the equivalent of this cross merge in Dask?

merged_df = pd.merge(df, df, how='cross', suffixes=('', '_y'))

For example, say I have this DataFrame, call it A:

#Niusup Niucust
#1      a
#1      b
#1      c
#2      d
#2      e

and want to obtain this one:

#Niusup Niucust_x Niucust_y
#1      a         a
#1      a         b
#1      a         c
#1      b         a
#1      b         b
#1      b         c
#1      c         a
#1      c         b
#1      c         c
#2      d         d
#2      e         e

I need Dask because DataFrame A contains 5,000,000 observations, so I expect the cartesian product to be very large.

Thank you.
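
The size concern can be made concrete with a little arithmetic. The sketch below is not part of the original post, and `rows_when_pairing_within_key` is a hypothetical helper name. It assumes the goal is the one shown in the expected output (pairs within each #Niusup value) and compares that row count with a full cross product of all rows.

```python
from collections import Counter

def rows_when_pairing_within_key(keys):
    """Rows produced when each row is paired with every row sharing its key."""
    return sum(n * n for n in Counter(keys).values())

# Key column of the small example from the question.
keys = ["#1", "#1", "#1", "#2", "#2"]

grouped_rows = rows_when_pairing_within_key(keys)  # 3*3 + 2*2
full_cross_rows = len(keys) ** 2                   # every row paired with every row

print(grouped_rows, full_cross_rows)  # prints: 13 25
```

By the same arithmetic, a full cross merge of 5,000,000 rows with itself would hold 25 trillion rows, whereas pairing within keys grows with the sum of the squared group sizes.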

Answer 1

Score: 1

Example

import pandas as pd

# Small sample frame matching the question
data = {'#Niusup': {0: '#1', 1: '#1', 2: '#1', 3: '#2', 4: '#2'},
        'Niucust': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}}
df = pd.DataFrame(data)

Code

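# Split the frame into one sub-frame per '#Niusup' value
g = df.groupby('#Niusup')
dfs = [g.get_group(x) for x in g.groups]
# Cross-merge each group with itself (key column dropped on the right side), then stack the results
pd.concat([df_i.merge(df_i.drop('#Niusup', axis=1), how='cross', suffixes=('', '_y')) for df_i in dfs])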

Output:

  #Niusup Niucust Niucust_y
0      #1       a         a
1      #1       a         b
2      #1       a         c
3      #1       b         a
4      #1       b         b
5      #1       b         c
6      #1       c         a
7      #1       c         b
8      #1       c         c
0      #2       d         d
1      #2       d         e
2      #2       e         d
3      #2       e         e
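
The answer above runs in pandas, while the question asks about Dask. One possible way to carry the idea over, offered as a sketch and not as the answerer's method: since the expected output pairs rows only within the same #Niusup value, an ordinary self-merge on that key yields the same pairs without `how='cross'`. The helper name `pairs_within_key` is hypothetical. It is exercised here with pandas; the Dask lines in the trailing comment are an untested assumption that relies on Dask's `DataFrame.merge` accepting the same `on` and `suffixes` arguments.

```python
import pandas as pd

def pairs_within_key(frame, key):
    """Pair every row with every row that shares the same `key` value.

    Uses only .merge(on=..., suffixes=...), so the same call shape
    applies to a pandas DataFrame and, by assumption, a Dask DataFrame.
    """
    return frame.merge(frame, on=key, suffixes=("", "_y"))

# Same sample data as in the answer above.
df = pd.DataFrame({
    "#Niusup": ["#1", "#1", "#1", "#2", "#2"],
    "Niucust": ["a", "b", "c", "d", "e"],
})

result = pairs_within_key(df, "#Niusup")
print(result)

# Assumed Dask usage (requires dask to be installed; not executed here):
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(df, npartitions=2)
#   result = pairs_within_key(ddf, "#Niusup").compute()
```

The index of the merged frame differs from the concatenated output above (it is not restarted per group), but the columns and the set of row pairs are the same. Group sizes still matter: a key with very many rows produces a quadratic number of pairs for that key.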
