How to replicate value based on distinct column values from a different df pyspark

Question
I have a df like:
df1 =
AA BB CC DD
1 X Y Z
2 M N O
3 P Q R
I have another df like:
df2 =
BB CC DD
G K O
H L P
I M Q
I want to copy all the columns and rows of df2 for every distinct value of 'AA' column of df1 and get the resultant df as:
df =
AA BB CC DD
1 X Y Z
1 G K O
1 H L P
1 I M Q
2 M N O
2 G K O
2 H L P
2 I M Q
3 P Q R
3 G K O
3 H L P
3 I M Q
What I am doing right now is:
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

AAs = df1.select("AA").distinct().rdd.flatMap(lambda x: x).collect()
out = [df1]
for i in AAs:
    temp_df = df2.withColumn('AA', lit(i)).select('AA', 'BB', 'CC', 'DD')
    out.append(temp_df)
df = reduce(DataFrame.unionAll, out)
This takes an extremely long time and fails the cluster: these are mock dataframes, and the actual dataframes are quite large in both dimensions. Is there a more efficient PySpark way of doing it? Thanks in advance.
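For context, the desired expansion can be sketched in plain Python without Spark (`df1_rows` and `df2_rows` are illustrative stand-ins for the dataframes above). It also shows the expected result size: rows(df1) + distinct(AA) × rows(df2).

```python
from itertools import product

# Mock rows mirroring df1 and df2 above (names are illustrative, not from the post).
df1_rows = [(1, "X", "Y", "Z"), (2, "M", "N", "O"), (3, "P", "Q", "R")]
df2_rows = [("G", "K", "O"), ("H", "L", "P"), ("I", "M", "Q")]

aa_values = sorted({row[0] for row in df1_rows})  # distinct 'AA' values

# Keep df1's own rows, then replicate every df2 row once per distinct AA.
result = df1_rows + [(aa, *row) for aa, row in product(aa_values, df2_rows)]

print(len(result))  # 3 + 3*3 = 12
```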
Answer 1
Score: 2
This would work:
resultDf = df1.select("AA").distinct()\
    .crossJoin(df2)\
    .union(df1)
# No Need to order the actual result, this is just for displaying this example.
resultDf.orderBy("AA").show()
That said, this is still a huge operation and can be expensive on the cluster.
Input
DF1:
+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
| 1| X| Y| Z|
| 2| M| N| O|
| 3| P| Q| R|
+---+---+---+---+
DF2:
+---+---+---+
| BB| CC| DD|
+---+---+---+
| G| K| O|
| H| L| P|
| I| M| Q|
+---+---+---+
Output:
+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
| 1| G| K| O|
| 1| X| Y| Z|
| 1| I| M| Q|
| 1| H| L| P|
| 2| M| N| O|
| 2| I| M| Q|
| 2| H| L| P|
| 2| G| K| O|
| 3| P| Q| R|
| 3| I| M| Q|
| 3| H| L| P|
| 3| G| K| O|
+---+---+---+---+
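As a sanity check, the crossJoin-then-union pattern can be mimicked in plain Python to confirm it yields the 12 rows shown above, four per AA value. This is a sketch with mock rows, not Spark code:

```python
from collections import Counter

# Mock rows standing in for the df1 and df2 Input tables above.
df1_rows = [(1, "X", "Y", "Z"), (2, "M", "N", "O"), (3, "P", "Q", "R")]
df2_rows = [("G", "K", "O"), ("H", "L", "P"), ("I", "M", "Q")]

# df1.select("AA").crossJoin(df2): pair each AA value with every df2 row.
cross = [(aa, *r2) for aa, *_rest in df1_rows for r2 in df2_rows]
# .union(df1): append df1's original rows.
result = cross + df1_rows

group_sizes = Counter(aa for aa, *_ in result)
print(len(result), dict(group_sizes))  # 12 {1: 4, 2: 4, 3: 4}
```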