How to replicate value based on distinct column values from a different df pyspark

Question


I have a df like:

    df1 = 
    AA	BB	CC	DD
    1	X	Y	Z
    2	M	N	O
    3	P	Q	R


I have another df like:

    df2 =
    BB	CC	DD
    G	K	O
    H	L	P
    I	M	Q

I want to copy all the columns and rows of df2 for every distinct value of df1's 'AA' column, and get the resulting df:

    df = 
    AA	BB	CC	DD
    1	X	Y	Z
    1	G	K	O
    1	H	L	P
    1	I	M	Q
    2	M	N	O
    2	G	K	O
    2	H	L	P
    2	I	M	Q
    3	P	Q	R
    3	G	K	O
    3	H	L	P
    3	I	M	Q
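The target above can be sketched in plain Python before worrying about Spark: every distinct 'AA' value is paired with every df2 row, and the original df1 rows are kept. This is only an illustration of the desired output (the variable names and tuple rows are stand-ins, not Spark code):

```python
from itertools import product

# Stand-ins for the example dataframes above
df1_rows = [(1, "X", "Y", "Z"), (2, "M", "N", "O"), (3, "P", "Q", "R")]
df2_rows = [("G", "K", "O"), ("H", "L", "P"), ("I", "M", "Q")]

# Pair every distinct 'AA' value with every df2 row...
aa_values = sorted({row[0] for row in df1_rows})
replicated = [(aa, *row) for aa, row in product(aa_values, df2_rows)]

# ...and keep the original df1 rows
result = df1_rows + replicated
print(len(result))  # 3 original rows + 3 AA values * 3 df2 rows = 12
```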

What I am doing right now is:

    from functools import reduce
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col, lit

    AAs = df1.select("AA").distinct().rdd.flatMap(lambda x: x).collect()
    out = []
    for i in AAs:
        # the original df1 row(s) for this 'AA' value
        dff = df1.filter(col('AA') == i)
        # a copy of df2 tagged with this 'AA' value (note: df2, not df1)
        temp_df = df2.withColumn('AA', lit(i)).select('AA', 'BB', 'CC', 'DD')
        out.append(dff.unionAll(temp_df))
    df = reduce(DataFrame.unionAll, out)

This takes extremely long and fails the cluster: these are mock dataframes, and the actual dataframes are very large. Is there a more efficient, PySpark-native way to do this? Thanks in advance.



Answer 1

Score: 2

This would work:

resultDf = df1.select("AA")\
    .crossJoin(df2)\
    .union(df1)

# No need to order the actual result; this is just to display the example.
resultDf.orderBy("AA").show()

That said, this is still a huge operation and can be expensive on the cluster.
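Why a single cross join suffices: each iteration of the question's loop tags one copy of df2 with one 'AA' value, and the union of all those copies is exactly the cross product of the distinct 'AA' values with df2's rows. A plain-Python sanity check of that equivalence (illustrative names and tuple rows, not Spark code):

```python
from itertools import product

# Stand-ins for the example data
aa_values = [1, 2, 3]
df1_rows = [(1, "X", "Y", "Z"), (2, "M", "N", "O"), (3, "P", "Q", "R")]
df2_rows = [("G", "K", "O"), ("H", "L", "P"), ("I", "M", "Q")]

# Loop formulation: per 'AA' value, the original row(s) plus a tagged copy of df2
loop_result = []
for i in aa_values:
    loop_result += [r for r in df1_rows if r[0] == i]
    loop_result += [(i, *r) for r in df2_rows]

# Cross-join formulation: one cross product, then the original rows
join_result = [(aa, *r) for aa, r in product(aa_values, df2_rows)] + df1_rows

# Both constructions yield the same multiset of rows
assert sorted(loop_result) == sorted(join_result)
```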

Input

DF1:

+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
|  1|  X|  Y|  Z|
|  2|  M|  N|  O|
|  3|  P|  Q|  R|
+---+---+---+---+

DF2:

+---+---+---+
| BB| CC| DD|
+---+---+---+
|  G|  K|  O|
|  H|  L|  P|
|  I|  M|  Q|
+---+---+---+

Output:

+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
|  1|  G|  K|  O|
|  1|  X|  Y|  Z|
|  1|  I|  M|  Q|
|  1|  H|  L|  P|
|  2|  M|  N|  O|
|  2|  I|  M|  Q|
|  2|  H|  L|  P|
|  2|  G|  K|  O|
|  3|  P|  Q|  R|
|  3|  I|  M|  Q|
|  3|  H|  L|  P|
|  3|  G|  K|  O|
+---+---+---+---+

huangapple
  • Posted on 2023-05-08 02:40:33
  • Please keep this link when reposting: https://go.coder-hub.com/76195685.html