2023年5月8日 02:40:33go评论158阅读模式

英文:

How to replicate value based on distinct column values from a different df pyspark

问题

Sure, here's the translation of the code part you provided:

我有一个类似的数据框：

    df1 = 
    AA	BB	CC	DD
    1	X	Y	Z
    2	M	N	O
    3	P	Q	R


我有另一个数据框：

    df2 =
    BB	CC	DD
    G	K	O
    H	L	P
    I	M	Q

我想要为df1的每个不同的'AA'列值复制df2的所有列和行，并得到结果数据框如下：

    df = 
    AA	BB	CC	DD
    1	X	Y	Z
    1	G	K	O
    1	H	L	P
    1	I	M	Q
    2	M	N	O
    2	G	K	O
    2	H	L	P
    2	I	M	Q
    3	P	Q	R
    3	G	K	O
    3	H	L	P
    3	I	M	Q

我目前正在做的是：

    AAs = df1.select("AA").distinct().rdd.flatMap(lambda x: x).collect()
    out= []
    for i in AAs:
      dff = df1.filter(col('AA')==i)
      temp_df = (df1.orderBy(rand())
            .withColumn('AA', lit(i))
            )
      out.append(temp_df)
    df = reduce(DataFrame.unionAll, out)

这个方法非常耗时，并且由于这些是模拟数据框，实际数据框的维度非常大，因此集群失败了。有没有更有效的Pyspark方法？提前感谢。

Please note that the translation focuses on the code part and doesn't include any additional information or responses to your specific request.

英文:

I have a df like:

df1 = 
AA	BB	CC	DD
1	X	Y	Z
2	M	N	O
3	P	Q	R

I have another df like:

df2 =
BB	CC	DD
G	K	O
H	L	P
I	M	Q

I want to copy all the columns and rows of df2 for every distinct value of 'AA' column of df1 and get the resultant df as:

df = 
AA	BB	CC	DD
1	X	Y	Z
1	G	K	O
1	H	L	P
1	I	M	Q
2	M	N	O
2	G	K	O
2	H	L	P
2	I	M	Q
3	P	Q	R
3	G	K	O
3	H	L	P
3	I	M	Q

What I am doing right now is:

AAs = df1.select(&quot;AA&quot;).distinct().rdd.flatMap(lambda x: x).collect()
out= []
for i in AAs:
  dff = df1.filter(col(&#39;AA&#39;)==i)
  temp_df = (df1.orderBy(rand())
        .withColumn(&#39;AA&#39;, lit(i))
        )
  out.append(temp_df)
df = reduce(DataFrame.unionAll, out)

Which is taking extremely long time and failing the cluster as these are mock dataframes, actual dataframes are quite large in dimension. Any Pysparky way of doing it? Thanks in advance.

答案1

得分: 2

这是您要翻译的内容：

这会起作用：

resultDf= df.select("AA")\
.crossJoin(df2)\
.union(df)

# 不需要对实际结果进行排序，这只是为了展示这个示例。
resultDf.orderBy("AA").show()

尽管如此，这仍然是一个庞大的操作，可能会在集群上消耗大量资源。

输入

DF1:

+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
|  1|  X|  Y|  Z|
|  2|  M|  N|  O|
|  3|  P|  Q|  R|
+---+---+---+---+

DF2:

+---+---+---+
| BB| CC| DD|
+---+---+---+
|  G|  K|  O|
|  H|  L|  P|
|  I|  M|  Q|
+---+---+---+

输出:

+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
|  1|  G|  K|  O|
|  1|  X|  Y|  Z|
|  1|  I|  M|  Q|
|  1|  H|  L|  P|
|  2|  M|  N|  O|
|  2|  I|  M|  Q|
|  2|  H|  L|  P|
|  2|  G|  K|  O|
|  3|  P|  Q|  R|
|  3|  I|  M|  Q|
|  3|  H|  L|  P|
|  3|  G|  K|  O|
+---+---+---+---+

英文:

This would work:

resultDf= df.select(&quot;AA&quot;)\
.crossJoin(df2)\
.union(df)

# No Need to order the actual result, this is just for displaying this example.
resultDf.orderBy(&quot;AA&quot;).show()

Although, this would still be a huge operation and can be expensive on the cluster.

Input

DF1:

+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
|  1|  X|  Y|  Z|
|  2|  M|  N|  O|
|  3|  P|  Q|  R|
+---+---+---+---+

DF2:

+---+---+---+
| BB| CC| DD|
+---+---+---+
|  G|  K|  O|
|  H|  L|  P|
|  I|  M|  Q|
+---+---+---+

Output:

+---+---+---+---+
| AA| BB| CC| DD|
+---+---+---+---+
|  1|  G|  K|  O|
|  1|  X|  Y|  Z|
|  1|  I|  M|  Q|
|  1|  H|  L|  P|
|  2|  M|  N|  O|
|  2|  I|  M|  Q|
|  2|  H|  L|  P|
|  2|  G|  K|  O|
|  3|  P|  Q|  R|
|  3|  I|  M|  Q|
|  3|  H|  L|  P|
|  3|  G|  K|  O|
+---+---+---+---+

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

根据不同列值从不同的数据框中复制值。

问题

答案1

Excluding pydantic model fields only when returned as part of a FastAPI call.

JSON格式错误 – Amazon Personalize 的批处理推断作业输入

如何将变量值更新为当前值（当前值 * 100 / colsums()）

如何根据除X、Y、Z之外的一个值来更改3D散点图的颜色。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论