PySpark: Update column values from dataframe A with dataframe B's values with matching ID
Question
Assume we have dfA:
| ID | Scores |
|----|--------|
| A  | 20     |
| A  | 40     |
| A  | 60     |
| B  | 10     |
| B  | 90     |
and dfB:
| ID | Scores |
|----|--------|
| A  | 60     |
| B  | 90     |
Expected OUTPUT:
| ID | Scores |
|----|--------|
| A  | 60     |
| A  | 60     |
| A  | 60     |
| B  | 90     |
| B  | 90     |
How can I update the Scores column in dfA with dfB's Scores for each matching ID in PySpark?
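For reproducibility, here is a minimal sketch of how the example DataFrames above could be created (the variable names dfA and dfB and the SparkSession setup are assumptions, not part of the original question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data taken from the tables above
dfA = spark.createDataFrame(
    [("A", 20), ("A", 40), ("A", 60), ("B", 10), ("B", 90)],
    ["ID", "Scores"],
)
dfB = spark.createDataFrame(
    [("A", 60), ("B", 90)],
    ["ID", "Scores"],
)
```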
Answer 1
Score: 1
Your DataFrames
df_1
+---+------+
| ID|Scores|
+---+------+
| A| 20|
| A| 40|
| A| 60|
| B| 10|
| B| 90|
+---+------+
df_2
+---+------+
| ID|Scores|
+---+------+
| A| 60|
| B| 90|
+---+------+
- Rename the Scores column to old_scores in df_1 before joining.
df_1 = df_1.withColumnRenamed("Scores", "old_scores")
- Use inner join to match the two DataFrames using the common key column.
df = df_1.join(df_2, "ID")
- Drop the old_scores column from df_1.
df.drop("old_scores").show()
Output:
+---+------+
| ID|Scores|
+---+------+
| A| 60|
| A| 60|
| A| 60|
| B| 90|
| B| 90|
+---+------+
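Note that the inner join keeps only the df_1 rows whose ID also appears in df_2. If rows without a match should keep their original score instead of being dropped, one possible variant is a left join combined with coalesce. This is only a sketch, and it assumes df_1 and df_2 still hold the original data shown above (i.e. before the rename step):

```python
from pyspark.sql import functions as F

# Left join keeps every df_1 row; coalesce falls back to the
# original score where no matching ID exists in df_2
df = (
    df_1.withColumnRenamed("Scores", "old_scores")
        .join(df_2, "ID", "left")
        .withColumn("Scores", F.coalesce(F.col("Scores"), F.col("old_scores")))
        .drop("old_scores")
)
df.show()
```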