Fill NA in PySpark DataFrame by group with values from Pandas lookup table
Question
I have a PySpark DataFrame with missing values in col2
that I would like to impute, based on values in col1
. For example:
df
id col1 col2
0 A 1
1 A NA
2 B 2
3 B NA
4 B 3
I would like to impute these missing values using a given Pandas lookup table:
pdf_lookup
id col1 col2
0 A 4
1 B 5
So the desired result would be the following PySpark DataFrame:
id col1 col2
0 A 1
1 A 4
2 B 2
3 B 5
4 B 3
What would be the most efficient way to do this? A scalable solution would be ideal since df
may be very large with up to hundreds of columns (i.e. col3
, ..., col500
) that need to be imputed based on col1
. Any suggestions would be appreciated!
Answer 1 (Score: 1)
You can do this with a join followed by coalesce, which keeps the first non-null value of the two columns:
from pyspark.sql.functions import col, coalesce

# pdf_lookup is a Pandas DataFrame, so convert it to Spark first, and
# rename its col2 to avoid a name clash with df's col2 after the join
pdf_lookup = spark.createDataFrame(pdf_lookup).select(col("col1"), col("col2").alias("col2_tmp"))
df.join(pdf_lookup, ["col1"], "left").withColumn("col2", coalesce(col("col2"), col("col2_tmp"))).drop("col2_tmp").show()