Fill NA in PySpark DataFrame by group with values from Pandas lookup table
Question
I have a PySpark DataFrame with missing values in col2 that I would like to impute, based on values in col1. For example:
df
id    col1    col2
0      A       1
1      A      NA
2      B       2
3      B      NA
4      B       3
I would like to impute these missing values using a given Pandas lookup table:
pdf_lookup
id    col1    col2
0      A       4
1      B       5
So the desired result would be the following PySpark DataFrame:
id    col1    col2
0      A       1
1      A       4
2      B       2
3      B       5
4      B       3
What would be the most efficient way to do this? A scalable solution would be ideal since df may be very large with up to hundreds of columns (i.e. col3, ..., col500) that need to be imputed based on col1. Any suggestions would be appreciated!
Answer 1
Score: 1
You can do it with a join followed by coalesce, which keeps the first non-null value of the two columns. Since pdf_lookup is a Pandas DataFrame, convert it to a Spark DataFrame first (assuming an active SparkSession named spark):

from pyspark.sql.functions import col, coalesce

# Convert the Pandas lookup table to Spark and rename its value column
# so it does not clash with df's col2 after the join
pdf_lookup = spark.createDataFrame(pdf_lookup).select(col("col1"), col("col2").alias("col2_tmp"))

# Left join on col1, keep the original value where present,
# otherwise fall back to the lookup value
df.join(pdf_lookup, ["col1"], "left").withColumn("col2", coalesce(col("col2"), col("col2_tmp"))).drop("col2_tmp").show()