Fill NA in PySpark DataFrame by group with values from Pandas lookup table

Question

I have a PySpark DataFrame with missing values in col2 that I would like to impute, based on values in col1. For example:

df
id    col1    col2
0      A       1
1      A      NA
2      B       2
3      B      NA
4      B       3

I would like to impute these missing values using a given Pandas lookup table:

pdf_lookup
id    col1    col2
0      A       4
1      B       5

So the desired result would be the following PySpark DataFrame:

id    col1    col2
0      A       1
1      A       4
2      B       2
3      B       5
4      B       3

What would be the most efficient way to do this? A scalable solution would be ideal, since df may be very large, with up to hundreds of columns (e.g. col3, ..., col500) that need to be imputed based on col1. Any suggestions would be appreciated!

Answer 1

Score: 1

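pdf_lookup is a Pandas DataFrame, so convert it to a Spark DataFrame first. Then left-join it to df on col1 and use coalesce to keep the existing col2 value where it is non-null and fall back to the lookup value otherwise:

from pyspark.sql.functions import coalesce, col

# Assumes an active SparkSession named `spark`
lookup = spark.createDataFrame(pdf_lookup).select(col("col1"), col("col2").alias("col2_tmp"))
df.join(lookup, ["col1"], "left").withColumn("col2", coalesce(col("col2"), col("col2_tmp"))).drop("col2_tmp").show()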

huangapple
  • Posted on June 8, 2023, 23:48:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76433635.html