Fill NA in PySpark DataFrame by group with values from Pandas lookup table
Question
I have a PySpark DataFrame with missing values in col2
that I would like to impute, based on values in col1
. For example:
df
id col1 col2
0 A 1
1 A NA
2 B 2
3 B NA
4 B 3
I would like to impute these missing values using a given Pandas lookup table:
pdf_lookup
id col1 col2
0 A 4
1 B 5
So the desired result would be the following PySpark DataFrame:
id col1 col2
0 A 1
1 A 4
2 B 2
3 B 5
4 B 3
What would be the most efficient way to do this? A scalable solution would be ideal since df
may be very large with up to hundreds of columns (i.e. col3
, ..., col500
) that need to be imputed based on col1
. Any suggestions would be appreciated!
Answer 1 (Score: 1)
You can do this with a join followed by coalesce, which keeps the first non-null value of the two columns:
from pyspark.sql.functions import col, coalesce

# pdf_lookup is a Pandas DataFrame, so convert it to Spark first, and
# rename its col2 to avoid a name clash with df's col2 after the join
pdf_lookup = spark.createDataFrame(pdf_lookup).select(col("col1"), col("col2").alias("col2_tmp"))
df.join(pdf_lookup, ["col1"], "left").withColumn("col2", coalesce(col("col2"), col("col2_tmp"))).drop("col2_tmp").show()