Question

I would like to condense a sparse data frame into a single record per key by removing null values and selecting the latest value based on version. Below is an illustration with sample data.

Suppose the input data frame looks like this:

key	version	A	B	C
Key1	1	A1	Null	Null
Key1	1	Null	B1	Null
Key1	1	Null	Null	C1
Key1	2	A2	Null	Null
Key1	2	Null	Null	C2

It should be converted to the following:

key	A	B	C
Key1	A2	B1	C2

Note that the output data frame does not have the version column. Column A has two values, A1 and A2; we should pick the one with the latest version (2).

Thank you

Answer 1

Score: 1

You can use the first function over an ordered window:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Newest version first within each key. The explicit full-partition frame matters:
# with orderBy, the default frame ends at the current row, so rows with the latest
# version would not see values that exist only in older versions.
w = Window.partitionBy('key').orderBy(F.desc('version')) \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# For each column, take the first non-null value in that order. Every row of a key
# then carries the same values, so distinct() leaves one row per key.
df2 = df.select('key',
    *[F.first(F.col(c), ignorenulls=True).over(w).alias(c) for c in ['A', 'B', 'C']]
).distinct()
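
To sanity-check the expected result on a small sample without a Spark session, the same rule (per key, newest non-null value per column) can be sketched in plain Python. This is only an illustration of the semantics, not part of the answer's Spark code: the function name `condense`, its parameters, and the list-of-dicts row layout are assumptions made for this example, with `None` standing in for Null.

```python
from itertools import groupby
from operator import itemgetter


def condense(rows, key="key", version="version", columns=("A", "B", "C")):
    """Per key, keep the non-null value with the highest version for each column.

    Hypothetical helper for illustration; rows are plain dicts, None means null.
    """
    result = []
    # groupby needs its input sorted by the grouping key.
    ordered = sorted(rows, key=itemgetter(key))
    for k, group in groupby(ordered, key=itemgetter(key)):
        # Newest version first, mirroring orderBy(F.desc('version')).
        newest_first = sorted(group, key=itemgetter(version), reverse=True)
        record = {key: k}
        for c in columns:
            # First non-null value in that order, like first(..., ignorenulls=True).
            record[c] = next((r[c] for r in newest_first if r[c] is not None), None)
        result.append(record)
    return result


rows = [
    {"key": "Key1", "version": 1, "A": "A1", "B": None, "C": None},
    {"key": "Key1", "version": 1, "A": None, "B": "B1", "C": None},
    {"key": "Key1", "version": 1, "A": None, "B": None, "C": "C1"},
    {"key": "Key1", "version": 2, "A": "A2", "B": None, "C": None},
    {"key": "Key1", "version": 2, "A": None, "B": None, "C": "C2"},
]

condensed = condense(rows)
print(condensed)
# [{'key': 'Key1', 'A': 'A2', 'B': 'B1', 'C': 'C2'}]
```

As in the windowed version, if two rows of the same version both hold a non-null value for a column, which one wins depends on row order, so the input should have at most one non-null value per key, version, and column.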
