压缩Spark DataFrame,选择最新的数值并移除空值。

huangapple go评论53阅读模式
英文:

Condense spark dataframe by selecting latest value and removing the nulls

问题

我想将给定的稀疏数据框压缩为单个记录,通过删除空值并根据版本选择最新的值来实现。以下是带有示例数据的说明。

假设有一个输入数据框,格式如下:

key version A B C
Key1 1 A1 Null Null
Key1 1 Null B1 Null
Key1 1 Null Null C1
key1 2 A2 Null Null
key1 2 Null Null C2

应该转换为以下格式:

key A B C
Key1 A2 B1 C2

请注意,输出数据框不包含版本列。对于列A,有两个值A1和A2,我们应该选择具有最新版本(2)的值。

英文:

I would like to condense the given sparse data frame for the given key into a single record by removing null values and selecting the latest value based on version. below is an illustration with sample data

Let's say there is an input data frame in the below format

key version A B C
Key1 1 A1 Null Null
Key1 1 Null B1 Null
Key1 1 Null Null C1
key1 2 A2 Null Null
key1 2 Null Null C2

should get converted to the below format

key A B C
Key1 A2 B1 C2

Not that the output data frame doesn't have version column. For column A there are two values A1, A2, we should pick values that have the latest version (2).

Thank you

答案1

得分: 1

你可以在有序窗口上使用 first 函数:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('key').orderBy(F.desc('version')) \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df2 = df.select('key',
    *[F.first(F.col(c), ignorenulls=True).over(w).alias(c) for c in ['A', 'B', 'C']]
).distinct()
英文:

You can use first function over an ordered window:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('key').orderBy(F.desc('version')) \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df2 = df.select('key',
    *[F.first(F.col(c), ignorenulls=True).over(w).alias(c) for c in ['A', 'B', 'C']]
).distinct()

huangapple
  • 本文由 发表于 2023年7月20日 20:38:39
  • 转载请务必保留本文链接:https://go.coder-hub.com/76729981.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定