Combine rows in a PySpark DataFrame to fill in empty columns

Question

I have the following PySpark DataFrame:

| Car | Time | Val1 | Val2 | Val3 |
|-----|------|------|------|------|
| 1   | 1    | None | 1.5  | None |
| 1   | 1    | 3.5  | None | None |
| 1   | 1    | None | None | 3.4  |
| 1   | 2    | 2.5  | None | None |
| 1   | 2    | None | 6.0  | None |
| 1   | 2    | None | None | 7.3  |

I want to fill in the gaps and combine these rows, using the Car/Time columns as a kind of key. Specifically, if the Car/Time values are identical for two (or more) rows, combine all of those rows into one. It is guaranteed that only one of Val1/Val2/Val3 is filled in for each of the duplicate rows; you will never have two rows with the same values in the Car/Time columns but different, non-null values in another column. The resulting DataFrame should therefore look like this:

| Car | Time | Val1 | Val2 | Val3 |
|-----|------|------|------|------|
| 1   | 1    | 3.5  | 1.5  | 3.4  |
| 1   | 2    | 2.5  | 6.0  | 7.3  |

Thanks in advance for your help.

Answer 1

Score: 2

You can group by Car and Time and use the aggregate function first with the ignorenulls flag set to True:

import pyspark.sql.functions as F

# `spark` is assumed to be an active SparkSession (as in the pyspark shell or a notebook).
# Sample data; the last row (Car 2) is an extra group with only a single entry.
data = [
    {"Car": 1, "Time": 1, "Val1": None, "Val2": 1.5, "Val3": None},
    {"Car": 1, "Time": 1, "Val1": 3.5, "Val2": None, "Val3": None},
    {"Car": 1, "Time": 1, "Val1": None, "Val2": None, "Val3": 3.4},
    {"Car": 1, "Time": 2, "Val1": 2.5, "Val2": None, "Val3": None},
    {"Car": 1, "Time": 2, "Val1": None, "Val2": 6.0, "Val3": None},
    {"Car": 1, "Time": 2, "Val1": None, "Val2": None, "Val3": 7.3},
    {"Car": 2, "Time": 3, "Val1": None, "Val2": None, "Val3": 9.2},
]

df = spark.createDataFrame(data)

# For each Car/Time group, take the first non-null value of every Val column.
df.groupBy("Car", "Time").agg(
    F.first("Val1", ignorenulls=True).alias("Val1"),
    F.first("Val2", ignorenulls=True).alias("Val2"),
    F.first("Val3", ignorenulls=True).alias("Val3"),
).show()

The output is:

+---+----+----+----+----+
|Car|Time|Val1|Val2|Val3|
+---+----+----+----+----+
|  1|   1| 3.5| 1.5| 3.4|
|  1|   2| 2.5| 6.0| 7.3|
|  2|   3|null|null| 9.2|
+---+----+----+----+----+

I added one extra row (Car 2, Time 3) just to check how it behaves when a group has only a single entry; it looks fine.
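If there are more value columns, a minimal sketch of the same idea (my generalization, not from the original answer; it assumes Car and Time are the only key columns) can build the aggregation expressions from the DataFrame's schema instead of listing them by hand:

import pyspark.sql.functions as F

# Hypothetical generalization: aggregate every non-key column the same way.
key_cols = ["Car", "Time"]  # assumed grouping key
value_cols = [c for c in df.columns if c not in key_cols]

# One first(..., ignorenulls=True) expression per value column, keeping its name.
agg_exprs = [F.first(c, ignorenulls=True).alias(c) for c in value_cols]

df.groupBy(*key_cols).agg(*agg_exprs).show()

Because each group is guaranteed to contain at most one non-null value per column, an aggregate such as F.max (which also ignores nulls) would give the same result; first with ignorenulls=True simply makes the intent explicit.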

