Combine rows in a PySpark DataFrame to fill in empty columns

Question


I have the following PySpark DataFrame:

| Car | Time | Val1 | Val2 | Val3 |
|-----|------|------|------|------|
| 1   | 1    | None | 1.5  | None |
| 1   | 1    | 3.5  | None | None |
| 1   | 1    | None | None | 3.4  |
| 1   | 2    | 2.5  | None | None |
| 1   | 2    | None | 6.0  | None |
| 1   | 2    | None | None | 7.3  |

I want to fill in the gaps and combine these rows, using the Car/Time columns as a key of sorts. Specifically, if two (or more) rows have identical Car/Time values, combine them all into one row. It is guaranteed that only one of Val1/Val2/Val3 is filled in on each of the duplicate rows; there will never be a case where two rows share the same Car/Time values but hold different, non-None values in the same other column. The resulting DataFrame should therefore look like this:

| Car | Time | Val1 | Val2 | Val3 |
|-----|------|------|------|------|
| 1   | 1    | 3.5  | 1.5  | 3.4  |
| 1   | 2    | 2.5  | 6.0  | 7.3  |

Thanks in advance for your help.

Answer 1

Score: 2

You can group by Car and Time and use the aggregate function first with the ignorenulls flag set to true:

    import pyspark.sql.functions as F

    # `spark` is the SparkSession provided by the pyspark shell / notebook
    data = [
        {"Car": 1, "Time": 1, "Val1": None, "Val2": 1.5, "Val3": None},
        {"Car": 1, "Time": 1, "Val1": 3.5, "Val2": None, "Val3": None},
        {"Car": 1, "Time": 1, "Val1": None, "Val2": None, "Val3": 3.4},
        {"Car": 1, "Time": 2, "Val1": 2.5, "Val2": None, "Val3": None},
        {"Car": 1, "Time": 2, "Val1": None, "Val2": 6.0, "Val3": None},
        {"Car": 1, "Time": 2, "Val1": None, "Val2": None, "Val3": 7.3},
        {"Car": 2, "Time": 3, "Val1": None, "Val2": None, "Val3": 9.2},
    ]
    df = spark.createDataFrame(data)

    # first(..., ignorenulls=True) picks the single non-null value in each group
    df.groupBy("Car", "Time").agg(
        F.first("Val1", ignorenulls=True).alias("Val1"),
        F.first("Val2", ignorenulls=True).alias("Val2"),
        F.first("Val3", ignorenulls=True).alias("Val3"),
    ).show()

Output:

    +---+----+----+----+----+
    |Car|Time|Val1|Val2|Val3|
    +---+----+----+----+----+
    |  1|   1| 3.5| 1.5| 3.4|
    |  1|   2| 2.5| 6.0| 7.3|
    |  2|   3|null|null| 9.2|
    +---+----+----+----+----+

The extra row (Car 2, Time 3) was added just to check how this behaves when a group has only one entry; it works fine there as well.
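
As a side note (my own sketch, not part of the original answer): if there are many value columns, the same aggregation can be built dynamically with a list comprehension over the non-key columns instead of spelling out each first(...) call by hand. This assumes the df created above.

    import pyspark.sql.functions as F

    key_cols = ["Car", "Time"]
    # every non-key column gets the same first(..., ignorenulls=True) aggregation
    val_cols = [c for c in df.columns if c not in key_cols]
    aggs = [F.first(c, ignorenulls=True).alias(c) for c in val_cols]

    # reuses the df built in the snippet above; output matches the table shown there
    df.groupBy(*key_cols).agg(*aggs).show()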

