Combine rows in a PySpark DataFrame to fill in empty columns

Question

I have the following PySpark DataFrame:

| Car | Time | Val1 | Val2 | Val3 |
|-----|------|------|------|------|
| 1   | 1    | None | 1.5  | None |
| 1   | 1    | 3.5  | None | None |
| 1   | 1    | None | None | 3.4  |
| 1   | 2    | 2.5  | None | None |
| 1   | 2    | None | 6.0  | None |
| 1   | 2    | None | None | 7.3  |

I want to fill in the gaps and combine these rows, using the Car/Time columns as a kind of key. Specifically, if the Car/Time values are identical for two (or more) rows, combine all of those rows into one. It is guaranteed that only one of Val1/Val2/Val3 is filled in for each of the duplicate rows; you will never have two rows with the same values in the Car/Time columns but different, non-null values in another column. The resulting DataFrame should therefore look like this:

| Car | Time | Val1 | Val2 | Val3 |
|-----|------|------|------|------|
| 1   | 1    | 3.5  | 1.5  | 3.4  |
| 1   | 2    | 2.5  | 6.0  | 7.3  |

Thanks in advance for your help.

Answer 1

Score: 2

You can group by Car and Time and use the aggregate function first with the ignorenulls flag set to True:

import pyspark.sql.functions as F

# `spark` is assumed to be an active SparkSession (as in the pyspark shell or a notebook).
# Sample data; the last row (Car 2) is an extra group with only a single entry.
data = [
    {"Car": 1, "Time": 1, "Val1": None, "Val2": 1.5, "Val3": None},
    {"Car": 1, "Time": 1, "Val1": 3.5, "Val2": None, "Val3": None},
    {"Car": 1, "Time": 1, "Val1": None, "Val2": None, "Val3": 3.4},
    {"Car": 1, "Time": 2, "Val1": 2.5, "Val2": None, "Val3": None},
    {"Car": 1, "Time": 2, "Val1": None, "Val2": 6.0, "Val3": None},
    {"Car": 1, "Time": 2, "Val1": None, "Val2": None, "Val3": 7.3},
    {"Car": 2, "Time": 3, "Val1": None, "Val2": None, "Val3": 9.2},
]

df = spark.createDataFrame(data)

# For each Car/Time group, take the first non-null value of every Val column.
df.groupBy("Car", "Time").agg(
    F.first("Val1", ignorenulls=True).alias("Val1"),
    F.first("Val2", ignorenulls=True).alias("Val2"),
    F.first("Val3", ignorenulls=True).alias("Val3"),
).show()

The output is:

+---+----+----+----+----+
|Car|Time|Val1|Val2|Val3|
+---+----+----+----+----+
|  1|   1| 3.5| 1.5| 3.4|
|  1|   2| 2.5| 6.0| 7.3|
|  2|   3|null|null| 9.2|
+---+----+----+----+----+

I added one extra row (Car 2, Time 3) just to check how it behaves when a group has only a single entry; it looks fine.
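If there are more value columns, a minimal sketch of the same idea (my generalization, not from the original answer; it assumes Car and Time are the only key columns) can build the aggregation expressions from the DataFrame's schema instead of listing them by hand:

import pyspark.sql.functions as F

# Hypothetical generalization: aggregate every non-key column the same way.
key_cols = ["Car", "Time"]  # assumed grouping key
value_cols = [c for c in df.columns if c not in key_cols]

# One first(..., ignorenulls=True) expression per value column, keeping its name.
agg_exprs = [F.first(c, ignorenulls=True).alias(c) for c in value_cols]

df.groupBy(*key_cols).agg(*agg_exprs).show()

Because each group is guaranteed to contain at most one non-null value per column, an aggregate such as F.max (which also ignores nulls) would give the same result; first with ignorenulls=True simply makes the intent explicit.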

