Merge rows in spark scala Dataframe and apply aggregate function

Question
I have a following Dataframe:
+---------------+----------+----------+----------+
|notification_id|       el1|       el2|is_deleted|
+---------------+----------+----------+----------+
|notificationId1|element1_1|element1_2|     false|
|notificationId2|element2_1|element2_2|     false|
|notificationId3|element3_1|element3_2|     false|
|notificationId1|      null|      null|      true|
|notificationId4|      null|      null|      true|
+---------------+----------+----------+----------+
The primary key in this example is notification_id.
- Rows with is_deleted = true always have null values in every column other than the primary key.
- Rows with is_deleted = false have a unique primary key.
I would like to merge rows that share the same primary key, so that the resulting dataframe has a merged is_deleted column:
+---------------+----------+----------+----------+
|notification_id|       el1|       el2|is_deleted|
+---------------+----------+----------+----------+
|notificationId1|element1_1|element1_2|      true|
|notificationId2|element2_1|element2_2|     false|
|notificationId3|element3_1|element3_2|     false|
|notificationId4|      null|      null|      true|
+---------------+----------+----------+----------+
Answer 1

Score: 0
You can group by the primary key and use the any() aggregator on the is_deleted column, which yields true if any of the rows sharing that primary key has is_deleted = true:
import org.apache.spark.sql.functions.{expr, first}

val df_result = df_in.groupBy("notification_id").agg(
  // take the first non-null value per key for the payload columns
  first("el1", ignoreNulls = true).alias("el1"),
  first("el2", ignoreNulls = true).alias("el2"),
  // any() is true if at least one row in the group has is_deleted = true (Spark 3.0+)
  expr("any(is_deleted)").alias("is_deleted")
)
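To make the semantics of the aggregation concrete, here is a plain-Scala sketch (no Spark required) of what the groupBy/agg computes per key: first non-null value for each payload column, plus a boolean OR of is_deleted. The Row case class and mergeByKey are illustrative names, not part of the Spark API:

```scala
// Plain-Scala model of groupBy + first(ignoreNulls = true) + any(is_deleted).
case class Row(id: String, el1: Option[String], el2: Option[String], isDeleted: Boolean)

def mergeByKey(rows: Seq[Row]): Map[String, Row] =
  rows.groupBy(_.id).map { case (id, group) =>
    id -> Row(
      id,
      group.flatMap(_.el1).headOption, // first non-null el1
      group.flatMap(_.el2).headOption, // first non-null el2
      group.exists(_.isDeleted)        // any(is_deleted)
    )
  }
```

Note that unlike this deterministic sketch, Spark's first() picks a non-null value in an unspecified row order; here that is harmless because each key has at most one row with non-null payload columns.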