合并 Spark Scala 数据框中的行并应用聚合函数。

huangapple go评论50阅读模式
英文:

Merge rows in spark scala Dataframe and apply aggregate function

问题

我有一个以下的数据框:

| notification_id|       el1|       el2|is_deleted| 
+---------------+----------+----------+----------+ 
|notificationId1|element1_1|element1_2|     false| 
|notificationId2|element2_1|element2_2|     false| 
|notificationId3|element3_1|element3_2|     false| 
|notificationId1|      null|      null|      true| 
|notificationId4|      null|      null|      true| 
+---------------+----------+----------+----------+

在这个示例中,主键是notification_id

  • 具有is_deleted = true的行始终除了主键以外的其他列都具有空值。
  • 具有is_deleted = false的行具有唯一的主键。

我想要合并具有相同主键的行,以获得具有合并is_deleted列的数据框:

| notification_id|       el1|       el2|is_deleted| 
+---------------+----------+----------+----------+ 
|notificationId1|element1_1|element1_2|      true| 
|notificationId2|element2_1|element2_2|     false| 
|notificationId3|element3_1|element3_2|     false| 
|notificationId4|      null|      null|      true| 
+---------------+----------+----------+----------+
英文:

I have a following Dataframe:

| notification_id|       el1|       el2|is_deleted| 
+---------------+----------+----------+----------+ 
|notificationId1|element1_1|element1_2|     false| 
|notificationId2|element2_1|element2_2|     false| 
|notificationId3|element3_1|element3_2|     false| 
|notificationId1|      null|      null|      true| 
|notificationId4|      null|      null|      true| 
+---------------+----------+----------+----------+

The primary key in this example is notification_id.

  • The rows that have is_deleted = true, always have null values for other column except primary key.
  • The rows with is_deleted = false have a unique primary key.

I would like to merge the rows with the same primary key in order to obtain dataframe with merged is_delete column:

| notification_id|       el1|       el2|is_deleted| 
+---------------+----------+----------+----------+ 
|notificationId1|element1_1|element1_2|      true| 
|notificationId2|element2_1|element2_2|     false| 
|notificationId3|element3_1|element3_2|     false| 
|notificationId4|      null|      null|      true| 
+---------------+----------+----------+----------+

答案1

得分: 0

你可以按主键分组,然后在is_deleted列上使用any()聚合器,如果具有相同主键的行中有一个is_deleted列的值为true,则会返回true

val df_result = df_in.groupBy("notification_id").agg(
  first("el1", ignoreNulls = true).alias("el1"),
  first("el2", ignoreNulls = true).alias("el2"),
  expr("any(is_deleted)").alias("is_deleted")
)
英文:

You can group by the primary key and use an any() aggregator on the is_deleted column, which will yield true if any of the rows with the same primary key have a true value for is_deleted:

val df_result = df_in.groupBy("notification_id").agg(
  first("el1", ignoreNulls = true).alias("el1"),
  first("el2", ignoreNulls = true).alias("el2"),
  expr("any(is_deleted)").alias("is_deleted")
)

huangapple
  • 本文由 发表于 2023年2月9日 00:59:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/75389189.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定