合并 Spark Scala 数据框中的行并应用聚合函数。

huangapple go评论86阅读模式
英文:

Merge rows in spark scala Dataframe and apply aggregate function

问题

我有一个以下的数据框:

  1. | notification_id| el1| el2|is_deleted|
  2. +---------------+----------+----------+----------+
  3. |notificationId1|element1_1|element1_2| false|
  4. |notificationId2|element2_1|element2_2| false|
  5. |notificationId3|element3_1|element3_2| false|
  6. |notificationId1| null| null| true|
  7. |notificationId4| null| null| true|
  8. +---------------+----------+----------+----------+

在这个示例中,主键是notification_id

  • 具有is_deleted = true的行始终除了主键以外的其他列都具有空值。
  • 具有is_deleted = false的行具有唯一的主键。

我想要合并具有相同主键的行,以获得具有合并is_deleted列的数据框:

  1. | notification_id| el1| el2|is_deleted|
  2. +---------------+----------+----------+----------+
  3. |notificationId1|element1_1|element1_2| true|
  4. |notificationId2|element2_1|element2_2| false|
  5. |notificationId3|element3_1|element3_2| false|
  6. |notificationId4| null| null| true|
  7. +---------------+----------+----------+----------+
英文:

I have a following Dataframe:

  1. | notification_id| el1| el2|is_deleted|
  2. +---------------+----------+----------+----------+
  3. |notificationId1|element1_1|element1_2| false|
  4. |notificationId2|element2_1|element2_2| false|
  5. |notificationId3|element3_1|element3_2| false|
  6. |notificationId1| null| null| true|
  7. |notificationId4| null| null| true|
  8. +---------------+----------+----------+----------+

The primary key in this example is notification_id.

  • The rows that have is_deleted = true, always have null values for other column except primary key.
  • The rows with is_deleted = false have a unique primary key.

I would like to merge the rows with the same primary key in order to obtain dataframe with merged is_delete column:

  1. | notification_id| el1| el2|is_deleted|
  2. +---------------+----------+----------+----------+
  3. |notificationId1|element1_1|element1_2| true|
  4. |notificationId2|element2_1|element2_2| false|
  5. |notificationId3|element3_1|element3_2| false|
  6. |notificationId4| null| null| true|
  7. +---------------+----------+----------+----------+

答案1

得分: 0

你可以按主键分组,然后在is_deleted列上使用any()聚合器,如果具有相同主键的行中有一个is_deleted列的值为true,则会返回true

  1. val df_result = df_in.groupBy("notification_id").agg(
  2. first("el1", ignoreNulls = true).alias("el1"),
  3. first("el2", ignoreNulls = true).alias("el2"),
  4. expr("any(is_deleted)").alias("is_deleted")
  5. )
英文:

You can group by the primary key and use an any() aggregator on the is_deleted column, which will yield true if any of the rows with the same primary key have a true value for is_deleted:

  1. val df_result = df_in.groupBy("notification_id").agg(
  2. first("el1", ignoreNulls = true).alias("el1"),
  3. first("el2", ignoreNulls = true).alias("el2"),
  4. expr("any(is_deleted)").alias("is_deleted")
  5. )

huangapple
  • 本文由 发表于 2023年2月9日 00:59:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/75389189.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定