Merge rows in spark scala Dataframe and apply aggregate function

Question
I have a following Dataframe:
+---------------+----------+----------+----------+
|notification_id|       el1|       el2|is_deleted|
+---------------+----------+----------+----------+
|notificationId1|element1_1|element1_2|     false|
|notificationId2|element2_1|element2_2|     false|
|notificationId3|element3_1|element3_2|     false|
|notificationId1|      null|      null|      true|
|notificationId4|      null|      null|      true|
+---------------+----------+----------+----------+
The primary key in this example is notification_id.
- Rows with is_deleted = true always have null values in every column other than the primary key.
- Rows with is_deleted = false have a unique primary key.
I would like to merge rows that share the same primary key, so that the resulting dataframe has a merged is_deleted column:
+---------------+----------+----------+----------+
|notification_id|       el1|       el2|is_deleted|
+---------------+----------+----------+----------+
|notificationId1|element1_1|element1_2|      true|
|notificationId2|element2_1|element2_2|     false|
|notificationId3|element3_1|element3_2|     false|
|notificationId4|      null|      null|      true|
+---------------+----------+----------+----------+
Answer 1

Score: 0
You can group by the primary key and use the any() aggregator on the is_deleted column, which yields true if any of the rows sharing that primary key has is_deleted = true:
import org.apache.spark.sql.functions.{expr, first}

val df_result = df_in.groupBy("notification_id").agg(
  // take the first non-null value per key for the payload columns
  first("el1", ignoreNulls = true).alias("el1"),
  first("el2", ignoreNulls = true).alias("el2"),
  // any() is true if at least one row in the group has is_deleted = true (Spark 3.0+)
  expr("any(is_deleted)").alias("is_deleted")
)
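To make the semantics of the aggregation concrete, here is a plain-Scala sketch (no Spark required) of what the groupBy/agg computes per key: first non-null value for each payload column, plus a boolean OR of is_deleted. The Row case class and mergeByKey are illustrative names, not part of the Spark API:

```scala
// Plain-Scala model of groupBy + first(ignoreNulls = true) + any(is_deleted).
case class Row(id: String, el1: Option[String], el2: Option[String], isDeleted: Boolean)

def mergeByKey(rows: Seq[Row]): Map[String, Row] =
  rows.groupBy(_.id).map { case (id, group) =>
    id -> Row(
      id,
      group.flatMap(_.el1).headOption, // first non-null el1
      group.flatMap(_.el2).headOption, // first non-null el2
      group.exists(_.isDeleted)        // any(is_deleted)
    )
  }
```

Note that unlike this deterministic sketch, Spark's first() picks a non-null value in an unspecified row order; here that is harmless because each key has at most one row with non-null payload columns.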