Is it possible to perform a group by taking in all the fields in aggregate?


# Question

I am on Apache Spark 3.3.2. Here is a code sample:

```scala
val df: Dataset[Row] = ???

df
  .groupBy($"someKey")
  .agg(collect_set(???)) // I want to collect all the columns here, including the key.
```

As mentioned in the comment, I want to collect all the columns without having to specify each column again. Is there a way to do this?


# Answer 1

**Score**: 1


You can use `df.columns` to access the list of columns of your dataframe. Then you can process it to generate the list of aggregations you want:

```python
# let's say that you want to group by "someKey", and collect the values
# of all the other columns
from pyspark.sql import functions as F

result = df \
    .groupBy("someKey") \
    .agg(*[F.collect_set(c).alias(c) for c in df.columns if c != "someKey"])
```

NB: you may remove `if c != "someKey"` in case you want to collect the `someKey` column as well.

In Scala, the `agg` function signature is as follows:
> def agg(expr: Column, exprs: Column*): DataFrame

Therefore, we cannot unpack a list directly but there is a simple workaround:

```scala
val aggs = df.columns
  .filter(_ != "someKey") // optional: drop this line to also collect the key column
  .map(c => collect_set(c) as c)

val result = df.groupBy("someKey").agg(aggs.head, aggs.tail: _*)
```

Note that the filter on the column name has to happen before the map; after the map the elements are `Column` objects, so comparing them to the string `"someKey"` would never filter anything out.
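
For completeness, here is a minimal, self-contained sketch of the Scala workaround in action. The sample data and the `value1`/`value2` column names are invented for illustration, and the element order inside each collected set is not guaranteed:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming a SparkSession named `spark`, e.g. in spark-shell

// Hypothetical sample data; only "someKey" comes from the question.
val df = Seq(
  ("a", 1, "x"),
  ("a", 2, "y"),
  ("b", 3, "z")
).toDF("someKey", "value1", "value2")

// One collect_set aggregation per non-key column.
val aggs = df.columns
  .filter(_ != "someKey")
  .map(c => collect_set(c) as c)

val result = df.groupBy("someKey").agg(aggs.head, aggs.tail: _*)
result.show()
// someKey "a" ends up with value1 = [1, 2] and value2 = [x, y],
// someKey "b" with value1 = [3] and value2 = [z] (set order may vary).
```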




# Answer 2
**Score**: 1


If your intention is to aggregate all rows that match the same key into a list of JSON-like struct objects, you can do something like:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ syntax, assuming a SparkSession named `spark`

val df = spark.sqlContext.createDataFrame(Seq(
  ("steak", "1990-01-01", "2022-03-30", 150),
  ("steak", "2000-01-02", "2021-01-13", 180),
  ("fish",  "1990-01-01", "2001-02-01", 100)
)).toDF("key", "startDate", "endDate", "price")

df.show()

df
  .groupBy("key")
  .agg(collect_set(struct($"*")).as("value"))
  .show(false)
```

Output:

+-----+----------+----------+-----+
|  key| startDate|   endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2022-03-30|  150|
|steak|2000-01-02|2021-01-13|  180|
| fish|1990-01-01|2001-02-01|  100|
+-----+----------+----------+-----+

+-----+----------------------------------------------------------------------------+
|key  |value                                                                       |
+-----+----------------------------------------------------------------------------+
|steak|[{steak, 1990-01-01, 2022-03-30, 150}, {steak, 2000-01-02, 2021-01-13, 180}]|
|fish |[{fish, 1990-01-01, 2001-02-01, 100}]                                       |
+-----+----------------------------------------------------------------------------+
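
As a hedged variation (not part of the original answer): if you want actual JSON strings inside the collected array rather than struct values, you could serialize each row with `to_json` before collecting:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ syntax, assuming a SparkSession named `spark`

// Same grouping as above, but each row is turned into a JSON string first.
df
  .groupBy("key")
  .agg(collect_set(to_json(struct($"*"))).as("value"))
  .show(false)
```

Each element of `value` is then a JSON string, roughly `{"key":"steak","startDate":"1990-01-01","endDate":"2022-03-30","price":150}`.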
