Is it possible to perform a group by taking in all the fields in aggregate?


# Question

I am on Apache Spark 3.3.2. Here is a code sample:

```scala
val df: Dataset[Row] = ???

df
  .groupBy($"someKey")
  .agg(collect_set(???)) // I want to collect all the columns here, including the key.
```

As mentioned in the comment, I want to collect all the columns without having to specify each column again. Is there a way to do this?


# Answer 1

**Score**: 1


You can use `df.columns` to access the list of columns of your dataframe. Then you can process it to generate the list of aggregations you want:

```python
# let's say that you want to group by "someKey", and collect the values
# of all the other columns
from pyspark.sql import functions as F

result = df \
    .groupBy("someKey") \
    .agg(*[F.collect_set(c).alias(c) for c in df.columns if c != "someKey"])
```

NB: you may remove `if c != "someKey"` in case you want to collect the `someKey` column as well.

In Scala, the `agg` function signature is as follows:
> def agg(expr: Column, exprs: Column*): DataFrame

Therefore, we cannot unpack a list directly but there is a simple workaround:

```scala
val aggs = df.columns
  .filter(_ != "someKey") // optional: drop this line to also collect the key column
  .map(c => collect_set(c) as c)

val result = df.groupBy("someKey").agg(aggs.head, aggs.tail: _*)
```

Note that the filter on the column name has to happen before the map; after the map the elements are `Column` objects, so comparing them to the string `"someKey"` would never filter anything out.
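
For completeness, here is a minimal, self-contained sketch of the Scala workaround in action. The sample data and the `value1`/`value2` column names are invented for illustration, and the element order inside each collected set is not guaranteed:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming a SparkSession named `spark`, e.g. in spark-shell

// Hypothetical sample data; only "someKey" comes from the question.
val df = Seq(
  ("a", 1, "x"),
  ("a", 2, "y"),
  ("b", 3, "z")
).toDF("someKey", "value1", "value2")

// One collect_set aggregation per non-key column.
val aggs = df.columns
  .filter(_ != "someKey")
  .map(c => collect_set(c) as c)

val result = df.groupBy("someKey").agg(aggs.head, aggs.tail: _*)
result.show()
// someKey "a" ends up with value1 = [1, 2] and value2 = [x, y],
// someKey "b" with value1 = [3] and value2 = [z] (set order may vary).
```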




# Answer 2
**Score**: 1


If your intention is to aggregate all rows that match the same key into a list of JSON-like struct objects, you can do something like:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ syntax, assuming a SparkSession named `spark`

val df = spark.sqlContext.createDataFrame(Seq(
  ("steak", "1990-01-01", "2022-03-30", 150),
  ("steak", "2000-01-02", "2021-01-13", 180),
  ("fish",  "1990-01-01", "2001-02-01", 100)
)).toDF("key", "startDate", "endDate", "price")

df.show()

df
  .groupBy("key")
  .agg(collect_set(struct($"*")).as("value"))
  .show(false)
```

Output:

+-----+----------+----------+-----+
|  key| startDate|   endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2022-03-30|  150|
|steak|2000-01-02|2021-01-13|  180|
| fish|1990-01-01|2001-02-01|  100|
+-----+----------+----------+-----+

+-----+----------------------------------------------------------------------------+
|key  |value                                                                       |
+-----+----------------------------------------------------------------------------+
|steak|[{steak, 1990-01-01, 2022-03-30, 150}, {steak, 2000-01-02, 2021-01-13, 180}]|
|fish |[{fish, 1990-01-01, 2001-02-01, 100}]                                       |
+-----+----------------------------------------------------------------------------+
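
As a hedged variation (not part of the original answer): if you want actual JSON strings inside the collected array rather than struct values, you could serialize each row with `to_json` before collecting:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ syntax, assuming a SparkSession named `spark`

// Same grouping as above, but each row is turned into a JSON string first.
df
  .groupBy("key")
  .agg(collect_set(to_json(struct($"*"))).as("value"))
  .show(false)
```

Each element of `value` is then a JSON string, roughly `{"key":"steak","startDate":"1990-01-01","endDate":"2022-03-30","price":150}`.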
