Is it possible to perform a group by taking in all the fields in aggregate?

# Question
I am on Apache Spark 3.3.2. Here is some sample code:

```scala
val df: Dataset[Row] = ???
df
  .groupBy($"someKey")
  .agg(collect_set(???)) // I want to collect all the columns here, including the key.
```

As mentioned in the comment, I want to collect all the columns without having to specify each one again. Is there a way to do this?
# Answer 1
**Score**: 1
You can use `df.columns` to access the list of columns of your DataFrame, then process it to generate the list of aggregations you want:

```python
# Let's say you want to group by "someKey" and collect the values
# of all the other columns.
from pyspark.sql import functions as F

result = df \
    .groupBy("someKey") \
    .agg(*[F.collect_set(c).alias(c) for c in df.columns if c != "someKey"])
```
NB: you may remove the `if c != "someKey"` filter if you want to collect the `someKey` column as well.
In Scala, the `agg` function signature is as follows:

> `def agg(expr: Column, exprs: Column*): DataFrame`
Therefore, we cannot unpack a list directly, but there is a simple workaround:

```scala
val aggs = df.columns
  .filter(_ != "someKey") // optional: exclude the grouping key from the aggregations
  .map(c => collect_set(c) as c)

val result = df.groupBy("someKey").agg(aggs.head, aggs.tail: _*)
```
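For illustration, here is a minimal, self-contained sketch of the same head/tail trick on a made-up DataFrame; the column names and the local `spark` session are assumptions, not part of the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("agg-all-columns").getOrCreate()

// Hypothetical input: one key column plus two "other" columns.
val df = spark.createDataFrame(Seq(
  ("a", 1, "x"),
  ("a", 2, "y"),
  ("b", 3, "z")
)).toDF("someKey", "num", "label")

// One collect_set per non-key column.
val aggs = df.columns
  .filter(_ != "someKey")
  .map(c => collect_set(c).as(c))

// agg takes a first Column plus varargs, so pass the head and splat the tail.
val result = df.groupBy("someKey").agg(aggs.head, aggs.tail: _*)
result.show(false)
// Expected shape (element order inside each set is not guaranteed):
// someKey | num    | label
// a       | [1, 2] | [x, y]
// b       | [3]    | [z]
```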
# Answer 2
**Score**: 1
If your intention is to aggregate all elements that match the same key into a list of JSON objects, you can do something like:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for the $"..." column syntax

val df = spark.sqlContext.createDataFrame(Seq(
  ("steak", "1990-01-01", "2022-03-30", 150),
  ("steak", "2000-01-02", "2021-01-13", 180),
  ("fish", "1990-01-01", "2001-02-01", 100)
)).toDF("key", "startDate", "endDate", "price")

df.show()

df
  .groupBy("key")
  .agg(collect_set(struct($"*")).as("value"))
  .show(false)
```
Output:

```
+-----+----------+----------+-----+
|  key| startDate|   endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2022-03-30|  150|
|steak|2000-01-02|2021-01-13|  180|
| fish|1990-01-01|2001-02-01|  100|
+-----+----------+----------+-----+

+-----+----------------------------------------------------------------------------+
|key  |value                                                                       |
+-----+----------------------------------------------------------------------------+
|steak|[{steak, 1990-01-01, 2022-03-30, 150}, {steak, 2000-01-02, 2021-01-13, 180}]|
|fish |[{fish, 1990-01-01, 2001-02-01, 100}]                                       |
+-----+----------------------------------------------------------------------------+
```
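Note that the collected `value` column holds an array of structs rather than actual JSON text. As a small follow-up sketch (reusing the `df` and imports defined above), you could wrap the aggregation in `to_json` if you need real JSON strings:

```scala
// Turn the collected structs into one JSON array string per key.
df
  .groupBy("key")
  .agg(to_json(collect_set(struct($"*"))).as("value"))
  .show(false)
// value is now a string such as
// [{"key":"fish","startDate":"1990-01-01","endDate":"2001-02-01","price":100}]
```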