# Groupby, Window and rolling average in Spark
# Question

I want to do a groupby and a rolling average on a huge dataset using PySpark. Since I'm not used to PySpark, I struggle to see my mistake here. Why doesn't this work?
```python
import pandas as pd
from pyspark.sql import Window
from pyspark.sql.functions import avg, col

data = pd.DataFrame({'group': ['A']*5 + ['B']*5,
                     'order': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                     'value': [23, 54, 65, 64, 78, 98, 78, 76, 77, 57]})
spark_df = spark.createDataFrame(data)

window_spec = Window.partitionBy("group").orderBy("order").rowsBetween(-1, 0)

# Calculate the rolling average of "value"
rolling_avg = avg(col("value")).over(window_spec).alias("value_rolling_avg")

# Group by "group" and calculate the rolling average of "value"
spark_df.groupby("group").agg(rolling_avg).show()
```

This fails with:

```
AnalysisException: [COLUMN_NOT_IN_GROUP_BY_CLAUSE] The expression "order" is neither
present in the group by, nor is it an aggregate function. Add to group by or wrap in
first() (or first_value()) if you don't care which value you get.;
```
# Answer 1

**Score**: 1
You are mixing window functions and aggregation functions: a window expression like `avg(col("value")).over(window_spec)` already produces one value per input row, while `groupby(...).agg(...)` expects expressions that reduce each group to a single row. That is why Spark complains about the `order` column referenced by the window spec.
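For contrast, here is a minimal sketch (not part of the original answer) of a plain per-group aggregation, where `avg` acts as an aggregate function with no window spec and each group collapses to one row:

```python
from pyspark.sql.functions import avg

# Aggregate: one output row per group, no window involved
spark_df.groupby("group").agg(avg("value").alias("value_avg")).show()
```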
You can simply select `rolling_avg` to get your result:
```python
spark_df.select(rolling_avg).show()
```

```
+-----------------+
|value_rolling_avg|
+-----------------+
|             23.0|
|             38.5|
|             59.5|
|             64.5|
|             71.0|
|             98.0|
|             88.0|
|             77.0|
|             76.5|
|             67.0|
+-----------------+
```
Or:
```python
spark_df.withColumn("value_rolling_avg", rolling_avg).show()
```

```
+-----+-----+-----+-----------------+
|group|order|value|value_rolling_avg|
+-----+-----+-----+-----------------+
|    A|    1|   23|             23.0|
|    A|    2|   54|             38.5|
|    A|    3|   65|             59.5|
|    A|    4|   64|             64.5|
|    A|    5|   78|             71.0|
|    B|    1|   98|             98.0|
|    B|    2|   78|             88.0|
|    B|    3|   76|             77.0|
|    B|    4|   77|             76.5|
|    B|    5|   57|             67.0|
+-----+-----+-----+-----------------+
```
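If the goal really was to combine a group-by with the rolling average (an assumption about the asker's intent, not stated in the answer), one way is to materialize the window column first and aggregate afterwards; the name `mean_rolling_avg` below is hypothetical:

```python
from pyspark.sql.functions import avg

# Step 1: compute the per-row rolling average via the window spec.
# Step 2: reduce each group to a single summary row.
(spark_df
 .withColumn("value_rolling_avg", rolling_avg)
 .groupby("group")
 .agg(avg("value_rolling_avg").alias("mean_rolling_avg"))
 .show())
```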