
Issues with mean and groupby using pyspark


I have an issue using mean together with groupBy (or a window) on a PySpark dataframe. The code below returns a dataframe called mean_df in which the mean_change1 column is NaN, NaN, NaN.

I don't understand why this is the case. Is it due to the presence of NaN in df?

import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

np.random.seed(10)

data = pd.DataFrame({'ID': list(range(1, 101)) + list(range(1, 101)) + list(range(1, 101)),
                     'date': [pd.to_datetime('2021-07-30')]*100 + [pd.to_datetime('2022-12-31')]*100 + [pd.to_datetime('2023-04-30')]*100,
                     'value1': [i for i in np.random.normal(98, 4.5, 100)] + [np.nan]*3 + [i for i in np.random.normal(100, 5, 97)] + [i for i in np.random.normal(120, 8, 95)] + [5.3, 9000, 160, -5222, 158],
                     'value2': [i for i in np.random.normal(52, 11, 100)] + [i for i in np.random.normal(50, 10, 100)] + [i for i in np.random.normal(49, 10, 100)]
                     })

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data)

# Calculate change

window = Window.partitionBy("ID").orderBy("date")
df = df.withColumn("previous_value1", F.lag("value1", 1).over(window))
df = df.withColumn("change1", df["value1"] - df["previous_value1"])

mean_df = df.groupBy("date").agg(F.mean("change1").alias("mean_change1"))

mean_df.toPandas()

Answer 1

Score: 1


You can replace the NaN values with nulls and then calculate the mean. Spark's mean/avg ignores nulls but propagates NaN, so any group containing even one NaN in change1 averages to NaN.
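Here is a minimal sketch of that difference (toy data; it reuses the spark session and the F alias from the question):

demo = spark.createDataFrame([(1.0,), (float("nan"),), (None,)], ["x"])
demo.agg(F.mean("x").alias("mean_x")).show()

# +------+
# |mean_x|
# +------+
# |   NaN|
# +------+

The NaN poisons the aggregate; the null row alone would simply have been skipped. Applying the replacement to the question's df: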

# Replace NaN with null *before* computing the lag/difference, so that
# change1 comes out null (ignored by mean) instead of NaN (propagated).
window = Window.partitionBy("ID").orderBy("date")
data1_sdf = df. \
    replace(np.nan, None). \
    withColumn("previous_value1", F.lag("value1", 1).over(window)). \
    withColumn("change1", F.col("value1") - F.col("previous_value1"))

data1_sdf. \
    groupBy("date"). \
    agg(F.sum("change1").alias("sum_change1"),
        F.count("change1").alias("cnt_change1"),
        F.mean("change1").alias("mean_change1")
        ). \
    show()

# +-------------------+------------------+-----------+------------------+
# |               date|       sum_change1|cnt_change1|      mean_change1|
# +-------------------+------------------+-----------+------------------+
# |2023-04-30 00:00:00| 5355.116077984317|         97|55.207382247260995|
# |2021-07-30 00:00:00|              null|          0|              null|
# |2022-12-31 00:00:00|164.43277052556064|         97|1.6951832012944397|
# +-------------------+------------------+-----------+------------------+
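The 2021-07-30 group is null rather than a computed value because every row on the earliest date is the first row of its ID partition: lag returns null, change1 is null for all 100 rows, and cnt_change1 is 0. Note that toPandas() renders Spark nulls in a double column as NaN, which is why the original mean_df showed NaN for all three dates.

If you would rather not rewrite the dataframe, an alternative sketch (against the question's df, whose change1 still contains NaN) is to null out the NaN values inside the aggregation itself:

# F.when without .otherwise() yields null for the NaN rows, and mean skips nulls
mean_df = df.groupBy("date").agg(
    F.mean(F.when(~F.isnan("change1"), F.col("change1"))).alias("mean_change1")
)
mean_df.toPandas()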

Posted by huangapple on 2023-05-17 22:17:21. Source: https://go.coder-hub.com/76273111.html