Pandas rolling apply returning NaN in unexpected ways

Question


I have a dataframe where one of my columns contains NaNs and I'm attempting to find the rolling average of the valid numbers in the column. I'm attempting to do this with rolling apply, using a custom function and passing the rolling window raw. Alternatives I've seen are to replace the NaNs with 0, but as I understand it this would artificially deflate my mean (more entries, same sum, smaller average).

import pandas as pd
import numpy as np

mySpecialNumber = 50

def nanMean(arr):
    return arr[~np.isnan(arr)].mean()

df = pd.read_csv('myfile.csv')
df['rollingAVG'] = df['nanColumn'].rolling(mySpecialNumber).apply(nanMean, raw=True)

The above fills df['rollingAVG'] entirely with NaNs, whereas I can use:

singleAVG = nanMean(df['nanColumn'][0:mySpecialNumber].values)

and get a totally reasonable average that matches the by-hand calculation.

Hope this isn't a duplicate; I couldn't find anything else like it around, so I suspect I must be goofing. I'm also open to alternatives that still produce the results I'm looking for, but this seemed like the most straightforward way I could find.

I'm on macOS 13.4.1, running Python 3.11.4, pandas 2.0.3, and NumPy 1.25.0.

Thanks in advance.

EDIT: As suggested I have previously tried to use

df['rollingAVG'] = df['nanColumn'].rolling(mySpecialNumber).mean()

as well as

df['rollingAVG'] = df['nanColumn'].rolling(mySpecialNumber).apply(np.mean, raw=True)

but the problem persists; df['rollingAVG'] is still full of NaNs.

I can also guarantee the array is full of NaNs. I'm not just looking at the first or last 50 entries: I have set np.set_printoptions(threshold=sys.maxsize) and can verify that df['rollingAVG'].values is a ~1300-entry array containing nothing but NaNs.

Answer 1

Score: 1


You should pass a value for min_periods. See the pandas documentation for this argument of .rolling():

> min_periods: int, default None
> Minimum number of observations in window required to have a value; otherwise, result is np.nan.
> For a window that is specified by an offset, min_periods will default to 1.
> For a window that is specified by an integer, min_periods will default to the size of the window.
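
The two defaults described in the quote can be seen side by side. The following is a minimal sketch on made-up data (the series `s` and its dates are assumptions for illustration, not the asker's file): the same values are averaged once with an integer window and once with an offset window.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five daily observations with one missing value.
idx = pd.date_range("2023-01-01", periods=5, freq="D")
s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0], index=idx)

# Integer window: min_periods defaults to the window size (3 here),
# so any window holding fewer than 3 non-NaN values yields NaN.
by_count = s.rolling(3).mean()

# Offset window: min_periods defaults to 1, so NaNs are simply skipped
# as long as at least one valid value falls inside the window.
by_offset = s.rolling("3D").mean()

print(pd.DataFrame({"rolling(3)": by_count, "rolling('3D')": by_offset}))
```

With the integer window only the last row gets a value, because it is the only window with three valid observations. The offset window produces a value on every row.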

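In your case, min_periods defaults to the window size, i.e. the value of mySpecialNumber, so the result is NaN whenever any of the 50 values in the window is NaN. The same check applies to .rolling().apply(): a window with fewer than min_periods valid observations produces NaN regardless of what your function would do with it, which is why nanMean does not help here. Check this example: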

df = pd.DataFrame({'a': [None, 2, 3]})

In [12]: df.rolling(3)['a'].mean()
Out[12]: 
0   NaN
1   NaN
2   NaN
Name: a, dtype: float64

If you pass min_periods, you'll start to see outputs:

In [13]: df.rolling(3, min_periods=1)['a'].mean()
Out[13]: 
0    NaN
1    2.0
2    2.5
Name: a, dtype: float64

I would recommend setting min_periods to something that makes sense for your dataset, depending on how many values could be missing in a window.
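
Applied to the setup from the question, this could look roughly like the sketch below. The data is made up, `window = 3` stands in for mySpecialNumber, and `nan_mean` mirrors the asker's nanMean. The zero-filled variant is included only to show the deflation the question was worried about.

```python
import numpy as np
import pandas as pd

window = 3  # stands in for mySpecialNumber from the question

def nan_mean(arr):
    # Same idea as nanMean in the question: average only the non-NaN entries.
    return arr[~np.isnan(arr)].mean()

# Hypothetical column with gaps, standing in for df['nanColumn'].
df = pd.DataFrame({"nanColumn": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]})
col = df["nanColumn"]

# Default min_periods (== window): every window with fewer than `window`
# valid values gives NaN, even with the NaN-aware custom function.
strict = col.rolling(window).apply(nan_mean, raw=True)

# min_periods=1: the built-in mean already ignores NaNs within each window.
relaxed = col.rolling(window, min_periods=1).mean()

# Same result through the custom function once min_periods lets it through.
relaxed_apply = col.rolling(window, min_periods=1).apply(nan_mean, raw=True)

# For contrast: filling NaNs with 0 pulls the averages down.
zero_filled = col.fillna(0).rolling(window, min_periods=1).mean()

print(pd.DataFrame({
    "strict": strict,
    "relaxed": relaxed,
    "relaxed_apply": relaxed_apply,
    "zero_filled": zero_filled,
}))
```

Since the built-in .mean() skips NaNs on its own, the custom function is not needed once min_periods is lowered. A value larger than 1 (for example half the window) may be a better choice if averages based on only a few points would be misleading.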

huangapple
  • Posted on July 5, 2023 at 00:21:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76614424.html