2020-01-04 12:45:09

Resampling dataframe is producing unexpected results

Question


Long question short, what is an appropriate resampling freq/rule? Sometimes I get a dataframe mostly filled with NaNs, sometimes it works great. I thought I had a handle on it.

Below is an example,

I am processing a lot of data and was changing my resample frequency, and I noticed that, for some reason, certain resample rules leave only an occasional row with values while the rest of the rows are all NaN.

For example,

import numpy as np
import pandas as pd
df = pd.DataFrame()
df['date'] = pd.date_range(start='1/1/2018', end='5/08/2018')

Creating some example data,

df['data1'] = np.random.randint(1, 10, df.shape[0])
df['data2'] = np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))
df = df.set_index('date')  # resample() needs a datetime index; the output below shows 'date' as the index

The data looks like,

print(df.head())
print(df.shape)
            data1  data2  data3
date                           
2018-01-01      7      7      0
2018-01-02      8      8      1
2018-01-03      2      7      2
2018-01-04      2      2      3
2018-01-05      2      5      4
(128, 3)

When I resample the data using offset aliases, I get unexpected results.

Below I resample the data every 3 minutes.

resampled = df.resample('3T').mean()  # '3T' means 3 minutes; newer pandas prefers '3min'
print(resampled.head())
print(resampled.shape)
                     data1  data2  data3
date                                    
2018-01-01 00:00:00    4.0    5.0    0.0
2018-01-01 00:03:00    NaN    NaN    NaN
2018-01-01 00:06:00    NaN    NaN    NaN
2018-01-01 00:09:00    NaN    NaN    NaN
2018-01-01 00:12:00    NaN    NaN    NaN

Most of the rows are filled with NaN besides the first. I believe this is because the index has no entries that match my resampling rule. Is this correct? '24H' is the smallest interval in this data, and any rule finer than that leaves rows of NaN.

Can a dataframe be resampled for increments less than the datetime resolution?

I have had trouble in the past trying to resample a large dataset that spanned over a year, with the datetime index formatted as %Y:%j:%H:%M:%S (year:day of year:hour:minute:second; the format string is approximate). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.

Answer 1

Score: 2


When you resample to a lower frequency (downsample), one possible way to compute the result is simply mean(). It actually means:

The source DataFrame contains data that is too detailed.
You want to change the sampling frequency to a lower one and compute, e.g., the mean of each column over the source rows that fall within the current sampling period.
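
As a rough illustration of the downsampling case, here is a minimal sketch. It assumes a daily index covering the same date range as the question; the names `daily` and `weekly` and the '7D' rule are only chosen for this example.

```python
import numpy as np
import pandas as pd

# Daily index covering the question's date range (128 days).
idx = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
daily = pd.DataFrame({"data3": np.arange(len(idx))}, index=idx)

# Downsample: each 7-day bin averages the daily rows that fall inside it.
weekly = daily.resample("7D").mean()

print(weekly.head())
# Every bin holds at least one source row, so no NaN appears.
print(weekly.isna().sum().sum())
```

Because every 7-day bin contains source rows, mean() always has something to average, which is why downsampling "works great".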

But when you increase the sampling frequency (upsample), then:

Your source data is too coarse.
You want to change the frequency to a higher one.
One possible way to compute the result is, e.g., to interpolate between known source values.

Note that when you upsample daily data to a 3-minute frequency:

The first row will contain data between 2018-01-01 00:00:00 and 2018-01-01 00:03:00.
The next row will contain data between 2018-01-01 00:03:00 and 2018-01-01 00:06:00.
And so on.

So, based on your source data:

The first row contains the data from 2018-01-01 (exactly at midnight).
Since no source data is available for the time range between 00:03:00 and 00:06:00 (on 2018-01-01), the second row contains just NaN values.
The same applies to the following rows, up to 2018-01-01 23:57:00 (no source data for these time slices).
The next row, for 2018-01-02 00:00:00, can be filled with source data.
And so on.
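
To check this against the example data, the following sketch counts how many 3-minute bins actually receive a source row. It assumes the same daily index as the question; `daily`, `upsampled` and `filled` are illustrative names, and '3min' is used as the equivalent of the question's '3T'.

```python
import numpy as np
import pandas as pd

# Daily index covering the question's date range (128 days).
idx = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
daily = pd.DataFrame({"data3": np.arange(len(idx))}, index=idx)

# Upsample to 3-minute bins; mean() of an empty bin is NaN.
upsampled = daily.resample("3min").mean()

# Keep only the bins that received a source row.
filled = upsampled.dropna()

print(len(upsampled), len(filled))
# The surviving rows are the midnight bins, one per source day.
print((filled.index == daily.index).all())
```

Only one bin per day (the one starting at midnight) has data; the other 479 bins of each day are empty.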

There is nothing strange in this behaviour; resample simply works this way. Since you are actually upsampling the source data, maybe you should interpolate the missing values?
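
A minimal sketch of two ways to fill the gaps after upsampling, again assuming the question's daily index. Time-based interpolation and forward fill are just two options, not the only ones, and which one is appropriate depends on what the data represents.

```python
import numpy as np
import pandas as pd

# Daily index covering the question's date range (128 days).
idx = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
daily = pd.DataFrame({"data3": np.arange(len(idx))}, index=idx)

# Option 1: upsample, then interpolate between the known daily values.
interpolated = daily.resample("3min").mean().interpolate(method="time")

# Option 2: upsample and carry the last known daily value forward.
held = daily.resample("3min").ffill()

noon = pd.Timestamp("2018-01-01 12:00")
print(interpolated.loc[noon, "data3"])  # halfway between day 0 and day 1
print(held.loc[noon, "data3"])          # still the value from midnight
```

Interpolation suits quantities that change smoothly between samples, while forward fill suits values that stay constant until the next observation.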
