Resampling dataframe is producing unexpected results
Question
Long question short, what is an appropriate resampling freq/rule? Sometimes I get a dataframe mostly filled with NaNs, sometimes it works great. I thought I had a handle on it.
Below is an example,
I am processing a lot of data and was changing my resample frequency, and noticed that for some reason certain resample rules leave only one row with values, while the rest of the rows contain only NaNs.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['date'] = pd.date_range(start='1/1/2018', end='5/08/2018')
Creating some example data,
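df['data1'] = np.random.randint(1, 10, df.shape[0])
df['data2'] = np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))
df = df.set_index('date')  # resample() needs a DatetimeIndex; the output below shows 'date' as the index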
The data looks like,
print(df.head())
print(df.shape)
            data1  data2  data3
date
2018-01-01      7      7      0
2018-01-02      8      8      1
2018-01-03      2      7      2
2018-01-04      2      2      3
2018-01-05      2      5      4
(128, 3)
When I resample the data using offset aliases I get unexpected results.
Below I resample the data every 3 minutes.
resampled = df.resample('3min').mean()  # originally '3T', an alias deprecated in recent pandas
print(resampled.head())
print(resampled.shape)
                     data1  data2  data3
date
2018-01-01 00:00:00    4.0    5.0    0.0
2018-01-01 00:03:00    NaN    NaN    NaN
2018-01-01 00:06:00    NaN    NaN    NaN
2018-01-01 00:09:00    NaN    NaN    NaN
2018-01-01 00:12:00    NaN    NaN    NaN
Most of the rows are filled with NaN besides the first. I believe this is because the source index has no entries at the times my resampling rule generates. Is this correct? '24H' is the smallest interval for this data, and anything smaller leaves NaN rows.
Can a dataframe be resampled for increments less than the datetime resolution?
I have had trouble in the past trying to resample a large dataset that spanned over a year, with the datetime index formatted as %Y:%j:%H:%M:%S (year:day of year:hour:minute:second; close enough without being verbose). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.
Answer 1
Score: 2
When you resample to a lower frequency (downsample), one
possible way to compute the result is simply mean().
It actually means:
- The source DataFrame contains data that is too detailed.
- You want to change the sampling frequency to a lower one and
compute, e.g., the mean of each column over the source rows
that fall in the current sampling period.
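As a rough sketch of the downsampling case, assuming daily data shaped like the question's (only the `data3` column is reproduced, and the 7-day rule is my own choice for illustration):

```python
import numpy as np
import pandas as pd

# Daily data with a DatetimeIndex, mirroring the question's date range.
index = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
df = pd.DataFrame({"data3": np.arange(len(index))}, index=index)

# Downsample: each 7-day bin averages the daily rows that fall inside it.
weekly = df.resample("7D").mean()

print(weekly.head())
print(weekly.isna().sum())  # every bin holds at least one source row, so no NaNs
```

Every bin here contains source rows, which is why mean() has something to work with in each output row.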
But when you increase the sampling frequency (upsample):
- Your source data are too coarse.
- You want to change the frequency to a higher one.
- One possible way to compute the result is, e.g., to
interpolate between known source values.
Note that when you upsample daily data to a 3-minute frequency:
- The first row will contain data between 2018-01-01 00:00:00 and
2018-01-01 00:03:00.
- The next row will contain data between 2018-01-01 00:03:00 and
2018-01-01 00:06:00.
- And so on.
So, based on your source data:
- The first row contains data from 2018-01-01 (sharp on midnight).
- Since no source data is available for the time range between
00:03:00 and 00:06:00 (on 2018-01-01), the second row contains
just NaN values.
- The same pertains to further rows, up to 2018-01-01 23:57:00
(no source data for these time slices).
- The next row, for 2018-01-02 00:00:00, can be filled with source data.
- And so on.
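One way to check this on data like yours is to count which upsampled rows actually hold values. This is only a sketch: it rebuilds the `data3` column from the question and uses the `'3min'` spelling of the 3-minute rule.

```python
import numpy as np
import pandas as pd

# Daily data with a DatetimeIndex, mirroring the question's date range.
index = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
df = pd.DataFrame({"data3": np.arange(len(index))}, index=index)

# Upsample to 3-minute bins; a bin with no source row yields NaN.
upsampled = df.resample("3min").mean()

# Keep only the bins that received a source row.
filled = upsampled.dropna()

print(len(upsampled))                 # 480 bins per day, plus the final midnight
print(len(filled))                    # one filled row per source day
print(filled.index.equals(df.index))  # the filled rows sit exactly on the midnights
```

So the values are not lost, they are just spread very thinly: one filled row followed by 479 empty ones for each day.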
There is nothing strange in this behaviour. Resample works just this way.
As you actually upsample the source data, maybe you should interpolate
the missing values?
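A minimal sketch of what that could look like, again assuming daily data like the question's `data3` column. Whether linear interpolation or a forward fill is appropriate depends on what your values represent, so treat both as options rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Daily data with a DatetimeIndex, mirroring the question's date range.
index = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
df = pd.DataFrame({"data3": np.arange(len(index))}, index=index)

# Option 1: linear interpolation between the known daily values.
interpolated = df.resample("3min").interpolate()

# Option 2: carry the last known daily value forward until the next one.
forward_filled = df.resample("3min").ffill()

noon = pd.Timestamp("2018-01-01 12:00")
print(interpolated.loc[noon, "data3"])    # halfway between day 0 and day 1
print(forward_filled.loc[noon, "data3"])  # still the value from midnight
```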