Resampling dataframe is producing unexpected results

Question

Long question short, what is an appropriate resampling freq/rule? Sometimes I get a dataframe mostly filled with NaNs, sometimes it works great. I thought I had a handle on it.

Below is an example,

I am processing a lot of data and, while changing my resample frequency, I noticed that for some reason certain resample rules leave only one element with a value and fill the rest with NaN.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['date'] = pd.date_range(start='1/1/2018', end='5/08/2018')

Creating some example data,

df['data1'] = np.random.randint(1, 10, df.shape[0])
df['data2'] = np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))
df = df.set_index('date')  # resample() needs a DatetimeIndex

The data looks like,

print(df.head())
print(df.shape)

            data1  data2  data3
date                           
2018-01-01      7      7      0
2018-01-02      8      8      1
2018-01-03      2      7      2
2018-01-04      2      2      3
2018-01-05      2      5      4
(128, 3)

When I resample the data using offset aliases, I get unexpected results.

Below I resample the data every 3 minutes.

resampled = df.resample('3T').mean()  # '3T' means 3 minutes; pandas 2.2+ prefers '3min'

print(resampled.head())
print(resampled.shape)

                     data1  data2  data3
date                                    
2018-01-01 00:00:00    4.0    5.0    0.0
2018-01-01 00:03:00    NaN    NaN    NaN
2018-01-01 00:06:00    NaN    NaN    NaN
2018-01-01 00:09:00    NaN    NaN    NaN
2018-01-01 00:12:00    NaN    NaN    NaN

Most of the rows are filled with NaN besides the first. I believe this is because the source index has no entries that fall inside most of the bins created by my resampling rule. Is this correct? '24H' is the smallest interval for this data, and any rule shorter than that leaves rows of NaN.

Can a dataframe be resampled for increments less than the datetime resolution?

I have had trouble in the past trying to resample a large dataset that spanned over a year, with the datetime index formatted as %Y:%j:%H:%M:%S (year:day of year:hour:minute:second, or close enough). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.

Answer 1

Score: 2

When you resample to a lower frequency (downsample), one possible
way to compute the result is simply mean().
It actually means:

  • The source DataFrame contains more detail than you need.
  • You want to change the sampling frequency to a lower one and
    compute e.g. the mean of each column over the source rows
    that fall into the current sampling period.
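As a rough illustration of the downsampling case, the sketch below aggregates daily rows into 7-day bins. The frame is hypothetical: it only mirrors the date range and the data3 column from the question, and the '7D' rule is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering the same date range as the question.
idx = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
df = pd.DataFrame({"data3": np.arange(len(idx))}, index=idx)

# Downsample: every 7-day bin holds up to seven daily rows,
# so mean() always has something to aggregate.
weekly = df.resample("7D").mean()

print(weekly.head())
print(weekly.shape)
```

Because every bin contains at least one source row, no NaN appears in the result.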

But when you increase the sampling frequency (upsample), then:

  • Your source data are too coarse.
  • You want to change the frequency to a higher one.
  • One possible way to compute the result is e.g. to
    interpolate between known source values.
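A minimal sketch of the upsampling case, assuming a small hypothetical three-day frame. It contrasts upsampling with no fill rule (asfreq) against one simple fill rule (ffill); forward filling is just one option here, not necessarily the right one for your data.

```python
import pandas as pd

# Hypothetical daily data: three known values, one per day.
idx = pd.date_range("2018-01-01", periods=3, freq="D")
daily = pd.DataFrame({"data3": [0.0, 1.0, 2.0]}, index=idx)

# Upsample without a fill rule: the known values land on the new
# 3-minute grid and every other slot is NaN.
raw = daily.resample("3min").asfreq()

# Upsample with forward fill: each slot repeats the last known value.
filled = daily.resample("3min").ffill()

print(len(raw), int(raw["data3"].notna().sum()))
print(filled.loc[pd.Timestamp("2018-01-01 12:00"), "data3"])
```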

Note that when you upsample daily data to 3-minute frequency then:

  • The first row will contain data from 2018-01-01 00:00:00 up to,
    but not including, 2018-01-01 00:03:00.
  • The next row will contain data from 2018-01-01 00:03:00 up to,
    but not including, 2018-01-01 00:06:00.
  • And so on.

So, based on your source data:

  • The first row contains data from 2018-01-01 (exactly at midnight).
  • Since no source data is available for the time range between
    00:03:00 and 00:06:00 (on 2018-01-01), the second row contains
    just NaN values.
  • The same applies to the following rows, up to 2018-01-01 23:57:00
    (no source data for these time slices).
  • The next row, for 2018-01-02 00:00:00, can be filled with source data.
  • And so on.
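The pattern described in this list can be checked by counting how many 3-minute bins actually receive a source row. This is a sketch on hypothetical data that mirrors the question's date range; '3min' is used as the equivalent of the '3T' alias.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering the same date range as the question.
idx = pd.date_range(start="2018-01-01", end="2018-05-08", freq="D")
df = pd.DataFrame({"data3": np.arange(len(idx))}, index=idx)

resampled = df.resample("3min").mean()

total_rows = len(resampled)                              # every 3-minute bin
rows_with_data = int(resampled["data3"].notna().sum())   # bins holding a source row
print(total_rows, rows_with_data)

# Dropping the empty bins leaves just the original midnight timestamps.
populated = resampled.dropna()
print(populated.head())
```

Only the bins that start at midnight hold a value, so the number of populated rows equals the number of source rows.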

There is nothing strange in this behaviour. Resample works just this way.
As you actually upsample the source data, maybe you should interpolate
the missing values?
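One way to do that, sketched on a small hypothetical frame, is to call interpolate() on the resampler. Linear interpolation is an assumption made for this example; whether it suits the data depends on what the values represent.

```python
import pandas as pd

# Hypothetical daily data: three known values, one per day.
idx = pd.date_range("2018-01-01", periods=3, freq="D")
daily = pd.DataFrame({"data3": [0.0, 1.0, 2.0]}, index=idx)

# Upsample to a 3-minute grid and fill the gaps linearly
# between the known daily values.
interpolated = daily.resample("3min").interpolate(method="linear")

print(interpolated.head())
print(interpolated.loc[pd.Timestamp("2018-01-01 12:00"), "data3"])
```

At noon on the first day the interpolated value sits halfway between the two surrounding daily values.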

huangapple
  • Posted on January 4, 2020, 12:45:09
  • Article link: https://go.coder-hub.com/59587963.html