重采样数据框以添加缺失的日期。

huangapple go评论75阅读模式
英文:

Resample dataframe to add missing dates

问题

我有一个包含多个字符串列、一个日期列和一个整数值列的数据框。

我想要为每组文本列“ffill”缺失的日期。缺失的日期是数据框中最小日期到最大日期的所有日期。我认为使用示例更容易理解。

示例输入:

group        rtype     location   hardware     date          value
my-group     type-s     NY         DTop        2020-08-05     10
my-group     type-s     NY         DTop        2020-08-07     20
my-group     type-s     NY         DTop        2020-08-10     30
my-group     type-s     NY         Tower       2020-08-01     40
my-group     type-s     NY         Tower       2020-08-07     50
ot-group     type-t     NY         LTop        2020-08-08     90

此数据框的最小和最大日期:(起始日期)2020-08-01 -(结束日期)2020-08-10

示例输出:

group       rtype       location   hardware    date          value
my-group    type-s        NY          DTop     2020-08-01    0
my-group    type-s        NY          DTop     2020-08-02    0
my-group    type-s        NY          DTop     2020-08-03    0
my-group    type-s        NY          DTop     2020-08-04    0
my-group    type-s        NY          DTop     2020-08-05    10
my-group    type-s        NY          DTop     2020-08-06    10
my-group    type-s        NY          DTop     2020-08-07    20
my-group    type-s        NY          DTop     2020-08-08    20
my-group    type-s        NY          DTop     2020-08-09    20
my-group    type-s        NY          DTop     2020-08-10    30
my-group    type-s        NY          Tower    2020-08-01    40
my-group    type-s        NY          Tower    2020-08-02    40
my-group    type-s        NY          Tower    2020-08-03    40
my-group    type-s        NY          Tower    2020-08-04    40
my-group    type-s        NY          Tower    2020-08-05    40
my-group    type-s        NY          Tower    2020-08-06    40
my-group    type-s        NY          Tower    2020-08-07    50
my-group    type-s        NY          Tower    2020-08-08    50
my-group    type-s        NY          Tower    2020-08-09    50
my-group    type-s        NY          Tower    2020-08-10    50
ot-group    type-t        NY          LTop     2020-08-01    0
ot-group    type-t        NY          LTop     2020-08-02    0
ot-group    type-t        NY          LTop     2020-08-03    0
ot-group    type-t        NY          LTop     2020-08-04    0
ot-group    type-t        NY          LTop     2020-08-05    0
ot-group    type-t        NY          LTop     2020-08-06    0
ot-group    type-t        NY          LTop     2020-08-07    0
ot-group    type-t        NY          LTop     2020-08-08    90
ot-group    type-t        NY          LTop     2020-08-09    90
ot-group    type-t        NY          LTop     2020-08-10    90

在这个示例中,我将位置保持不变,以避免额外的长输出。

我可以使用pd.date_range()来获取所需的日期。

我尝试使用多索引的resample,但遇到了错误(类似于此问题)。

我尝试了此答案中提到的方法,但似乎不起作用。

以下是我使用的代码:

import pandas as pd

df = pd.read_csv('data.csv')
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):    
    return df.reindex(date_range).ffill()

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
print(df.to_string())

此代码的输出:

10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
               '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
               '2020-08-09', '2020-08-10'],
              dtype='datetime64[ns]', freq='D')
            group  rtype  location  hardware  value
2020-08-01    NaN    NaN       NaN       NaN    NaN
2020-08-02    NaN    NaN       NaN       NaN    NaN
2020-08-03    NaN    NaN       NaN       NaN    NaN
2020-08-04    NaN    NaN       NaN       NaN    NaN
2020-08-05    NaN    NaN       NaN       NaN    NaN
2020-08-06    NaN    NaN       NaN       NaN    NaN
2020-08-07    NaN    NaN       NaN       NaN    NaN
2020-08-08    NaN    NaN       NaN       NaN    NaN
2020-08-09    NaN    NaN       NaN       NaN    NaN
2020-08-10    NaN    NaN       NaN       NaN    NaN
2020-08-01    NaN    NaN       NaN       NaN    NaN
2020-08-02    NaN    NaN       NaN       NaN    NaN
2020-08-03    NaN    NaN       NaN       NaN    NaN
2020-08-04    NaN    NaN      

<details>
<summary>英文:</summary>

I have a dataframe with multiple string columns, one date column and one int value column.

I want to `ffill` the missing dates for each group of text columns. The missing dates are all dates from the min date to max date in the dataframe. I think this is better explained using an example.

Sample Input:

group rtype location hardware date value
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-07 50
ot-group type-t NY LTop 2020-08-08 90

Min and Max date for this dataframe: (start_date) 2020-08-01 - (end_date) 2020-08-10

Sample Output:

group rtype location hardware date value
my-group type-s NY DTop 2020-08-01 0
my-group type-s NY DTop 2020-08-02 0
my-group type-s NY DTop 2020-08-03 0
my-group type-s NY DTop 2020-08-04 0
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-06 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-08 20
my-group type-s NY DTop 2020-08-09 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-02 40
my-group type-s NY Tower 2020-08-03 40
my-group type-s NY Tower 2020-08-04 40
my-group type-s NY Tower 2020-08-05 40
my-group type-s NY Tower 2020-08-06 40
my-group type-s NY Tower 2020-08-07 50
my-group type-s NY Tower 2020-08-08 50
my-group type-s NY Tower 2020-08-09 50
my-group type-s NY Tower 2020-08-10 50
ot-group type-t NY LTop 2020-08-01 0
ot-group type-t NY LTop 2020-08-02 0
ot-group type-t NY LTop 2020-08-03 0
ot-group type-t NY LTop 2020-08-04 0
ot-group type-t NY LTop 2020-08-05 0
ot-group type-t NY LTop 2020-08-06 0
ot-group type-t NY LTop 2020-08-07 0
ot-group type-t NY LTop 2020-08-08 90
ot-group type-t NY LTop 2020-08-09 90
ot-group type-t NY LTop 2020-08-10 90


In this example, I kept the location fixed to avoid an extra long output.
I am able to get the dates I want using `pd.date_range()`.

I tried using `resample` with multiindex but I run into errors (similar to [this](https://stackoverflow.com/questions/15799162/resampling-within-a-pandas-multiindex)).

I tried the approach mentioned in [this answer](https://stackoverflow.com/a/32275705/1872234) but it doesn&#39;t seem to work:

My code using:

import pandas as pd

df = pd.read_csv('data.csv')
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):
return df.reindex(date_range).ffill()

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
print(df.to_string())


Output of this code:

10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
'2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
'2020-08-09', '2020-08-10'],
dtype='datetime64[ns]', freq='D')
group rtype location hardware value
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN


Can someone help please?

**EDIT:**
After fixing the DatetimeIndex issue, and using `fillna(0)`:

df = pd.read_csv('data.csv', parse_dates=['date'])
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):
return df.reindex(date_range).ffill().fillna(0)

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True).reset_index().rename(columns={'index': 'date'})

print(df.to_string())


Output:
           group   rtype location hardware  value

2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 my-group type-s NY DTop 10.0
2020-08-06 my-group type-s NY DTop 10.0
2020-08-07 my-group type-s NY DTop 20.0
2020-08-08 my-group type-s NY DTop 20.0
2020-08-09 my-group type-s NY DTop 20.0
2020-08-10 my-group type-s NY DTop 30.0
2020-08-01 my-group type-s NY Tower 40.0
2020-08-02 my-group type-s NY Tower 40.0
2020-08-03 my-group type-s NY Tower 40.0
2020-08-04 my-group type-s NY Tower 40.0
2020-08-05 my-group type-s NY Tower 40.0
2020-08-06 my-group type-s NY Tower 40.0
2020-08-07 my-group type-s NY Tower 50.0
2020-08-08 my-group type-s NY Tower 50.0
2020-08-09 my-group type-s NY Tower 50.0
2020-08-10 my-group type-s NY Tower 50.0
2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 0 0 0 0 0.0
2020-08-06 0 0 0 0 0.0
2020-08-07 0 0 0 0 0.0
2020-08-08 ot-group type-t NY LTop 90.0
2020-08-09 ot-group type-t NY LTop 90.0
2020-08-10 ot-group type-t NY LTop 90.0


</details>


# 答案1
**得分**: 3

你忘记将`date`列转换为`DatetimeIndex`:

```python
df = pd.read_csv('data.csv', parse_dates=['date'])
# 在此处添加你的代码

另一种方法是使用pivot_table

# 如果你使用了 pd.read_csv(..., parse_dates=['date']) 则不需要以下行
df['date'] = pd.to_datetime(df['date'])
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
cols = ['group', 'rtype', 'location', 'hardware']
out = (df.pivot_table(index='date', columns=cols, values='value')
         .reindex(dti.rename('date')).ffill().fillna(0)
         .unstack().rename('value').reset_index())

输出:

>>> out
       group   rtype location hardware       date  value
0   my-group  type-s       NY     DTop 2020-08-01    0.0
1   my-group  type-s       NY     DTop 2020-08-02    0.0
2   my-group  type-s       NY     DTop 2020-08-03    0.0
3   my-group  type-s       NY     DTop 2020-08-04    0.0
4   my-group  type-s       NY     DTop 2020-08-05   10.0
5   my-group  type-s       NY     DTop 2020-08-06   10.0
6   my-group  type-s       NY     DTop 2020-08-07   20.0
7   my-group  type-s       NY     DTop 2020-08-08   20.0
8   my-group  type-s       NY     DTop 2020-08-09   20.0
9   my-group  type-s       NY     DTop 2020-08-10   30.0
10  my-group  type-s       NY    Tower 2020-08-01   40.0
11  my-group  type-s       NY    Tower 2020-08-02   40.0
12  my-group  type-s       NY    Tower 2020-08-03   40.0
13  my-group  type-s       NY    Tower 2020-08-04   40.0
14  my-group  type-s       NY    Tower 2020-08-05   40.0
15  my-group  type-s       NY    Tower 2020-08-06   40.0
16  my-group  type-s       NY    Tower 2020-08-07   50.0
17  my-group  type-s       NY    Tower 2020-08-08   50.0
18  my-group  type-s       NY    Tower 2020-08-09   50.0
19  my-group  type-s       NY    Tower 2020-08-10   50.0
20  ot-group  type-t       NY     LTop 2020-08-01    0.0
21  ot-group  type-t       NY     LTop 2020-08-02    0.0
22  ot-group  type-t       NY     LTop 2020-08-03    0.0
23  ot-group  type-t       NY     LTop 2020-08-04    0.0
24  ot-group  type-t       NY     LTop 2020-08-05    0.0
25  ot-group  type-t       NY     LTop 2020-08-06    0.0
26  ot-group  type-t       NY     LTop 2020-08-07    0.0
27  ot-group  type-t       NY     LTop 2020-08-08   90.0
28  ot-group  type-t       NY     LTop 2020-08-09   90.0
29  ot-group  type-t       NY     LTop 2020-08-10   90.0

然而,只使用pivot_table而不使用unstack可能更好:

>>> (df.pivot_table(index='date', columns=cols, values='value')
       .reindex(dti.rename('date')).ffill().fillna(0))

group      my-group       ot-group
rtype        type-s         type-t
location         NY             NY
hardware       DTop Tower     LTop
date                              
2020-08-01      0.0  40.0      0.0
2020-08-02      0.0  40.0      0.0
2020-08-03      0.0  40.0      0.0
2020-08-04      0.0  40.0      0.0
2020-08-05     10.0  40.0      0.0
2020-08-06     10.0  40.0      0.0
2020-08-07     20.0  50.0      0.0
2020-08-08     20.0  50.0     90.0
2020-08-09     20.0  50.0     90.0
2020-08-10     30.0  50.0     90.0
英文:

You forgot to convert date column as DatetimeIndex:

df = pd.read_csv(&#39;data.csv&#39;, parse_dates=[&#39;date&#39;])
# your code here

Another way with pivot_table:

# Not needed if you used pd.read_csv(..., parse_dates=[&#39;date&#39;])
df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
dti = pd.date_range(df[&#39;date&#39;].min(), df[&#39;date&#39;].max(), freq=&#39;D&#39;)
cols = [&#39;group&#39;,&#39;rtype&#39;,&#39;location&#39;,&#39;hardware&#39;]
out = (df.pivot_table(index=&#39;date&#39;, columns=cols, values=&#39;value&#39;)
.reindex(dti.rename(&#39;date&#39;)).ffill().fillna(0)
.unstack().rename(&#39;value&#39;).reset_index())

Output:

&gt;&gt;&gt; out
group   rtype location hardware       date  value
0   my-group  type-s       NY     DTop 2020-08-01    0.0
1   my-group  type-s       NY     DTop 2020-08-02    0.0
2   my-group  type-s       NY     DTop 2020-08-03    0.0
3   my-group  type-s       NY     DTop 2020-08-04    0.0
4   my-group  type-s       NY     DTop 2020-08-05   10.0
5   my-group  type-s       NY     DTop 2020-08-06   10.0
6   my-group  type-s       NY     DTop 2020-08-07   20.0
7   my-group  type-s       NY     DTop 2020-08-08   20.0
8   my-group  type-s       NY     DTop 2020-08-09   20.0
9   my-group  type-s       NY     DTop 2020-08-10   30.0
10  my-group  type-s       NY    Tower 2020-08-01   40.0
11  my-group  type-s       NY    Tower 2020-08-02   40.0
12  my-group  type-s       NY    Tower 2020-08-03   40.0
13  my-group  type-s       NY    Tower 2020-08-04   40.0
14  my-group  type-s       NY    Tower 2020-08-05   40.0
15  my-group  type-s       NY    Tower 2020-08-06   40.0
16  my-group  type-s       NY    Tower 2020-08-07   50.0
17  my-group  type-s       NY    Tower 2020-08-08   50.0
18  my-group  type-s       NY    Tower 2020-08-09   50.0
19  my-group  type-s       NY    Tower 2020-08-10   50.0
20  ot-group  type-t       NY     LTop 2020-08-01    0.0
21  ot-group  type-t       NY     LTop 2020-08-02    0.0
22  ot-group  type-t       NY     LTop 2020-08-03    0.0
23  ot-group  type-t       NY     LTop 2020-08-04    0.0
24  ot-group  type-t       NY     LTop 2020-08-05    0.0
25  ot-group  type-t       NY     LTop 2020-08-06    0.0
26  ot-group  type-t       NY     LTop 2020-08-07    0.0
27  ot-group  type-t       NY     LTop 2020-08-08   90.0
28  ot-group  type-t       NY     LTop 2020-08-09   90.0
29  ot-group  type-t       NY     LTop 2020-08-10   90.0

However, it could be better to use only pivot_table without unstack:

&gt;&gt;&gt; (df.pivot_table(index=&#39;date&#39;, columns=cols, values=&#39;value&#39;)
.reindex(dti.rename(&#39;date&#39;)).ffill().fillna(0))
group      my-group       ot-group
rtype        type-s         type-t
location         NY             NY
hardware       DTop Tower     LTop
date                              
2020-08-01      0.0  40.0      0.0
2020-08-02      0.0  40.0      0.0
2020-08-03      0.0  40.0      0.0
2020-08-04      0.0  40.0      0.0
2020-08-05     10.0  40.0      0.0
2020-08-06     10.0  40.0      0.0
2020-08-07     20.0  50.0      0.0
2020-08-08     20.0  50.0     90.0
2020-08-09     20.0  50.0     90.0
2020-08-10     30.0  50.0     90.0

答案2

得分: 0

以下是代码部分的翻译:

# pip install pyjanitor
import janitor
import pandas as pd

df = pd.read_clipboard(parse_dates=['date'])
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
cols = ['group', 'rtype', 'location', 'hardware']

(df
.complete(cols, {'date': dti})
.assign(value = lambda f: f.value.ffill().fillna(0, downcast='infer'))
)

请注意,这是代码的翻译,没有其他内容。

英文:

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor 
import pandas as pd

df = pd.read_clipboard(parse_dates=[&#39;date&#39;])
dti = pd.date_range(df[&#39;date&#39;].min(), df[&#39;date&#39;].max(), freq=&#39;D&#39;)
cols = [&#39;group&#39;,&#39;rtype&#39;,&#39;location&#39;,&#39;hardware&#39;]

(df
.complete(cols, {&#39;date&#39;: dti})
.assign(value = lambda f: f.value.ffill().fillna(0, downcast=&#39;infer&#39;))
)

       group   rtype location hardware       date  value
0   my-group  type-s       NY     DTop 2020-08-01      0
1   my-group  type-s       NY     DTop 2020-08-02      0
2   my-group  type-s       NY     DTop 2020-08-03      0
3   my-group  type-s       NY     DTop 2020-08-04      0
4   my-group  type-s       NY     DTop 2020-08-05     10
5   my-group  type-s       NY     DTop 2020-08-06     10
6   my-group  type-s       NY     DTop 2020-08-07     20
7   my-group  type-s       NY     DTop 2020-08-08     20
8   my-group  type-s       NY     DTop 2020-08-09     20
9   my-group  type-s       NY     DTop 2020-08-10     30
10  my-group  type-s       NY    Tower 2020-08-01     40
11  my-group  type-s       NY    Tower 2020-08-02     40
12  my-group  type-s       NY    Tower 2020-08-03     40
13  my-group  type-s       NY    Tower 2020-08-04     40
14  my-group  type-s       NY    Tower 2020-08-05     40
15  my-group  type-s       NY    Tower 2020-08-06     40
16  my-group  type-s       NY    Tower 2020-08-07     50
17  my-group  type-s       NY    Tower 2020-08-08     50
18  my-group  type-s       NY    Tower 2020-08-09     50
19  my-group  type-s       NY    Tower 2020-08-10     50
20  ot-group  type-t       NY     LTop 2020-08-01     50
21  ot-group  type-t       NY     LTop 2020-08-02     50
22  ot-group  type-t       NY     LTop 2020-08-03     50
23  ot-group  type-t       NY     LTop 2020-08-04     50
24  ot-group  type-t       NY     LTop 2020-08-05     50
25  ot-group  type-t       NY     LTop 2020-08-06     50
26  ot-group  type-t       NY     LTop 2020-08-07     50
27  ot-group  type-t       NY     LTop 2020-08-08     90
28  ot-group  type-t       NY     LTop 2020-08-09     90
29  ot-group  type-t       NY     LTop 2020-08-10     90

@corralien's suggestion of a pivot table without an unstack might be a better one, it all depends on your end goal

huangapple
  • 本文由 发表于 2023年5月25日 12:17:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76328903.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定