Resample dataframe to add missing dates
Question
<details>
<summary>English:</summary>
I have a dataframe with multiple string columns, one date column and one int value column.
I want to `ffill` the missing dates for each group of text columns. The missing dates are all dates from the min date to max date in the dataframe. I think this is better explained using an example.
Sample Input:
group rtype location hardware date value
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-07 50
ot-group type-t NY LTop 2020-08-08 90
Min and Max date for this dataframe: (start_date) 2020-08-01 - (end_date) 2020-08-10
Sample Output:
group rtype location hardware date value
my-group type-s NY DTop 2020-08-01 0
my-group type-s NY DTop 2020-08-02 0
my-group type-s NY DTop 2020-08-03 0
my-group type-s NY DTop 2020-08-04 0
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-06 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-08 20
my-group type-s NY DTop 2020-08-09 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-02 40
my-group type-s NY Tower 2020-08-03 40
my-group type-s NY Tower 2020-08-04 40
my-group type-s NY Tower 2020-08-05 40
my-group type-s NY Tower 2020-08-06 40
my-group type-s NY Tower 2020-08-07 50
my-group type-s NY Tower 2020-08-08 50
my-group type-s NY Tower 2020-08-09 50
my-group type-s NY Tower 2020-08-10 50
ot-group type-t NY LTop 2020-08-01 0
ot-group type-t NY LTop 2020-08-02 0
ot-group type-t NY LTop 2020-08-03 0
ot-group type-t NY LTop 2020-08-04 0
ot-group type-t NY LTop 2020-08-05 0
ot-group type-t NY LTop 2020-08-06 0
ot-group type-t NY LTop 2020-08-07 0
ot-group type-t NY LTop 2020-08-08 90
ot-group type-t NY LTop 2020-08-09 90
ot-group type-t NY LTop 2020-08-10 90
In this example, I kept the location fixed to avoid an extra long output.
I am able to get the dates I want using `pd.date_range()`.
I tried using `resample` with multiindex but I run into errors (similar to [this](https://stackoverflow.com/questions/15799162/resampling-within-a-pandas-multiindex)).
I tried the approach mentioned in [this answer](https://stackoverflow.com/a/32275705/1872234) but it doesn't seem to work.
My code:
import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
print(len(date_range), date_range)
def reindex_by_date(df):
    return df.reindex(date_range).ffill()
df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
print(df.to_string())
Output of this code:
10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
'2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
'2020-08-09', '2020-08-10'],
dtype='datetime64[ns]', freq='D')
group rtype location hardware value
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
Can someone help please?
**EDIT:**
After fixing the DatetimeIndex issue, and using `fillna(0)`:
df = pd.read_csv('data.csv', parse_dates=['date'])
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
print(len(date_range), date_range)
def reindex_by_date(df):
    return df.reindex(date_range).ffill().fillna(0)
df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True).reset_index().rename(columns={'index': 'date'})
print(df.to_string())
Output:
group rtype location hardware value
2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 my-group type-s NY DTop 10.0
2020-08-06 my-group type-s NY DTop 10.0
2020-08-07 my-group type-s NY DTop 20.0
2020-08-08 my-group type-s NY DTop 20.0
2020-08-09 my-group type-s NY DTop 20.0
2020-08-10 my-group type-s NY DTop 30.0
2020-08-01 my-group type-s NY Tower 40.0
2020-08-02 my-group type-s NY Tower 40.0
2020-08-03 my-group type-s NY Tower 40.0
2020-08-04 my-group type-s NY Tower 40.0
2020-08-05 my-group type-s NY Tower 40.0
2020-08-06 my-group type-s NY Tower 40.0
2020-08-07 my-group type-s NY Tower 50.0
2020-08-08 my-group type-s NY Tower 50.0
2020-08-09 my-group type-s NY Tower 50.0
2020-08-10 my-group type-s NY Tower 50.0
2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 0 0 0 0 0.0
2020-08-06 0 0 0 0 0.0
2020-08-07 0 0 0 0 0.0
2020-08-08 ot-group type-t NY LTop 90.0
2020-08-09 ot-group type-t NY LTop 90.0
2020-08-10 ot-group type-t NY LTop 90.0
</details>
# Answer 1
**Score**: 3
You forgot to convert the `date` column to a `DatetimeIndex`:
df = pd.read_csv('data.csv', parse_dates=['date'])
# your code here
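To illustrate why the first attempt produced only `NaN`, here is a minimal sketch on a toy Series (the variable names are illustrative, not from the question). Without `parse_dates`, the index holds plain strings, and string labels never match the `Timestamp` labels that `pd.date_range` generates.

```python
import pandas as pd

# Values indexed by date *strings*, which is what read_csv yields without parse_dates
s = pd.Series([10, 20], index=['2020-08-05', '2020-08-07'])
dti = pd.date_range('2020-08-05', '2020-08-07', freq='D')

# No string label equals a Timestamp label, so every row comes back NaN
mismatched = s.reindex(dti)

# After converting the index, existing dates line up and only the gap is NaN,
# which ffill() then fills from the previous day
s.index = pd.to_datetime(s.index)
aligned = s.reindex(dti).ffill()

print(mismatched)
print(aligned)
```

Passing `parse_dates=['date']` to `pd.read_csv` does the same conversion at load time.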
Another way, with `pivot_table`:
# Not needed if you used pd.read_csv(..., parse_dates=['date'])
df['date'] = pd.to_datetime(df['date'])
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
cols = ['group','rtype','location','hardware']
out = (df.pivot_table(index='date', columns=cols, values='value')
.reindex(dti.rename('date')).ffill().fillna(0)
.unstack().rename('value').reset_index())
Output:
>>> out
group rtype location hardware date value
0 my-group type-s NY DTop 2020-08-01 0.0
1 my-group type-s NY DTop 2020-08-02 0.0
2 my-group type-s NY DTop 2020-08-03 0.0
3 my-group type-s NY DTop 2020-08-04 0.0
4 my-group type-s NY DTop 2020-08-05 10.0
5 my-group type-s NY DTop 2020-08-06 10.0
6 my-group type-s NY DTop 2020-08-07 20.0
7 my-group type-s NY DTop 2020-08-08 20.0
8 my-group type-s NY DTop 2020-08-09 20.0
9 my-group type-s NY DTop 2020-08-10 30.0
10 my-group type-s NY Tower 2020-08-01 40.0
11 my-group type-s NY Tower 2020-08-02 40.0
12 my-group type-s NY Tower 2020-08-03 40.0
13 my-group type-s NY Tower 2020-08-04 40.0
14 my-group type-s NY Tower 2020-08-05 40.0
15 my-group type-s NY Tower 2020-08-06 40.0
16 my-group type-s NY Tower 2020-08-07 50.0
17 my-group type-s NY Tower 2020-08-08 50.0
18 my-group type-s NY Tower 2020-08-09 50.0
19 my-group type-s NY Tower 2020-08-10 50.0
20 ot-group type-t NY LTop 2020-08-01 0.0
21 ot-group type-t NY LTop 2020-08-02 0.0
22 ot-group type-t NY LTop 2020-08-03 0.0
23 ot-group type-t NY LTop 2020-08-04 0.0
24 ot-group type-t NY LTop 2020-08-05 0.0
25 ot-group type-t NY LTop 2020-08-06 0.0
26 ot-group type-t NY LTop 2020-08-07 0.0
27 ot-group type-t NY LTop 2020-08-08 90.0
28 ot-group type-t NY LTop 2020-08-09 90.0
29 ot-group type-t NY LTop 2020-08-10 90.0
However, it could be better to use only `pivot_table` without `unstack`:
>>> (df.pivot_table(index='date', columns=cols, values='value')
.reindex(dti.rename('date')).ffill().fillna(0))
group my-group ot-group
rtype type-s type-t
location NY NY
hardware DTop Tower LTop
date
2020-08-01 0.0 40.0 0.0
2020-08-02 0.0 40.0 0.0
2020-08-03 0.0 40.0 0.0
2020-08-04 0.0 40.0 0.0
2020-08-05 10.0 40.0 0.0
2020-08-06 10.0 40.0 0.0
2020-08-07 20.0 50.0 0.0
2020-08-08 20.0 50.0 90.0
2020-08-09 20.0 50.0 90.0
2020-08-10 30.0 50.0 90.0
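For comparison, the groupby-and-reindex route from the question can probably be made to work as well. This is only a sketch under a few assumptions: the sample rows are typed inline instead of being read from `data.csv`, and `fill_dates` is a hypothetical helper name. The idea is to reindex only the `value` column within each group, so the key columns stay in the group labels and are never filled.

```python
import pandas as pd

# Inline copy of the question's sample input (stands in for data.csv)
df = pd.DataFrame({
    'group': ['my-group'] * 5 + ['ot-group'],
    'rtype': ['type-s'] * 5 + ['type-t'],
    'location': ['NY'] * 6,
    'hardware': ['DTop'] * 3 + ['Tower'] * 2 + ['LTop'],
    'date': pd.to_datetime(['2020-08-05', '2020-08-07', '2020-08-10',
                            '2020-08-01', '2020-08-07', '2020-08-08']),
    'value': [10, 20, 30, 40, 50, 90],
})

cols = ['group', 'rtype', 'location', 'hardware']
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D', name='date')

def fill_dates(s):
    # Hypothetical helper: expand one group's values to the full date range,
    # carry the last known value forward, and use 0 before the first observation
    return s.reindex(dti).ffill().fillna(0)

out = (df.set_index('date')
         .groupby(cols)['value']   # select only the numeric column
         .apply(fill_dates)        # result is indexed by the keys plus 'date'
         .reset_index())
out['value'] = out['value'].astype(int)

print(out.to_string())
```

Because only `value` is reindexed, the key columns are rebuilt from the group labels, which avoids the `0` placeholders in `group`, `rtype`, `location` and `hardware` seen in the question's EDIT output.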
# Answer 2
**Score**: 0
One option is with `complete` from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df = pd.read_clipboard(parse_dates=['date'])
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
cols = ['group','rtype','location','hardware']
(df
.complete(cols, {'date': dti})
.assign(value = lambda f: f.value.ffill().fillna(0, downcast='infer'))
)
group rtype location hardware date value
0 my-group type-s NY DTop 2020-08-01 0
1 my-group type-s NY DTop 2020-08-02 0
2 my-group type-s NY DTop 2020-08-03 0
3 my-group type-s NY DTop 2020-08-04 0
4 my-group type-s NY DTop 2020-08-05 10
5 my-group type-s NY DTop 2020-08-06 10
6 my-group type-s NY DTop 2020-08-07 20
7 my-group type-s NY DTop 2020-08-08 20
8 my-group type-s NY DTop 2020-08-09 20
9 my-group type-s NY DTop 2020-08-10 30
10 my-group type-s NY Tower 2020-08-01 40
11 my-group type-s NY Tower 2020-08-02 40
12 my-group type-s NY Tower 2020-08-03 40
13 my-group type-s NY Tower 2020-08-04 40
14 my-group type-s NY Tower 2020-08-05 40
15 my-group type-s NY Tower 2020-08-06 40
16 my-group type-s NY Tower 2020-08-07 50
17 my-group type-s NY Tower 2020-08-08 50
18 my-group type-s NY Tower 2020-08-09 50
19 my-group type-s NY Tower 2020-08-10 50
20 ot-group type-t NY LTop 2020-08-01 50
21 ot-group type-t NY LTop 2020-08-02 50
22 ot-group type-t NY LTop 2020-08-03 50
23 ot-group type-t NY LTop 2020-08-04 50
24 ot-group type-t NY LTop 2020-08-05 50
25 ot-group type-t NY LTop 2020-08-06 50
26 ot-group type-t NY LTop 2020-08-07 50
27 ot-group type-t NY LTop 2020-08-08 90
28 ot-group type-t NY LTop 2020-08-09 90
29 ot-group type-t NY LTop 2020-08-10 90
@corralien's suggestion of a pivot table without an unstack might be a better one; it all depends on your end goal.
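One caveat about the output above: the `LTop` rows before 2020-08-08 show 50 rather than the 0 requested in the question, because the ungrouped `ffill` carries the last `Tower` value into the next key combination. Below is a sketch of the same "complete, then fill" idea in plain pandas, with the forward fill done per group. It assumes pandas 1.2 or later (for `how='cross'`) and uses an inline copy of the sample data; `keys` and `grid` are illustrative names.

```python
import pandas as pd

# Inline copy of the question's sample input
df = pd.DataFrame({
    'group': ['my-group'] * 5 + ['ot-group'],
    'rtype': ['type-s'] * 5 + ['type-t'],
    'location': ['NY'] * 6,
    'hardware': ['DTop'] * 3 + ['Tower'] * 2 + ['LTop'],
    'date': pd.to_datetime(['2020-08-05', '2020-08-07', '2020-08-10',
                            '2020-08-01', '2020-08-07', '2020-08-08']),
    'value': [10, 20, 30, 40, 50, 90],
})

cols = ['group', 'rtype', 'location', 'hardware']
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

# Every key combination that exists in the data, crossed with every date
keys = df[cols].drop_duplicates()
grid = keys.merge(pd.DataFrame({'date': dti}), how='cross')

# Attach the observed values; dates without an observation become NaN
out = grid.merge(df, on=cols + ['date'], how='left')

# Forward-fill within each key combination only, then zero-fill leading gaps
out['value'] = out.groupby(cols)['value'].ffill().fillna(0).astype(int)

print(out.to_string())
```

The same grouped fill should also work inside the pyjanitor chain, for example `f.groupby(cols)['value'].ffill()` in place of `f.value.ffill()`, though that variant is untested here.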