Replacing the values of 2020 rows with the average of 2020 and 2021 data in pandas
Question
I have a DataFrame with the columns group, date and value. The dates are monthly.
Since my 2020 data is affected by Covid, I would like to replace each 2020 value with the average of the 2020 and 2021 values for the same group and month.
How do I perform this operation on the DataFrame?
import pandas as pd

input = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'B'],
    'date': ['2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2022-01-31', '2022-02-28'],
    'value': [10, 20, 30, 40, 0, 20, 20, 40, 10, 20]
})
Desired output:
output = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'B'],
    'date': ['2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2022-01-31', '2022-02-28'],
    'value': [15, 20, 35, 40, 10, 20, 30, 40, 10, 20]
})
Answer 1
Score: 1
Use:
# convert the date column to datetime
input['date'] = pd.to_datetime(input['date'])
# extract the year of each row
y = input['date'].dt.year
# keep only the 2020 and 2021 rows
mask = y.isin([2020, 2021])
df = input[mask]
# mean per group and month-day (MM-DD), broadcast back to each row
s = df.groupby(['group', df['date'].dt.strftime('%m-%d')])['value'].transform('mean')
# replace the values of the 2020 rows with those means
df = input.assign(value=input['value'].mask(y.eq(2020) & mask, s))
print(df)
  group       date  value
0     A 2020-01-31     15
1     A 2021-01-31     20
2     A 2020-03-31     35
3     A 2021-03-31     40
4     B 2020-01-31     10
5     B 2021-01-31     20
6     B 2020-03-31     30
7     B 2021-03-31     40
8     A 2022-01-31     10
9     B 2022-02-28     20
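The same idea can be wrapped in a reusable function. This is only a sketch under a few assumptions: the helper name impute_year_with_pair_mean and its parameters target_year and reference_year are made up for illustration, the column names group, date and value are taken from the question, and the result is stored as float so the means are written without depending on any dtype conversion. It leaves the original frame and its date strings untouched.

```python
import pandas as pd


def impute_year_with_pair_mean(frame, target_year=2020, reference_year=2021):
    """Hypothetical helper: replace target_year values with the mean of the
    target_year and reference_year values for the same group and month-day."""
    out = frame.copy()
    dates = pd.to_datetime(out["date"])
    years = dates.dt.year

    # Keep values from the two years only; every other row becomes NaN
    # and is therefore skipped when the mean is computed.
    in_pair = years.isin([target_year, reference_year])
    pair_values = out["value"].astype("float64").where(in_pair)

    # Mean per (group, MM-DD), broadcast back to every row.
    month_day = dates.dt.strftime("%m-%d")
    means = pair_values.groupby([out["group"], month_day]).transform("mean")

    # Overwrite only the target-year rows; all other rows keep their value.
    out["value"] = out["value"].astype("float64").mask(years.eq(target_year), means)
    return out


sample = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B", "A", "B"],
    "date": ["2020-01-31", "2021-01-31", "2020-03-31", "2021-03-31",
             "2020-01-31", "2021-01-31", "2020-03-31", "2021-03-31",
             "2022-01-31", "2022-02-28"],
    "value": [10, 20, 30, 40, 0, 20, 20, 40, 10, 20],
})
result = impute_year_with_pair_mean(sample)
print(result)
```

With the sample data, the four 2020 rows become 15, 35, 10 and 30, and the 2021 and 2022 rows are unchanged. A 2020 row without a matching 2021 row simply keeps its own value, since the mean is then taken over a single number.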