Replacing the values of 2020 rows with the average of 2020 and 2021 data in pandas

Question

I have a dataframe with the columns group, date and value. My dates are at a monthly level.
Since my 2020 data is affected by Covid, I would like to impute each 2020 value with the average of the 2020 and 2021 values for the same group and month.
How do I perform this operation on the dataframe?


import pandas as pd

input = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'B'],
    'date': ['2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2022-01-31', '2022-02-28'],
    'value': [10, 20, 30, 40, 0, 20, 20, 40, 10, 20]
})

Desired output:

output = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'B'],
    'date': ['2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2020-01-31', '2021-01-31', '2020-03-31', '2021-03-31', '2022-01-31', '2022-02-28'],
    'value': [15, 20, 35, 40, 10, 20, 30, 40, 10, 20]
})

Answer 1

Score: 1


Use:

# Convert the date column to datetime
input['date'] = pd.to_datetime(input['date'])

# Extract the year of each row
y = input['date'].dt.year

# Keep only the 2020 and 2021 rows
mask = y.isin([2020, 2021])
df = input[mask]

# Mean per group and MM-DD, broadcast back onto each 2020/2021 row
s = df.groupby(['group', df['date'].dt.strftime('%m-%d')])['value'].transform('mean')

# Replace the 2020 values with those means; all other rows keep their value
df = input.assign(value=input['value'].mask(y.eq(2020) & mask, s))
print(df)

Output:

  group       date  value
0     A 2020-01-31     15
1     A 2021-01-31     20
2     A 2020-03-31     35
3     A 2021-03-31     40
4     B 2020-01-31     10
5     B 2021-01-31     20
6     B 2020-03-31     30
7     B 2021-03-31     40
8     A 2022-01-31     10
9     B 2022-02-28     20
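
One thing to watch with the MM-DD key: 2020 is a leap year, so with month-end dates 2020-02-29 and 2021-02-28 would land in different groups and the February value would stay unchanged. Below is a minimal sketch of the same idea that groups by calendar month instead, assuming there is at most one row per group and month in each year. The function name impute_2020_with_mean and the variables data and feb are hypothetical names used only for this illustration, and the result column is float rather than int.

```python
import pandas as pd


def impute_2020_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where each 2020 value is the mean of the 2020 and 2021
    values for the same group and calendar month (hypothetical helper)."""
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    year = dates.dt.year

    # Only 2020 and 2021 rows take part in the average.
    in_window = year.isin([2020, 2021])
    is_2020 = year.eq(2020)
    window = out[in_window]

    # Group by calendar month rather than MM-DD, so that 2020-02-29 and
    # 2021-02-28 fall into the same bucket.
    month = dates[in_window].dt.month.rename("month")
    means = window.groupby(["group", month])["value"].transform("mean")

    # Overwrite only the 2020 rows. A 2020 row without a 2021 counterpart
    # keeps its own value, because the mean of a single value is that value.
    out["value"] = (
        out["value"].astype("float64").mask(is_2020, means.reindex(out.index))
    )
    return out


data = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B", "A", "B"],
    "date": ["2020-01-31", "2021-01-31", "2020-03-31", "2021-03-31",
             "2020-01-31", "2021-01-31", "2020-03-31", "2021-03-31",
             "2022-01-31", "2022-02-28"],
    "value": [10, 20, 30, 40, 0, 20, 20, 40, 10, 20],
})
result = impute_2020_with_mean(data)
print(result)

# February month-ends differ between 2020 and 2021 but share the same month.
feb = pd.DataFrame({
    "group": ["A", "A"],
    "date": ["2020-02-29", "2021-02-28"],
    "value": [10, 30],
})
print(impute_2020_with_mean(feb))
```

The function leaves the original dataframe untouched and does not rely on a variable named input, which would otherwise shadow Python's built-in of the same name.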

Posted by huangapple on 2023-03-09 20:09:16
When reposting, please keep the link to this article: https://go.coder-hub.com/75684420.html