Pandas shift that takes into account groups
Question
I have chronological data (monthly aggregation per customer).
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'period': [200010, 200011, 200012, 200101, 200102, 200103, 200010, 200011, 200012, 200101, 200103],
    'volume': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12],
    'num_transactions': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    'label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})
The dataframe is sorted by customer and period, ascending.
There is a column "label" which is, essentially, a categorical variable.
I want to introduce a column "next_month_label" that stores the label value of the following month for the same customer.
I used shift and then realised that it does not take into account that the rows for customer 1 are immediately followed by the rows for customer 2. So, essentially, the last row for customer 1 is "borrowing" the label of the first row of customer 2. Instead, "next_month_label" for the last row of customer 1 should stay empty / null.
How can I do that?
The expected result should look like this:
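import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'period': [200010, 200011, 200012, 200101, 200102, 200103, 200010, 200011, 200012, 200101, 200103],
    'volume': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12],
    'num_transactions': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    'label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    # np.nan marks the last row of each customer, which has no following month
    'next_month_label': [1, 1, 0, 1, 1, np.nan, 0, 0, 0, 0, np.nan],
})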
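To make the problem concrete, here is a minimal sketch of the plain shift described above, trimmed to the two columns that matter. The variable name `naive` is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'label':   [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

# A plain shift works on row position only and knows nothing about cust_id.
naive = df['label'].shift(-1)

# Row 5 is the last row of customer 1, row 6 is the first row of customer 2.
# The shifted value at row 5 is simply the label of row 6.
print(df.loc[5, 'cust_id'], df.loc[6, 'cust_id'])
print(naive[5], df.loc[6, 'label'])
```

The value at row 5 comes from a different customer, which is exactly the leak that should become null instead.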
Answer 1
Score: 1
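Let me know if this code gives you the required result. Grouping by cust_id before shifting keeps the shift inside each customer, so the last row of every customer gets NaN:
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'period': [200010, 200011, 200012, 200101, 200102, 200103, 200010, 200011, 200012, 200101, 200103],
    'volume': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12],
    'num_transactions': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    'label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

# shift(-1) pulls the next row's label upward, separately within each cust_id group
df['next_month_label'] = df.groupby('cust_id')['label'].shift(-1)
print(df)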
Output:
    cust_id  period  volume  num_transactions  label  next_month_label
0         1  200010       1                 3      1               1.0
1         1  200011       2                 4      1               1.0
2         1  200012       3                 5      1               0.0
3         1  200101       4                 6      0               1.0
4         1  200102       5                 7      1               1.0
5         1  200103       6                 8      1               NaN
6         2  200010       7                 9      0               0.0
7         2  200011       8                10      0               0.0
8         2  200012       9                11      0               0.0
9         2  200101      10                12      0               0.0
10        2  200103      12                13      0               NaN
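One thing to be aware of: the grouped shift returns the label of the next available row for the customer, not necessarily the next calendar month. Customer 2 has no row for 200102, so its 200101 row receives the label from 200103. That matches the expected output in the question. If a strict "next calendar month" meaning were wanted instead, a self-merge on a computed next period is one possible sketch. It assumes `period` is always an integer in YYYYMM form and that each (cust_id, period) pair appears only once. The names `next_period`, `lookup` and `out` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'period': [200010, 200011, 200012, 200101, 200102, 200103,
               200010, 200011, 200012, 200101, 200103],
    'label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

# Compute the following calendar month as a YYYYMM integer (assumed format).
year, month = df['period'] // 100, df['period'] % 100
df['next_period'] = np.where(month == 12, (year + 1) * 100 + 1, df['period'] + 1)

# Lookup table: for each customer, the label observed in a given period.
lookup = df[['cust_id', 'period', 'label']].rename(
    columns={'period': 'next_period', 'label': 'next_month_label'}
)

# Left merge keeps every original row; rows without a matching next month get NaN.
out = df.merge(lookup, on=['cust_id', 'next_period'], how='left')
out = out.drop(columns='next_period')

# Optional: nullable integer dtype shows 1/0/<NA> instead of 1.0/0.0/NaN.
out['next_month_label'] = out['next_month_label'].astype('Int64')
print(out)
```

With this variant the 200101 row of customer 2 is null as well, because 200102 is missing. Which behaviour is right depends on what "next month" should mean for gaps in the data.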