Pandas 的 shift 函数,考虑了分组。

huangapple go评论51阅读模式
英文:

Pandas shift that takes into account groups

问题

我有时间顺序的数据(每个客户的月度汇总)。

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0]})

数据框按用户和月份升序排列。

有一列“label”,实质上是一个分类变量。

我想引入一列“next_month_label”,在其中存储该用户的下个月的标签值。

我使用了shift,然后我意识到它没有考虑到客户1的数据随后被客户2的数据接替。所以,实际上,客户1的最后一行正在“借用”客户2的第一行的标签。相反,客户1的最后一行的“next_month_label”字段应该保持为空/ null。

如何做到这一点?

预期结果应该如下所示:

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0],
                 'next_month_label': [1,1,0,1,1,None,0,0,0,0,None]})
英文:

I have chronological data (monthly aggregation per customer).

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0]})

The dataframe is sorted out by user and month, ascending.

There is a column "label" which is, essentially, a categorical variable.

I want to introduce a column "next_month_label" where I store the label value for the next month for that user.

I used shift and then I realised that it does not consider the fact that the data for customer1 is then followed by that of customer2. So, essentially, the last row for customer1 is "borrowing" the label of the first row of customer2. Instead, the field "next_month_label" for the last row of customer1 should stay empty / null.

How to do that?

The expected result should look like this:

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0],
                 'next_month_label': [1,1,0,1,1,NaN,0,0,0,0,NaN],
})

答案1

得分: 1

让我知道这段代码是否给你提供了所需的结果:

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0]})

df['next_month_label'] = df.groupby('cust_id')['label'].shift(-1)

print(df)

输出结果为:

    cust_id  period  volume  num_transactions  label  next_month_label
0         1  200010       1                 3      1               1.0
1         1  200011       2                 4      1               1.0
2         1  200012       3                 5      1               0.0
3         1  200101       4                 6      0               1.0
4         1  200102       5                 7      1               1.0
5         1  200103       6                 8      1               NaN
6         2  200010       7                 9      0               0.0
7         2  200011       8                10      0               0.0
8         2  200012       9                11      0               0.0
9         2  200101      10                12      0               0.0
10        2  200103      12                13      0               NaN
英文:

Let me know if this code gives you required result:

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0]})

df['next_month_label'] = df.groupby('cust_id')['label'].shift(-1)

print(df)

 cust_id  period  volume  num_transactions  label  next_month_label
0         1  200010       1                 3      1               1.0
1         1  200011       2                 4      1               1.0
2         1  200012       3                 5      1               0.0
3         1  200101       4                 6      0               1.0
4         1  200102       5                 7      1               1.0
5         1  200103       6                 8      1               NaN
6         2  200010       7                 9      0               0.0
7         2  200011       8                10      0               0.0
8         2  200012       9                11      0               0.0
9         2  200101      10                12      0               0.0
10        2  200103      12                13      0               NaN

huangapple
  • 本文由 发表于 2023年7月20日 18:17:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/76728871.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定