2023-07-20 18:17:12

Pandas shift that takes into account groups

Question

I have chronological data (monthly aggregation per customer).

import pandas as pd

# Monthly aggregates per customer; period is encoded as YYYYMM
df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0]})

The dataframe is sorted by customer and month, ascending.

There is a column "label" which is, essentially, a categorical variable.

I want to introduce a column "next_month_label" where I store the label value for the next month for that user.

I used shift, but then I realised that it ignores the boundary between customers: the data for customer 1 is directly followed by that of customer 2, so the last row for customer 1 is "borrowing" the label of the first row of customer 2. Instead, the field "next_month_label" for the last row of each customer should stay empty / null.
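The behaviour described above can be reproduced with a small sketch. It uses a trimmed copy of the sample data, and the column name naive_next is hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Trimmed copy of the sample data from the question.
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'period': [200010, 200011, 200012, 200101, 200102, 200103,
               200010, 200011, 200012, 200101, 200103],
    'label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

# A plain shift operates on the whole column and knows nothing about cust_id.
# 'naive_next' is a hypothetical column name used only for this illustration.
df['naive_next'] = df['label'].shift(-1)

# Index 5 is the last row of customer 1; its value comes from
# index 6, which is the first row of customer 2.
print(df.loc[5, ['cust_id', 'label', 'naive_next']])
```

Only the very last row of the frame ends up null here, whereas the last row of every customer should.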

How can I do that?

The expected result should look like this:

# None is stored as NaN once pandas builds the numeric column
df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0],
                 'next_month_label': [1,1,0,1,1,None,0,0,0,0,None],
})

Answer 1

Score: 1

Let me know if this code gives you the required result:

import pandas as pd

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13],
                 'label': [1,1,1,0,1,1,0,0,0,0,0]})

# Shift within each customer, so the last row of every group gets NaN
df['next_month_label'] = df.groupby('cust_id')['label'].shift(-1)

print(df)

Output:

    cust_id  period  volume  num_transactions  label  next_month_label
0         1  200010       1                 3      1               1.0
1         1  200011       2                 4      1               1.0
2         1  200012       3                 5      1               0.0
3         1  200101       4                 6      0               1.0
4         1  200102       5                 7      1               1.0
5         1  200103       6                 8      1               NaN
6         2  200010       7                 9      0               0.0
7         2  200011       8                10      0               0.0
8         2  200012       9                11      0               0.0
9         2  200101      10                12      0               0.0
10        2  200103      12                13      0               NaN
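The labels print as 1.0 and 0.0 because NaN forces the new column to a float dtype. If integer labels are preferred, one possible variation is sketched below. It assumes that pandas' nullable Int64 dtype is acceptable for this column, and it adds an explicit sort_values call as a precaution in case the frame is not already ordered by customer and period.

```python
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'period': [200010, 200011, 200012, 200101, 200102, 200103,
               200010, 200011, 200012, 200101, 200103],
    'label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

# Make the ordering explicit so that "next row" means
# "next available period" within each customer.
df = df.sort_values(['cust_id', 'period']).reset_index(drop=True)

# Shift within each customer, then cast to the nullable integer dtype
# so missing values appear as <NA> and the labels stay integers.
df['next_month_label'] = (
    df.groupby('cust_id')['label'].shift(-1).astype('Int64')
)

print(df[['cust_id', 'period', 'label', 'next_month_label']])
```

Note that the shift moves by one row, not by one calendar month. Customer 2 has no row for 200102, so the 200101 row receives the label from 200103, which matches the expected result given in the question.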
