row_number resets based on two columns in Python
Question
My goal is to generate the following row number, called transaction_in_row.
The row number should reset based on (partition by) the customer and has_transaction columns. My issue is with the yellow column: the row_number function in SQL returns 3 instead of 1.
This is my current SQL code:
row_number() over(partition by customer, has_transaction order by month asc) as transaction_in_row
Because I'm stuck in SQL, I'm trying to find a way to do this in a Python DataFrame instead. My thinking is to loop manually per customer and per month, but this will be painfully slow as I'm handling ~30 million rows.
Can anyone help me find a more efficient way to do this?
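To see why the partitioned row_number does not reset, the same computation can be reproduced in pandas. This is a minimal sketch on made-up sample values (not the asker's data): `groupby(['customer', 'has_transaction']).cumcount() + 1` mirrors the SQL window above, and it numbers all True rows of a customer together, regardless of the False rows between them.

```python
import pandas as pd

# Hypothetical sample data for a single customer, one row per month.
sample = pd.DataFrame({
    "customer": ["x"] * 5,
    "month": [1, 2, 3, 4, 5],
    "has_transaction": [True, True, False, True, True],
})

# pandas equivalent of:
#   row_number() over (partition by customer, has_transaction order by month)
sample["naive_row_number"] = (
    sample.sort_values("month")
    .groupby(["customer", "has_transaction"])
    .cumcount()
    + 1
)
print(sample)
```

Month 4 starts a new streak of transactions, yet it is numbered 3, because the partition only knows about the value of has_transaction, not about where each consecutive run begins.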
Answer 1
Score: 2
One way is to use the groupby(cumsum).cumsum trick. It is harder here than in the various sources you could find about it, because you also need to do it per customer.
Here is one version, with a lot of intermediate results rather than a one-liner. It may be possible to do it in one line, but that is more for showing off (even if I often do so), and here it is a bit trickier than just joining lines. Still, there are only 4 or 5 lines of code (the rest are comments).
import pandas as pd

# Just an example dataframe. You should have provided this [mre].
# I added a second customer `y` to illustrate the "by customer" part.
# Note that the first `y` value is True, as is the preceding row, which
# belongs to `x`. And the second `y` value is True, while the
# previous row, which belongs to `x`, is False.
# So it is interesting to check what happens with those two:
# we expect the first `y` to be detected as a value change (since it is
# the first value of that customer), and we expect the second `y` to be
# detected as a "same as before" case: it differs from the row just
# before it, but not from the last row of the same customer.
# The rows are assumed to be already sorted by month within each customer.
df = pd.DataFrame({
    'customer': ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'x', 'y', 'y', 'y'],
    'has_transaction': [True, True, False, False, True, True, True, False, True, True, False],
})

# A groupby object to perform computations per customer
gbcust = df.has_transaction.groupby(df.customer)

# A Series of True/False values saying whether a row's has_transaction
# differs from the previous row of the same customer
# (the first row of each customer is compared with NaN, so it is True)
change = gbcust.shift(1) != df.has_transaction

# The trick is to `.cumsum` the values in `change`.
# Each time there is a change, the cumsum increases by 1.
# Each row that is the same as the previous one keeps the same cumsum.
# This way, we get a sort of "ID" for each run of identical rows.
# But we must do this customer by customer.
ids = change.groupby(df.customer).cumsum()

# The target column. We conveniently set it to 1 initially,
# so that we can `cumsum` it
df['transaction_in_row'] = 1

# Now all we have to do is accumulate the row count within each ids/customer group
df['transaction_in_row'] = df['transaction_in_row'].groupby([ids, df.customer]).cumsum()
Intermediate values for this example
The dataframe:
index | customer | has_transaction |
---|---|---|
0 | x | True |
1 | x | True |
2 | x | False |
3 | x | False |
4 | x | True |
5 | x | True |
6 | y | True |
7 | x | False |
8 | y | True |
9 | y | True |
10 | y | False |
Content of the change variable (presented here as if it were added to the dataframe, just for readability). change is True for each row whose has_transaction value differs from the previous row of the same customer, including each customer's first row.
index | customer | has_transaction | change |
---|---|---|---|
0 | x | True | True |
1 | x | True | False |
2 | x | False | True |
3 | x | False | False |
4 | x | True | True |
5 | x | True | False |
6 | y | True | True |
7 | x | False | True |
8 | y | True | False |
9 | y | True | False |
10 | y | False | True |
Hence ids, the per-customer cumsum of change, which gives a different number to each run of identical has_transaction values (for a given customer).
index | customer | has_transaction | ids |
---|---|---|---|
0 | x | True | 1 |
1 | x | True | 1 |
2 | x | False | 2 |
3 | x | False | 2 |
4 | x | True | 3 |
5 | x | True | 3 |
6 | y | True | 1 |
7 | x | False | 4 |
8 | y | True | 1 |
9 | y | True | 1 |
10 | y | False | 2 |
Note that this id must be understood "by customer". So for customer x we have groups 1, 2, 3, 4, and for customer y we have groups 1 and 2.
And the final result, which is the running count of rows within each id/customer group:
index | customer | has_transaction | transaction_in_row |
---|---|---|---|
0 | x | True | 1 |
1 | x | True | 2 |
2 | x | False | 1 |
3 | x | False | 2 |
4 | x | True | 1 |
5 | x | True | 2 |
6 | y | True | 1 |
7 | x | False | 1 |
8 | y | True | 2 |
9 | y | True | 3 |
10 | y | False | 1 |
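For reference, the same steps can be folded into one small helper. This is only a sketch of the approach above, not a different method: the function name `transaction_in_row` and the use of `cumcount() + 1` in place of the cumsum over a column of ones are my own choices, and it assumes the rows are already in month order within each customer.

```python
import pandas as pd


def transaction_in_row(df: pd.DataFrame) -> pd.Series:
    """Position of each row inside its run of identical has_transaction
    values, computed separately for each customer.

    Assumes rows are already in month order within each customer.
    """
    flag = df["has_transaction"]
    # True where the flag differs from the same customer's previous row
    # (a customer's first row is compared with NaN, hence True).
    change = flag.groupby(df["customer"]).shift() != flag
    # Run identifier, numbered separately for each customer.
    run_id = change.groupby(df["customer"]).cumsum()
    # Position inside each (customer, run) group, starting at 1.
    return df.groupby([df["customer"], run_id]).cumcount() + 1


df = pd.DataFrame({
    "customer": ["x", "x", "x", "x", "x", "x", "y", "x", "y", "y", "y"],
    "has_transaction": [True, True, False, False, True, True,
                        True, False, True, True, False],
})
result = transaction_in_row(df)
print(result.tolist())  # [1, 2, 1, 2, 1, 2, 1, 1, 2, 3, 1]
```

With real data, sorting first (for example `df.sort_values(['customer', 'month'])`) would make the month-order assumption hold.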
I am pretty sure there are smarter, and even faster, pure pandas ways to do this (with fewer intermediate results).
But this is linear, and it is impossible to do better than linear for this problem. And it is vectorized: no for loops, no apply of Python functions.
So, if there is a better solution, I bet it can't be much more than twice as fast (nothing compared to the factor of 100 or 1000 that we typically gain when we remove for loops and apply from pandas code).
That being said, the best way would probably be to use a C extension, or numba, to create an ad hoc numpy ufunc or pandas accumulator.
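As a rough illustration of the numba idea, here is a single-pass loop over NumPy arrays. It is a sketch under several assumptions, not a tuned implementation: the function name `run_lengths` is hypothetical, numba is treated as optional (the loop falls back to plain Python if it is not installed, which is correct but slow), customers are assumed to have no missing values, the input is assumed non-empty, and rows are assumed to be in month order within each customer.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # optional dependency, assumed available for the speed-up
except ImportError:
    def njit(func):
        # Fallback: leave the plain Python loop unchanged.
        return func


@njit
def run_lengths(customer_codes, flags):
    out = np.empty(len(flags), dtype=np.int64)
    # last_seen[c] is the index of the previous row of customer c, or -1.
    last_seen = np.full(customer_codes.max() + 1, -1, dtype=np.int64)
    for i in range(len(flags)):
        c = customer_codes[i]
        prev = last_seen[c]
        if prev >= 0 and flags[prev] == flags[i]:
            # Same value as this customer's previous row: extend the run.
            out[i] = out[prev] + 1
        else:
            # First row of the customer, or the value changed: restart at 1.
            out[i] = 1
        last_seen[c] = i
    return out


df = pd.DataFrame({
    "customer": ["x", "x", "x", "x", "x", "x", "y", "x", "y", "y", "y"],
    "has_transaction": [True, True, False, False, True, True,
                        True, False, True, True, False],
})
# Map customer labels to integer codes so the loop only touches numeric arrays.
codes, _ = pd.factorize(df["customer"])
df["transaction_in_row"] = run_lengths(codes, df["has_transaction"].to_numpy())
print(df)
```

The loop keeps only the index of each customer's previous row, so the rows of different customers may be interleaved, as in the example dataframe.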