如何将计数分配给一个新列,计算在当前行不属于被计算的组时的行数?

huangapple go评论57阅读模式
英文:

How do I assign a count to a new column, counting the rows in a groupby when the current row doesn't belong in the group being counted?

问题

Here's the code you can use to calculate the COUNT_PREV_WINDOW, COUNT_2WINDOWS_AGO, and COUNT_3WINDOWS_AGO columns without using loops:

import pandas as pd

data = {
    'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
    'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
    'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
    'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]
}

df = pd.DataFrame(data)

df['COUNT_PREV_WINDOW'] = df.groupby(['CARD_NUM', 'WINDOW'])['COUNT_CURRENT_WINDOW'].shift().fillna(0).astype(int)
df['COUNT_2WINDOWS_AGO'] = df.groupby(['CARD_NUM', 'WINDOW'])['COUNT_CURRENT_WINDOW'].shift(2).fillna(0).astype(int)
df['COUNT_3WINDOWS_AGO'] = df.groupby(['CARD_NUM', 'WINDOW'])['COUNT_CURRENT_WINDOW'].shift(3).fillna(0).astype(int)

print(df)

This code uses the shift function to access the values of the previous WINDOW groups and fills NaN values with 0. It then converts the result to integers to match your desired output format.

英文:

In python I would like to modify a dataset. I have a transaction based data set as per the code below:

data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}

WINDOW groups the transactions on each card by 48 hours. E.g. A transaction is attributed WINDOW = 1 if it falls in the the first 48 hours of transactions (i.e. the first transaction on a card is always WINDOW=1). WINDOW = 2 will include transactions in the second 48 hours (relative to the first transaction on that card).

COUNT_CURRENT_WINDOW is a cumulative count of transactions occuring in the current row's WINDOW group.

Using pandas (and preferably without loops), I'd like to code 3 new columns that display what the count of transactions was in the previous 3 WINDOW groups per card. I.e.

COUNT_PREV_WINDOW = total count of transactions on the card occuring in (current row's (WINDOW value) minus 1
COUNT_ 2WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 2
COUNT_3WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 3

There will be occasions where there were no transactions in the previous WINDOW group. E.g. When the current row's WINDOW =1. Another example is if the current row's WINDOW = 4 but no transactions have occured since WINDOW = 1. The desired output below demonstrates this for card=234).

The desired output should looks like the following with no changes to sorting:

data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2],
'COUNT_PREV_WINDOW ': [0, 0, 2, 2, 2, 3, 0, 0, 0, 0],
'COUNT_2WINDOWS_AGO ': [0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
'COUNT_3WINDOWS_AGO': [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]}

I would prefer not to use loops due to efficiency as it as the real dataset is very large.

I have tried the following and similar variations but it just counts the current group and outputs the max.:

df['COUNT_PREV_WINDOW'] = df.groupby(['CARD_NUM',df['WINDOW']-1])['COUNT_CURRENT_WINDOW'].max().fillna(0)

答案1

得分: 2

Sure, here's the translated code:

df1 = (pd.crosstab(df['CARD_NUM'], df['WINDOW'])
         .stack().groupby(level=0).shift(1).fillna(0).astype('int')
         .reset_index(name='COUNT_PREV_WINDOW')
         .assign(COUNT_2WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(1).fillna(0).astype('int'))
         .assign(COUNT_3WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(2).fillna(0).astype('int')))

# Merge df and df1
out = df.merge(df1, how='left')

Please note that I've kept the code brief as requested.

英文:

Code

df1 = (pd.crosstab(df['CARD_NUM'], df['WINDOW'])
         .stack().groupby(level=0).shift(1).fillna(0).astype('int')
         .reset_index(name='COUNT_PREV_WINDOW')
         .assign(COUNT_2WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(1).fillna(0).astype('int'))
         .assign(COUNT_3WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(2).fillna(0).astype('int')))

df1

CARD_NUM	WINDOW	COUNT_PREV_WINDOW	COUNT_2WINDOWS_AGO	COUNT_3WINDOWS_AGO
0	123	    1	    0	                0	                0
1	123	    2	    2	                0	                0
2	123	    3	    3	                2	                0
3	123	    4	    1	                3	                2
4	234	    1	    0	                0	                0
5	234	    2	    2	                0	                0
6	234	    3	    0	                2	                0
7	234	    4	    0	                0                	2

merge df and df1

out = df.merge(df1, how='left')

out(image)

如何将计数分配给一个新列,计算在当前行不属于被计算的组时的行数?

I also don't want to attach the output as image, but your column names are too long. It's inconvenient even when solve problem, please keep it brief from next time.

答案2

得分: 1

你可以在循环中使用一个映射的DataFrame和merge方法:

N_DAYS = 3

tmp = (df.groupby(['CARD_NUM', 'WINDOW'], as_index=False)
       ['COUNT_CURRENT_WINDOW'].size()
      )

for i in range(N_DAYS):
    df[f'COUNT_{i+1}WINDOWS_AGO' if i else 'COUNT_PREV_WINDOW'] = (
        df.merge(tmp.eval(f'WINDOW = WINDOW+{i+1}'),
                 on=['CARD_NUM', 'WINDOW'], how='left')
        ['size'].fillna(0, downcast='infer')
    )

输出结果:

  TRANSACTIONDATETIME  CARD_NUM  WINDOW  COUNT_CURRENT_WINDOW  COUNT_PREV_WINDOW  COUNT_2WINDOWS_AGO  COUNT_3WINDOWS_AGO
0            1-Jan-23       123       1                     1                  0                   0                   0
1            2-Jan-23       123       1                     2                  0                   0                   0
2            3-Jan-23       123       2                     1                  2                   0                   0
3            4-Jan-23       123       2                     2                  2                   0                   0
4            4-Jan-23       123       2                     3                  2                   0                   0
5            5-Jan-23       123       3                     1                  3                   2                   0
6            2-Jan-23       234       1                     1                  0                   0                   0
7            3-Jan-23       234       1                     2                  0                   0                   0
8            8-Jan-23       234       4                     1                  0                   0                   2
9            9-Jan-23       234       4                     2                  0                   0                   2

请注意,这是原始代码和输出的中文翻译。

英文:

You can use a mapping DataFrame and merge in a loop:

N_DAYS = 3

tmp = (df.groupby(['CARD_NUM', 'WINDOW'], as_index=False)
       ['COUNT_CURRENT_WINDOW'].size()
      )

for i in range(N_DAYS):
    df[f'COUNT_{i+1}WINDOWS_AGO' if i else 'COUNT_PREV_WINDOW'] = (
        df.merge(tmp.eval(f'WINDOW = WINDOW+{i+1}'),
                 on=['CARD_NUM', 'WINDOW'], how='left')
        ['size'].fillna(0, downcast='infer')
    )

Output:

  TRANSACTIONDATETIME  CARD_NUM  WINDOW  COUNT_CURRENT_WINDOW  COUNT_PREV_WINDOW  COUNT_2WINDOWS_AGO  COUNT_3WINDOWS_AGO
0            1-Jan-23       123       1                     1                  0                   0                   0
1            2-Jan-23       123       1                     2                  0                   0                   0
2            3-Jan-23       123       2                     1                  2                   0                   0
3            4-Jan-23       123       2                     2                  2                   0                   0
4            4-Jan-23       123       2                     3                  2                   0                   0
5            5-Jan-23       123       3                     1                  3                   2                   0
6            2-Jan-23       234       1                     1                  0                   0                   0
7            3-Jan-23       234       1                     2                  0                   0                   0
8            8-Jan-23       234       4                     1                  0                   0                   2
9            9-Jan-23       234       4                     2                  0                   0                   2

答案3

得分: 0

尝试使用 Series.shift

import pandas as pd
data = {
    'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
    'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
    'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
    'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}
df = pd.DataFrame(data)
for i in range(1, 4):
    df = pd.merge(df, df.groupby(['CARD_NUM', 'WINDOW']).count()['TRANSACTIONDATETIME'].shift(i).fillna(0).rename(f'COUNT_PREV_{i}_WINDOW'), left_on=['CARD_NUM', 'WINDOW'], right_index=True)
df

结果如下:

  TRANSACTIONDATETIME  CARD_NUM  WINDOW  COUNT_CURRENT_WINDOW  COUNT_PREV_1_WINDOW  COUNT_PREV_2_WINDOW  COUNT_PREV_3_WINDOW
0            1-Jan-23       123       1                     1                  0.0                  0.0                  0.0
1            2-Jan-23       123       1                     2                  0.0                  0.0                  0.0
2            3-Jan-23       123       2                     1                  2.0                  0.0                  0.0
3            4-Jan-23       123       2                     2                  2.0                  0.0                  0.0
4            4-Jan-23       123       2                     3                  2.0                  0.0                  0.0
5            5-Jan-23       123       3                     1                  3.0                  2.0                  0.0
6            2-Jan-23       234       1                     1                  1.0                  3.0                  2.0
7            3-Jan-23       234       1                     2                  1.0                  3.0                  2.0
8            8-Jan-23       234       4                     1                  2.0                  1.0                  3.0
9            9-Jan-23       234       4                     2                  2.0                  1.0                  3.0
英文:

Try Series.shift:

import pandas as pd
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}
df = pd.DataFrame(data)
for i in range(1,4):
    df = pd.merge(df, df.groupby(['CARD_NUM', 'WINDOW']).count()['TRANSACTIONDATETIME'].shift(i).fillna(0).rename(f'COUNT_PREV_{i}_WINDOW'), left_on=['CARD_NUM', 'WINDOW'], right_index=True)
df

This results in:

  TRANSACTIONDATETIME  CARD_NUM  WINDOW  COUNT_CURRENT_WINDOW   
0            1-Jan-23       123       1                     1  \
1            2-Jan-23       123       1                     2   
2            3-Jan-23       123       2                     1   
3            4-Jan-23       123       2                     2   
4            4-Jan-23       123       2                     3   
5            5-Jan-23       123       3                     1   
6            2-Jan-23       234       1                     1   
7            3-Jan-23       234       1                     2   
8            8-Jan-23       234       4                     1   
9            9-Jan-23       234       4                     2   

   COUNT_PREV_1_WINDOW  COUNT_PREV_2_WINDOW  COUNT_PREV_3_WINDOW  
0                  0.0                  0.0                  0.0  
1                  0.0                  0.0                  0.0  
2                  2.0                  0.0                  0.0  
3                  2.0                  0.0                  0.0  
4                  2.0                  0.0                  0.0  
5                  3.0                  2.0                  0.0  
6                  1.0                  3.0                  2.0  
7                  1.0                  3.0                  2.0  
8                  2.0                  1.0                  3.0  
9                  2.0                  1.0                  3.0  

huangapple
  • 本文由 发表于 2023年5月11日 13:46:36
  • 转载请务必保留本文链接:https://go.coder-hub.com/76224465.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定