英文:
How do I assign a count to a new column, counting the rows in a groupby when the current row doesn't belong in the group being counted?
问题
Here's the code you can use to calculate the COUNT_PREV_WINDOW, COUNT_2WINDOWS_AGO, and COUNT_3WINDOWS_AGO columns without using loops:
import pandas as pd
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]
}
df = pd.DataFrame(data)
df['COUNT_PREV_WINDOW'] = df.groupby(['CARD_NUM', 'WINDOW'])['COUNT_CURRENT_WINDOW'].shift().fillna(0).astype(int)
df['COUNT_2WINDOWS_AGO'] = df.groupby(['CARD_NUM', 'WINDOW'])['COUNT_CURRENT_WINDOW'].shift(2).fillna(0).astype(int)
df['COUNT_3WINDOWS_AGO'] = df.groupby(['CARD_NUM', 'WINDOW'])['COUNT_CURRENT_WINDOW'].shift(3).fillna(0).astype(int)
print(df)
This code uses the shift
function to access the values of the previous WINDOW groups and fills NaN values with 0. It then converts the result to integers to match your desired output format.
英文:
In python I would like to modify a dataset. I have a transaction based data set as per the code below:
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}
WINDOW groups the transactions on each card by 48 hours. E.g. A transaction is attributed WINDOW = 1 if it falls in the the first 48 hours of transactions (i.e. the first transaction on a card is always WINDOW=1). WINDOW = 2 will include transactions in the second 48 hours (relative to the first transaction on that card).
COUNT_CURRENT_WINDOW is a cumulative count of transactions occuring in the current row's WINDOW group.
Using pandas (and preferably without loops), I'd like to code 3 new columns that display what the count of transactions was in the previous 3 WINDOW groups per card. I.e.
COUNT_PREV_WINDOW = total count of transactions on the card occuring in (current row's (WINDOW value) minus 1
COUNT_ 2WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 2
COUNT_3WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 3
There will be occasions where there were no transactions in the previous WINDOW group. E.g. When the current row's WINDOW =1. Another example is if the current row's WINDOW = 4 but no transactions have occured since WINDOW = 1. The desired output below demonstrates this for card=234).
The desired output should looks like the following with no changes to sorting:
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2],
'COUNT_PREV_WINDOW ': [0, 0, 2, 2, 2, 3, 0, 0, 0, 0],
'COUNT_2WINDOWS_AGO ': [0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
'COUNT_3WINDOWS_AGO': [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]}
I would prefer not to use loops due to efficiency as it as the real dataset is very large.
I have tried the following and similar variations but it just counts the current group and outputs the max.:
df['COUNT_PREV_WINDOW'] = df.groupby(['CARD_NUM',df['WINDOW']-1])['COUNT_CURRENT_WINDOW'].max().fillna(0)
答案1
得分: 2
Sure, here's the translated code:
df1 = (pd.crosstab(df['CARD_NUM'], df['WINDOW'])
.stack().groupby(level=0).shift(1).fillna(0).astype('int')
.reset_index(name='COUNT_PREV_WINDOW')
.assign(COUNT_2WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(1).fillna(0).astype('int'))
.assign(COUNT_3WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(2).fillna(0).astype('int')))
# Merge df and df1
out = df.merge(df1, how='left')
Please note that I've kept the code brief as requested.
英文:
Code
df1 = (pd.crosstab(df['CARD_NUM'], df['WINDOW'])
.stack().groupby(level=0).shift(1).fillna(0).astype('int')
.reset_index(name='COUNT_PREV_WINDOW')
.assign(COUNT_2WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(1).fillna(0).astype('int'))
.assign(COUNT_3WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(2).fillna(0).astype('int')))
df1
CARD_NUM WINDOW COUNT_PREV_WINDOW COUNT_2WINDOWS_AGO COUNT_3WINDOWS_AGO
0 123 1 0 0 0
1 123 2 2 0 0
2 123 3 3 2 0
3 123 4 1 3 2
4 234 1 0 0 0
5 234 2 2 0 0
6 234 3 0 2 0
7 234 4 0 0 2
merge df and df1
out = df.merge(df1, how='left')
out
(image)
I also don't want to attach the output as image, but your column names are too long. It's inconvenient even when solve problem, please keep it brief from next time.
答案2
得分: 1
你可以在循环中使用一个映射的DataFrame和merge
方法:
N_DAYS = 3
tmp = (df.groupby(['CARD_NUM', 'WINDOW'], as_index=False)
['COUNT_CURRENT_WINDOW'].size()
)
for i in range(N_DAYS):
df[f'COUNT_{i+1}WINDOWS_AGO' if i else 'COUNT_PREV_WINDOW'] = (
df.merge(tmp.eval(f'WINDOW = WINDOW+{i+1}'),
on=['CARD_NUM', 'WINDOW'], how='left')
['size'].fillna(0, downcast='infer')
)
输出结果:
TRANSACTIONDATETIME CARD_NUM WINDOW COUNT_CURRENT_WINDOW COUNT_PREV_WINDOW COUNT_2WINDOWS_AGO COUNT_3WINDOWS_AGO
0 1-Jan-23 123 1 1 0 0 0
1 2-Jan-23 123 1 2 0 0 0
2 3-Jan-23 123 2 1 2 0 0
3 4-Jan-23 123 2 2 2 0 0
4 4-Jan-23 123 2 3 2 0 0
5 5-Jan-23 123 3 1 3 2 0
6 2-Jan-23 234 1 1 0 0 0
7 3-Jan-23 234 1 2 0 0 0
8 8-Jan-23 234 4 1 0 0 2
9 9-Jan-23 234 4 2 0 0 2
请注意,这是原始代码和输出的中文翻译。
英文:
You can use a mapping DataFrame and merge
in a loop:
N_DAYS = 3
tmp = (df.groupby(['CARD_NUM', 'WINDOW'], as_index=False)
['COUNT_CURRENT_WINDOW'].size()
)
for i in range(N_DAYS):
df[f'COUNT_{i+1}WINDOWS_AGO' if i else 'COUNT_PREV_WINDOW'] = (
df.merge(tmp.eval(f'WINDOW = WINDOW+{i+1}'),
on=['CARD_NUM', 'WINDOW'], how='left')
['size'].fillna(0, downcast='infer')
)
Output:
TRANSACTIONDATETIME CARD_NUM WINDOW COUNT_CURRENT_WINDOW COUNT_PREV_WINDOW COUNT_2WINDOWS_AGO COUNT_3WINDOWS_AGO
0 1-Jan-23 123 1 1 0 0 0
1 2-Jan-23 123 1 2 0 0 0
2 3-Jan-23 123 2 1 2 0 0
3 4-Jan-23 123 2 2 2 0 0
4 4-Jan-23 123 2 3 2 0 0
5 5-Jan-23 123 3 1 3 2 0
6 2-Jan-23 234 1 1 0 0 0
7 3-Jan-23 234 1 2 0 0 0
8 8-Jan-23 234 4 1 0 0 2
9 9-Jan-23 234 4 2 0 0 2
答案3
得分: 0
尝试使用 Series.shift
:
import pandas as pd
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}
df = pd.DataFrame(data)
for i in range(1, 4):
df = pd.merge(df, df.groupby(['CARD_NUM', 'WINDOW']).count()['TRANSACTIONDATETIME'].shift(i).fillna(0).rename(f'COUNT_PREV_{i}_WINDOW'), left_on=['CARD_NUM', 'WINDOW'], right_index=True)
df
结果如下:
TRANSACTIONDATETIME CARD_NUM WINDOW COUNT_CURRENT_WINDOW COUNT_PREV_1_WINDOW COUNT_PREV_2_WINDOW COUNT_PREV_3_WINDOW
0 1-Jan-23 123 1 1 0.0 0.0 0.0
1 2-Jan-23 123 1 2 0.0 0.0 0.0
2 3-Jan-23 123 2 1 2.0 0.0 0.0
3 4-Jan-23 123 2 2 2.0 0.0 0.0
4 4-Jan-23 123 2 3 2.0 0.0 0.0
5 5-Jan-23 123 3 1 3.0 2.0 0.0
6 2-Jan-23 234 1 1 1.0 3.0 2.0
7 3-Jan-23 234 1 2 1.0 3.0 2.0
8 8-Jan-23 234 4 1 2.0 1.0 3.0
9 9-Jan-23 234 4 2 2.0 1.0 3.0
英文:
Try Series.shift
:
import pandas as pd
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}
df = pd.DataFrame(data)
for i in range(1,4):
df = pd.merge(df, df.groupby(['CARD_NUM', 'WINDOW']).count()['TRANSACTIONDATETIME'].shift(i).fillna(0).rename(f'COUNT_PREV_{i}_WINDOW'), left_on=['CARD_NUM', 'WINDOW'], right_index=True)
df
This results in:
TRANSACTIONDATETIME CARD_NUM WINDOW COUNT_CURRENT_WINDOW
0 1-Jan-23 123 1 1 \
1 2-Jan-23 123 1 2
2 3-Jan-23 123 2 1
3 4-Jan-23 123 2 2
4 4-Jan-23 123 2 3
5 5-Jan-23 123 3 1
6 2-Jan-23 234 1 1
7 3-Jan-23 234 1 2
8 8-Jan-23 234 4 1
9 9-Jan-23 234 4 2
COUNT_PREV_1_WINDOW COUNT_PREV_2_WINDOW COUNT_PREV_3_WINDOW
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 2.0 0.0 0.0
3 2.0 0.0 0.0
4 2.0 0.0 0.0
5 3.0 2.0 0.0
6 1.0 3.0 2.0
7 1.0 3.0 2.0
8 2.0 1.0 3.0
9 2.0 1.0 3.0
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论