Speeding up for-loops using pandas for feature engineering
Question
I have a dataframe with the following headings:
- payer
- recipient_country
- date of payment
Each row shows a transaction; for example, the row (Bob, UK, 1st January 2023) shows that a payer, Bob, sent a payment to the UK on 1st January 2023.
For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.
This is for feature engineering purposes.
I have done this using a for loop in which I iterate through rows and do a pandas loc call for each row to find rows with an earlier date with the same payer and country, but this is far too slow for the number of rows I have to process.
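Roughly, my current approach is the sketch below (payer, recipient_country and payment_date stand in for my real column names, and prior_payments is the feature I am building):
# Slow approach: for every row, filter the whole frame for rows with the
# same payer, the same recipient country and a strictly earlier date.
counts = []
for _, row in df.iterrows():
    earlier = df.loc[
        (df['payer'] == row['payer'])
        & (df['recipient_country'] == row['recipient_country'])
        & (df['payment_date'] < row['payment_date'])
    ]
    counts.append(len(earlier))
df['prior_payments'] = counts
This does one full scan of the dataframe per row, so the cost grows quadratically with the number of rows.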
Can anyone think of a way to speed up this process using some fast pandas functions?
Thanks!
Answer 1
Score: 0
Testing on this toy data frame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
[{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)
Just group by and cumulatively count:
>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
>>> df
name country date trns_bf
0 Bob UK 2023-01-01 0
1 Bob UK 2023-01-02 1
2 Bob UK 2023-01-03 2
3 Cob UK 2023-01-04 0
4 Cob UK 2023-01-05 1
5 Cob UK 2023-01-06 2
6 Cob UK 2023-01-07 3
You need to sort by date first, so that earlier transactions are not confused with later ones. I interpreted "prior" in your question literally: e.g. there are no transactions before Bob's transaction to the UK on 1 January 2023, so that row gets 0.
Each row gets its own count of earlier transactions by that name to that country. If there can be multiple transactions on one day, decide how you want to handle them. I would probably use another group-by and take the maximum value for each day, df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back (unlike the cumcount above, that result no longer aligns with the original index, so it cannot simply be assigned as a new column).
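As a rough sketch of that same-day handling, reusing the toy df and the trns_bf column from above:
# Collapse to one count per (name, country, date), then merge it back so
# that all transactions on the same day share the same value.
daily = df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()
df = df.drop(columns='trns_bf').merge(daily, on=['name', 'country', 'date'], how='left')
With the toy data every date is unique, so this changes nothing there; it only matters when a payer makes several payments to the same country on the same day.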