使用pandas加速for循环进行特征工程

huangapple go评论70阅读模式
英文:

Speeding up for-loops using pandas for feature engineering

问题

我有一个包含以下标题的数据框:

  • 付款方
  • 收款国家
  • 付款日期

每一行显示一笔交易,例如一行(Bob,UK,2023年1月1日)表示付款方Bob于2023年1月1日向英国发送了一笔付款。

对于此表中的每一行,我需要找到该行的付款方在之前向该行的国家发送付款的次数。所以对于上面的行,我想要找到Bob在2023年1月1日之前向英国发送付款的次数。

这是为了进行特征工程。

我已经使用for循环来实现这一点,我迭代每一行,并对每一行进行pandas loc调用,以查找具有相同付款方和国家但较早日期的行,但对于我需要处理的行数来说,这太慢了。

有人能想到一种使用一些快速的pandas函数加快这个过程的方法吗?

谢谢!

英文:

I have a dataframe with the following headings:

  • payer
  • recipient_country
  • date of payment

Each rows shows a transaction, and a row (Bob,UK,1st January 2023) shows that a payer Bob sent a payment to the UK on 1st January 2023.

For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.

This is for feature engineering purposes.

I have done this using a for loop in which I iterate through rows and do a pandas loc call for each row to find rows with an earlier date with the same payer and country, but this is far too slow for the number of rows I have to process.

Can anyone think of a way to speed up this process using some fast pandas functions?

Thanks!

答案1

得分: 0

以下是代码的翻译部分:

测试这个示例数据框

df = pd.DataFrame(
    [{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)

只需按名称和国家分组并累计计数

>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
  name country       date  trns_bf
0  Bob      UK 2023-01-01        0
1  Bob      UK 2023-01-02        1
2  Bob      UK 2023-01-03        2
3  Cob      UK 2023-01-04        0
4  Cob      UK 2023-01-05        1
5  Cob      UK 2023-01-06        2
6  Cob      UK 2023-01-07        3

您需要首先排序以确保之前的元素不会与之后的元素混淆我根据您的问题字面理解之前”:例如2023年1月1日Bob前往英国的交易之前没有交易

每一行都会得到一个与该名称到该国家之前的日期的交易次数如果一天上有多次交易请确定您希望如何处理我可能会使用另一个分组并选择该天的最大值:`df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()`,然后将结果合并回去索引会使直接附加变得困难如上所示)。
英文:

Testing on this toy data frame:

df = pd.DataFrame(
[{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)

Just group by and cumulatively count:

>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
name country       date  trns_bf
0  Bob      UK 2023-01-01        0
1  Bob      UK 2023-01-02        1
2  Bob      UK 2023-01-03        2
3  Cob      UK 2023-01-04        0
4  Cob      UK 2023-01-05        1
5  Cob      UK 2023-01-06        2
6  Cob      UK 2023-01-07        3

You need to sort first, to ensure that elements before are not confused with elements after. I interpreted "prior" in your question literally: eg there are no transactions before Bob's transaction to the UK on 1 Jan 2023.

Each row gets its own count for transactions with that name to that country before that date. If there are multiple transactions on one day, determine how you want to deal with that. I would probably use another group by and select the maximum value for that day: df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max() and then merge the result back (indexing will make it difficult to attach directly as above).

huangapple
  • 本文由 发表于 2023年2月18日 01:49:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/75487675.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定