Speeding up for-loops using pandas for feature engineering
Question
I have a dataframe with the following headings:
- payer
- recipient_country
- date of payment
Each row shows a transaction; for example, the row (Bob, UK, 1st January 2023) shows that a payer, Bob, sent a payment to the UK on 1st January 2023.
For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.
This is for feature engineering purposes.
I have done this using a for loop in which I iterate through rows and do a pandas loc call for each row to find rows with an earlier date with the same payer and country, but this is far too slow for the number of rows I have to process.
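Roughly, my current approach is the sketch below (payer, recipient_country and payment_date stand in for my real column names, and prior_payments is the feature I am building):
# Slow approach: for every row, filter the whole frame for rows with the
# same payer, the same recipient country and a strictly earlier date.
counts = []
for _, row in df.iterrows():
    earlier = df.loc[
        (df['payer'] == row['payer'])
        & (df['recipient_country'] == row['recipient_country'])
        & (df['payment_date'] < row['payment_date'])
    ]
    counts.append(len(earlier))
df['prior_payments'] = counts
This does one full scan of the dataframe per row, so the cost grows quadratically with the number of rows.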
Can anyone think of a way to speed up this process using some fast pandas functions?
Thanks!
Answer 1
Score: 0
Testing on this toy data frame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
[{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
{'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)
Just group by and cumulatively count:
>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
>>> df
name country date trns_bf
0 Bob UK 2023-01-01 0
1 Bob UK 2023-01-02 1
2 Bob UK 2023-01-03 2
3 Cob UK 2023-01-04 0
4 Cob UK 2023-01-05 1
5 Cob UK 2023-01-06 2
6 Cob UK 2023-01-07 3
You need to sort by date first, so that earlier transactions are not confused with later ones. I interpreted "prior" in your question literally: e.g. there are no transactions before Bob's transaction to the UK on 1 January 2023, so that row gets 0.
Each row gets its own count of earlier transactions by that name to that country. If there can be multiple transactions on one day, decide how you want to handle them. I would probably use another group-by and take the maximum value for each day, df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back (unlike the cumcount above, that result no longer aligns with the original index, so it cannot simply be assigned as a new column).
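As a rough sketch of that same-day handling, reusing the toy df and the trns_bf column from above:
# Collapse to one count per (name, country, date), then merge it back so
# that all transactions on the same day share the same value.
daily = df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()
df = df.drop(columns='trns_bf').merge(daily, on=['name', 'country', 'date'], how='left')
With the toy data every date is unique, so this changes nothing there; it only matters when a payer makes several payments to the same country on the same day.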