How to optimise pandas iterrows with large df
Question
I have two dataframes: customers (unique on "customer_id") with 30k rows, and transactions with 2 million rows. For each row in customers I am trying to grab a selection from transactions based on conditions that depend on that customer's row. I have tried the naive code below, but it is far too slow... I am unsure where to optimise.
for index, row in customers.iterrows():
    selection = transactions.loc[
        (transactions["customer_id"] == row["customer_id"])
        & ((transactions["purchased_ts"] - row["signup_ts"]).dt.days > 0)
        & ((transactions["purchased_ts"] - row["signup_ts"]).dt.days < 183)
    ]
    total_amount = float(selection["amount"].sum())
    customers.loc[customers["customer_id"] == row["customer_id"], "total_amount"] = total_amount
Answer 1
Score: 3
A DataFrame is a data structure built specifically to favour vectorised, table-wise operations. When you iterate over it you lose those advantages (in most cases).
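As a tiny illustration of the gap (a sketch with made-up data; exact speedups depend on size and hardware), compare a Python-level loop with the equivalent vectorised mask:

```python
import numpy as np
import pandas as pd

# Made-up data: integer values so both paths sum exactly.
s = pd.Series(np.arange(100_000, dtype="int64"))

# Row-by-row Python loop: one interpreter step per element.
loop_total = 0
for v in s:
    if v > 50_000:
        loop_total += v

# Vectorised equivalent: the comparison and the sum each run once in C.
vec_total = int(s[s > 50_000].sum())

assert loop_total == vec_total  # same answer, far fewer interpreter steps
```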
Generate the data:
import pandas as pd
import numpy as np

# Create customers dataframe
num_customers = 100
customer_ids = np.arange(1, num_customers + 1)
signup_timestamps = pd.date_range(start='2023-01-01', periods=num_customers, freq='D')

customers = pd.DataFrame({
    'customer_id': customer_ids,
    'signup_timestamp': signup_timestamps
})

# Create transactions dataframe
num_transactions = 500
transaction_customer_ids = np.random.choice(customer_ids, size=num_transactions)
purchased_timestamps = pd.date_range(start='2023-01-01', periods=num_transactions, freq='H')
amounts = np.random.uniform(low=10, high=1000, size=num_transactions)

transactions = pd.DataFrame({
    'customer_id': transaction_customer_ids,
    'purchased_timestamp': purchased_timestamps,
    'amount': amounts
})

# Print the dataframes
print("Customers DataFrame:")
print(customers.head())
print("\nTransactions DataFrame:")
print(transactions.head())
Initial dataframes:
Customers DataFrame:
customer_id signup_timestamp
0 1 2023-01-01
1 2 2023-01-02
2 3 2023-01-03
3 4 2023-01-04
4 5 2023-01-05
Transactions DataFrame:
customer_id purchased_timestamp amount
0 17 2023-01-01 00:00:00 771.878903
1 14 2023-01-01 01:00:00 599.129092
2 76 2023-01-01 02:00:00 418.872408
3 86 2023-01-01 03:00:00 741.925505
4 41 2023-01-01 04:00:00 229.850105
Note: Please always provide an MRE like this one, so that users can focus on the solution rather than on recreating your data.
Solution:
# Merge dataframes
selection = pd.merge(customers, transactions, on=["customer_id"])

# Get the date difference and filter
selection['date_diff'] = selection['purchased_timestamp'] - selection['signup_timestamp']
selection['date_diff'] = selection['date_diff'].dt.days
selection = selection.loc[selection['date_diff'].between(0, 183)]

print("Merged DataFrame:")
print(selection.head())
Combined dataframes:
customer_id signup_timestamp purchased_timestamp amount date_diff
0 1 2023-01-01 2023-01-03 08:00:00 77.277407 2
1 1 2023-01-01 2023-01-08 01:00:00 40.460298 7
2 1 2023-01-01 2023-01-11 04:00:00 613.818700 10
3 1 2023-01-01 2023-01-11 11:00:00 310.237391 10
4 1 2023-01-01 2023-01-21 01:00:00 287.355806 20
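One subtlety worth noting: between(0, 183) is inclusive on both ends by default, while the question's loop used strict inequalities (> 0 and < 183). If the boundaries matter, the inclusive parameter of Series.between can reproduce the strict version (a small sketch with toy values):

```python
import pandas as pd

date_diff = pd.Series([0, 1, 182, 183])

default_mask = date_diff.between(0, 183)                      # keeps 0 and 183
strict_mask = date_diff.between(0, 183, inclusive='neither')  # matches > 0 and < 183

print(default_mask.sum(), strict_mask.sum())  # 4 2
```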
Updating customers table:
# Compute "total_amount"
total_dict = selection.groupby('customer_id')['amount'].sum().to_dict()
# Map the values to customers table.
customers['total_amount'] = customers['customer_id'].map(total_dict)
print(customers)
Final Output:
customer_id signup_timestamp total_amount
0 1 2023-01-01 1329.149603
1 2 2023-01-02 1650.039620
2 3 2023-01-03 2422.828485
3 4 2023-01-04 2768.061753
4 5 2023-01-05 358.508196
.. ... ... ...
95 96 2023-04-06 NaN
96 97 2023-04-07 NaN
97 98 2023-04-08 NaN
98 99 2023-04-09 NaN
99 100 2023-04-10 NaN
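The NaN rows above are customers with no qualifying transactions. If those should read 0 instead, the mapped column can be filled in afterwards (a small addition, not part of the answer above, shown here on hypothetical totals):

```python
import pandas as pd

# Hypothetical mapped totals: customer 3 had no qualifying transactions.
customers = pd.DataFrame({'customer_id': [1, 2, 3]})
total_dict = {1: 100.0, 2: 250.0}

# map() leaves NaN for unmatched ids; fillna(0.0) turns those into zero.
customers['total_amount'] = customers['customer_id'].map(total_dict).fillna(0.0)
print(customers['total_amount'].tolist())  # [100.0, 250.0, 0.0]
```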
Note: This simply illustrates the mindset you should have when working with dataframes. It can be optimised further in many ways; my goal here is to show an alternative to iterating over the rows, not to provide the most efficient method.
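For instance, one possible refinement (a sketch on toy frames, not the answer's code): merge only the columns actually needed, filter, and aggregate in one pass. Series.map also accepts the grouped Series directly, so the intermediate dict can be skipped:

```python
import pandas as pd

# Toy frames mirroring the shapes in the question (column names as above).
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'signup_timestamp': pd.to_datetime(['2023-01-01', '2023-01-02']),
})
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'purchased_timestamp': pd.to_datetime(['2023-01-02', '2023-09-01', '2023-01-03']),
    'amount': [10.0, 20.0, 30.0],
})

# Merge only the columns needed, filter on the day difference, aggregate.
merged = transactions.merge(customers[['customer_id', 'signup_timestamp']], on='customer_id')
days = (merged['purchased_timestamp'] - merged['signup_timestamp']).dt.days
totals = merged.loc[days.between(0, 183)].groupby('customer_id')['amount'].sum()

# map() accepts the grouped Series directly; no to_dict() needed.
customers['total_amount'] = customers['customer_id'].map(totals)
print(customers['total_amount'].tolist())  # [10.0, 30.0]
```

The 2023-09-01 purchase falls 242 days after signup, so it is filtered out; only the in-window amounts survive the groupby.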
Comments