How to optimise pandas iterrows with large df
Question
I have two dataframes: customers (unique on "customer_id") with 30k rows, and transactions with 2 million rows. For each row in customers I am trying to grab a selection from transactions based on conditions that depend on that customer's row. I have tried the naive code below, but it is far too slow... I am unsure where to optimise.
for index, row in customers.iterrows():
    selection = transactions.loc[
        (transactions["customer_id"] == row["customer_id"])
        & ((transactions["purchased_ts"] - row["signup_ts"]).dt.days > 0)
        & ((transactions["purchased_ts"] - row["signup_ts"]).dt.days < 183)
    ]
    total_amount = float(selection["amount"].sum())
    customers.loc[customers["customer_id"] == row["customer_id"], "total_amount"] = total_amount
Answer 1
Score: 3
A DataFrame is a data structure built specifically to favour vectorised, table-wise operations. When you iterate over it you lose those advantages (in most cases).
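As a tiny illustration of the gap (a sketch with made-up data; exact speedups depend on size and hardware), compare a Python-level loop with the equivalent vectorised mask:

```python
import numpy as np
import pandas as pd

# Made-up data: integer values so both paths sum exactly.
s = pd.Series(np.arange(100_000, dtype="int64"))

# Row-by-row Python loop: one interpreter step per element.
loop_total = 0
for v in s:
    if v > 50_000:
        loop_total += v

# Vectorised equivalent: the comparison and the sum each run once in C.
vec_total = int(s[s > 50_000].sum())

assert loop_total == vec_total  # same answer, far fewer interpreter steps
```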
Generate the data:
import pandas as pd
import numpy as np

# Create customers dataframe
num_customers = 100
customer_ids = np.arange(1, num_customers + 1)
signup_timestamps = pd.date_range(start='2023-01-01', periods=num_customers, freq='D')

customers = pd.DataFrame({
    'customer_id': customer_ids,
    'signup_timestamp': signup_timestamps
})

# Create transactions dataframe
num_transactions = 500
transaction_customer_ids = np.random.choice(customer_ids, size=num_transactions)
purchased_timestamps = pd.date_range(start='2023-01-01', periods=num_transactions, freq='H')
amounts = np.random.uniform(low=10, high=1000, size=num_transactions)

transactions = pd.DataFrame({
    'customer_id': transaction_customer_ids,
    'purchased_timestamp': purchased_timestamps,
    'amount': amounts
})

# Print the dataframes
print("Customers DataFrame:")
print(customers.head())
print("\nTransactions DataFrame:")
print(transactions.head())
Initial dataframes:
Customers DataFrame:
customer_id signup_timestamp
0 1 2023-01-01
1 2 2023-01-02
2 3 2023-01-03
3 4 2023-01-04
4 5 2023-01-05
Transactions DataFrame:
customer_id purchased_timestamp amount
0 17 2023-01-01 00:00:00 771.878903
1 14 2023-01-01 01:00:00 599.129092
2 76 2023-01-01 02:00:00 418.872408
3 86 2023-01-01 03:00:00 741.925505
4 41 2023-01-01 04:00:00 229.850105
Note: Please always provide an MRE like this one, so that users can focus on the solution rather than on recreating your data.
Solution:
# Merge dataframes
selection = pd.merge(customers, transactions, on=["customer_id"])

# Get the date difference and filter
selection['date_diff'] = selection['purchased_timestamp'] - selection['signup_timestamp']
selection['date_diff'] = selection['date_diff'].dt.days
selection = selection.loc[selection['date_diff'].between(0, 183)]

print("Merged DataFrame:")
print(selection.head())
Combined dataframes:
customer_id signup_timestamp purchased_timestamp amount date_diff
0 1 2023-01-01 2023-01-03 08:00:00 77.277407 2
1 1 2023-01-01 2023-01-08 01:00:00 40.460298 7
2 1 2023-01-01 2023-01-11 04:00:00 613.818700 10
3 1 2023-01-01 2023-01-11 11:00:00 310.237391 10
4 1 2023-01-01 2023-01-21 01:00:00 287.355806 20
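One subtlety worth noting: between(0, 183) is inclusive on both ends by default, while the question's loop used strict inequalities (> 0 and < 183). If the boundaries matter, the inclusive parameter of Series.between can reproduce the strict version (a small sketch with toy values):

```python
import pandas as pd

date_diff = pd.Series([0, 1, 182, 183])

default_mask = date_diff.between(0, 183)                      # keeps 0 and 183
strict_mask = date_diff.between(0, 183, inclusive='neither')  # matches > 0 and < 183

print(default_mask.sum(), strict_mask.sum())  # 4 2
```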
Updating customers table:
# Compute "total_amount"
total_dict = selection.groupby('customer_id')['amount'].sum().to_dict()
# Map the values to customers table.
customers['total_amount'] = customers['customer_id'].map(total_dict)
print(customers)
Final Output:
customer_id signup_timestamp total_amount
0 1 2023-01-01 1329.149603
1 2 2023-01-02 1650.039620
2 3 2023-01-03 2422.828485
3 4 2023-01-04 2768.061753
4 5 2023-01-05 358.508196
.. ... ... ...
95 96 2023-04-06 NaN
96 97 2023-04-07 NaN
97 98 2023-04-08 NaN
98 99 2023-04-09 NaN
99 100 2023-04-10 NaN
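The NaN rows above are customers with no qualifying transactions. If those should read 0 instead, the mapped column can be filled in afterwards (a small addition, not part of the answer above, shown here on hypothetical totals):

```python
import pandas as pd

# Hypothetical mapped totals: customer 3 had no qualifying transactions.
customers = pd.DataFrame({'customer_id': [1, 2, 3]})
total_dict = {1: 100.0, 2: 250.0}

# map() leaves NaN for unmatched ids; fillna(0.0) turns those into zero.
customers['total_amount'] = customers['customer_id'].map(total_dict).fillna(0.0)
print(customers['total_amount'].tolist())  # [100.0, 250.0, 0.0]
```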
Note: This simply illustrates the mindset you should have when working with dataframes. It can be optimised further in many ways; my goal here is to show an alternative to iterating over the rows, not to provide the most efficient method.
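For instance, one possible refinement (a sketch on toy frames, not the answer's code): merge only the columns actually needed, filter, and aggregate in one pass. Series.map also accepts the grouped Series directly, so the intermediate dict can be skipped:

```python
import pandas as pd

# Toy frames mirroring the shapes in the question (column names as above).
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'signup_timestamp': pd.to_datetime(['2023-01-01', '2023-01-02']),
})
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'purchased_timestamp': pd.to_datetime(['2023-01-02', '2023-09-01', '2023-01-03']),
    'amount': [10.0, 20.0, 30.0],
})

# Merge only the columns needed, filter on the day difference, aggregate.
merged = transactions.merge(customers[['customer_id', 'signup_timestamp']], on='customer_id')
days = (merged['purchased_timestamp'] - merged['signup_timestamp']).dt.days
totals = merged.loc[days.between(0, 183)].groupby('customer_id')['amount'].sum()

# map() accepts the grouped Series directly; no to_dict() needed.
customers['total_amount'] = customers['customer_id'].map(totals)
print(customers['total_amount'].tolist())  # [10.0, 30.0]
```

The 2023-09-01 purchase falls 242 days after signup, so it is filtered out; only the in-window amounts survive the groupby.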
Comments