How to optimise pandas iterrows with large df

Question

I have two dataframes: customers (unique on "customer_id") with 30k rows, and transactions with 2 million rows. For each row in customers I am trying to grab a selection from transactions based on conditions that depend on the given customers row. I have tried the naive code below, but it is far too slow... I am unsure where to optimise.

for index, row in customers.iterrows():
    selection = transactions.loc[(transactions["customer_id"] == row["customer_id"]) &
                                 ((transactions["purchased_ts"] - row["signup_ts"]).dt.days > 0) &
                                 ((transactions["purchased_ts"] - row["signup_ts"]).dt.days < 183)
                                 , :]
    total_amount = float(selection["amount"].sum())
    customers.loc[customers["customer_id"] == row["customer_id"], "total_amount"] = total_amount

Answer 1

Score: 3

A DataFrame is a data structure built specifically to favour vectorised, table-wise operations. When you iterate over it row by row, you lose those advantages (in most cases).
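
As a rough illustration of the gap (a toy sketch of mine, not part of the original answer), compare summing a column row by row with letting pandas do it in a single call:

import pandas as pd
import numpy as np

df = pd.DataFrame({"amount": np.random.rand(1_000_000)})

# Row-by-row: a Python-level loop over iterrows(), very slow
total = 0.0
for _, row in df.iterrows():
    total += row["amount"]

# Vectorised: one call into optimised C code, orders of magnitude faster
total = df["amount"].sum()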

Generate the data:

import pandas as pd
import numpy as np

# Create customers dataframe
num_customers = 100
customer_ids = np.arange(1, num_customers + 1)
signup_timestamps = pd.date_range(start='2023-01-01', periods=num_customers, freq='D')

customers = pd.DataFrame({
    'customer_id': customer_ids,
    'signup_timestamp': signup_timestamps
})

# Create transactions dataframe
num_transactions = 500
transaction_customer_ids = np.random.choice(customer_ids, size=num_transactions)
purchased_timestamps = pd.date_range(start='2023-01-01', periods=num_transactions, freq='H')
amounts = np.random.uniform(low=10, high=1000, size=num_transactions)

transactions = pd.DataFrame({
    'customer_id': transaction_customer_ids,
    'purchased_timestamp': purchased_timestamps,
    'amount': amounts
})

# Print the dataframes
print("Customers DataFrame:")
print(customers.head())

print("\n交易数据框:")
print(transactions.head())

Initial dataframes:

Customers DataFrame:
customer_id signup_timestamp
0            1       2023-01-01
1            2       2023-01-02
2            3       2023-01-03
3            4       2023-01-04
4            5       2023-01-05

Transactions DataFrame:
customer_id purchased_timestamp      amount
0           17 2023-01-01 00:00:00  771.878903
1           14 2023-01-01 01:00:00  599.129092
2           76 2023-01-01 02:00:00  418.872408
3           86 2023-01-01 03:00:00  741.925505
4           41 2023-01-01 04:00:00  229.850105

Note: Please always provide an MRE like this one, so that users can focus on the solution rather than on recreating your data.

Solution:

# Merge the dataframes
selection = pd.merge(customers, transactions, on=["customer_id"])

# Get the date difference & filter
selection['date_diff'] = selection['purchased_timestamp'] - selection['signup_timestamp']
selection['date_diff'] = selection['date_diff'].dt.days
selection = selection.loc[selection['date_diff'].between(0, 183)]

print("顾客数据框:")
print(selection.head())

Combined dataframe:

customer_id signup_timestamp purchased_timestamp      amount  date_diff
0            1       2023-01-01 2023-01-03 08:00:00   77.277407          2
1            1       2023-01-01 2023-01-08 01:00:00   40.460298          7
2            1       2023-01-01 2023-01-11 04:00:00  613.818700         10
3            1       2023-01-01 2023-01-11 11:00:00  310.237391         10
4            1       2023-01-01 2023-01-21 01:00:00  287.355806         20
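
One detail worth flagging (my observation, not part of the original answer): between(0, 183) is inclusive on both ends, whereas the question filtered with a strict > 0 and < 183. If the strict bounds matter, the filter could instead be written as:

# Strict bounds, matching the question's (diff > 0) & (diff < 183)
selection = selection.loc[(selection['date_diff'] > 0) & (selection['date_diff'] < 183)]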

Updating the customers table:

# Compute "total_amount" per customer
total_dict = selection.groupby('customer_id')['amount'].sum().to_dict()
# Map the values onto the customers table.
customers['total_amount'] = customers['customer_id'].map(total_dict)
print(customers)
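
A small side note (my own variation, not from the answer): the intermediate dict is not strictly required, since Series.map also accepts a Series and aligns on its index, so the last two steps could be collapsed into:

customers['total_amount'] = customers['customer_id'].map(
    selection.groupby('customer_id')['amount'].sum()
)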

Final output:

customer_id signup_timestamp  total_amount
0             1       2023-01-01   1329.149603
1             2       2023-01-02   1650.039620
2             3       2023-01-03   2422.828485
3             4       2023-01-04   2768.061753
4             5       2023-01-05    358.508196
..          ...              ...           ...
95           96       2023-04-06           NaN
96           97       2023-04-07           NaN
97           98       2023-04-08           NaN
98           99       2023-04-09           NaN
99          100       2023-04-10           NaN
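
Customers with no qualifying transactions come out as NaN. If zeros are preferred for those rows (an optional tweak, not part of the original answer), one extra line would do it:

# Treat customers without qualifying transactions as 0 instead of NaN
customers['total_amount'] = customers['total_amount'].fillna(0)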

Note: This simply illustrates the mindset you should have when working with dataframes. It can be optimised further in many ways. My goal here is to show an alternative to iterating over the rows, not to provide the most efficient method.
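
For example, one more compact variant (a sketch using the same column names as above, not a benchmarked recommendation) chains the merge, filter and aggregation so no named intermediate frame is kept around:

totals = (
    transactions
    .merge(customers, on='customer_id')
    .assign(date_diff=lambda d: (d['purchased_timestamp'] - d['signup_timestamp']).dt.days)
    .query('date_diff > 0 and date_diff < 183')  # strict bounds, as in the question
    .groupby('customer_id')['amount']
    .sum()
)
customers['total_amount'] = customers['customer_id'].map(totals)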

