加速遍历DataFrame时使用NumPy的lambda行吗?

huangapple go评论79阅读模式
英文:

Speed up iteration over dataframe lambda row with numpy?

问题

以下是您的代码,经过翻译:

###########################################################
### 回测优化器 - 暴力方法 - 三个指标 ###
###########################################################

import pandas as pd
import itertools
import time
import numpy as np

pd.set_option('display.max_rows', 15)
tz = timezone('America/Denver')

x = 'QQQ'
trades_threshold = 1000 

print('回测优化器开始于', datetime.now(tz).strftime('%H:%M:%S'))
checkpoint_01 = time.perf_counter()

filename = str('/content/'+x+'_summarized_trades.csv')
summarized_trades = pd.read_csv(filename)
print('交易数量', len(summarized_trades))

##############################
list_of_indicators = ['rsi','std_pct','atr_5m/last_trade']
##############################

indicator_01 = list_of_indicators[0]
print('指标01是', indicator_01)
indicator_02 = list_of_indicators[1]
print('指标02是', indicator_02)
indicator_03 = list_of_indicators[2]
print('指标03是', indicator_03)

summarized_trades['indicator_01_quantile'] = pd.qcut(summarized_trades[indicator_01].values, 10).codes
summarized_trades['indicator_02_quantile'] = pd.qcut(summarized_trades[indicator_02].values, 10).codes
summarized_trades['indicator_03_quantile'] = pd.qcut(summarized_trades[indicator_03].values, 10).codes
summarized_trades['indicator_01_quantile_min'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.left)
summarized_trades['indicator_01_quantile_max'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.right)
summarized_trades['indicator_02_quantile_min'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.left)
summarized_trades['indicator_02_quantile_max'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.right)
summarized_trades['indicator_03_quantile_min'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.left)
summarized_trades['indicator_03_quantile_max'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.right)

summarized_trades['number_of_trades'] = 1
summarized_trades['number_of_winners'] = summarized_trades.net_pnl.apply(lambda x: 1 if x > 0 else 0)

### 将类似的分位数桶进行分组
grouped_trades = summarized_trades.groupby(by=['indicator_01_quantile','indicator_02_quantile','indicator_03_quantile']).agg({'net_pnl':'sum','number_of_trades':'sum','number_of_winners':'sum','indicator_01_quantile_min':'min','indicator_01_quantile_max':'max','indicator_02_quantile_min':'min','indicator_02_quantile_max':'max','indicator_03_quantile_min':'min','indicator_03_quantile_max':'max'})
grouped_trades = grouped_trades.reset_index()  # 需要这样做,否则不可哈希化
print('分组交易数', len(grouped_trades))
display(grouped_trades)

output = []
default_possibilities = [0,1,2,3,4,5,6,7,8,9] # 对于使用10个分位数的情况有效
possibilities = list(itertools.product(default_possibilities,default_possibilities,default_possibilities,default_possibilities,default_possibilities,default_possibilities))  # 每个指标都可以有10x10的排列组合
print('排列组合长度', len(possibilities))
possibilities = [item for item in possibilities if item[1]>=item[0] and item[3]>=item[2] and item[5]>=item[4]] # 只保留有意义的组合(例如,技术指标>2且<4,但不是>6和<3)
print('修订后的排列组合长度', len(possibilities))

### 遍历有意义的组合
for i in possibilities:
  a = i[0]
  b = i[1]
  c = i[2]
  d = i[3]
  e = i[4]
  f = i[5]
  ### 为每一行创建新列以进行过滤,如果行是YES并且满足最小交易阈值,则将结果输出到数据框中
  ### 使用NUMPY而不是LAMBDA ROW??
  grouped_trades['filter_01'] = grouped_trades.apply(lambda row: 'YES' if row['indicator_01_quantile'] >= a and row['indicator_01_quantile'] <= b and row['indicator_02_quantile'] >= c and row['indicator_02_quantile'] <= d and row['indicator_03_quantile'] >= e and row['indicator_03_quantile'] <= f else 'NO', axis=1)
  grouped_trades_filtered = grouped_trades.loc[(grouped_trades['filter_01']=='YES')]
  number_of_trades = grouped_trades_filtered.number_of_trades.sum()
  if number_of_trades < trades_threshold:
    continue
  else:
    number_of_winners = grouped_trades_filtered.number_of_winners.sum()
    net_pnl = grouped_trades_filtered.net_pnl.sum()
    output.append({'tech_indicator_01_min':a,'tech_indicator_01_max':b, 'tech_indicator_02_min':c,'tech_indicator_02_max':d, 'tech_indicator_03_min':e,'tech_indicator_03_max':f,'number_of_trades':number_of_trades,'number_of_winners':number_of_winners,'net_pnl':net_pnl})

output_df = pd.DataFrame(output)
print('输出数据框长度',len(output_df))
output_df['win_rate'] = output_df.number_of_winners/output_df.number_of_trades
final_df = output_df.copy() ### 如果以上的阈值比较
final_df.sort_values(by='net_pnl', ascending=False, inplace=True)
display(final_df)
checkpoint_02 = time.perf_counter()
print('总时间',checkpoint_02-checkpoint_01)

# 取final_df的第一行,然后取每个值并使用分位数
tech_indicator_01_min_quantile = final_df.iloc[0]['tech_indicator_01_min']
print(tech_indicator_01_min_quantile)
tech_indicator_01_max_quantile = final_df.iloc[0]['

<details>
<summary>英文:</summary>

Below is my code, which starts with 1,000,000 permutations, then eliminates permutations that don&#39;t make sense (for example, a row where indicator_01 is greater than 6 and is less than 4).
In this example, I am then working with 166,375 permutations.

The code is iterating through each of the 166,375 rows, and then doing a groupby to sum up the pnl (profit $), then produce an output which shows the best overall profit.

This process takes ~30 minutes.
I understand from reading many questions here, that utilizing numpy could potentially speed this up considerably.  I&#39;ve been working on this for a few days and getting nowhere, so I&#39;m posting my original code (iterating through each with lambda rows: and looking for the community&#39;s help.

###########################################################

Backtest Optimizer - Brute Force - Three indicators

###########################################################

import pandas as pd
import itertools
import time
import numpy as np

pd.set_option('display.max_rows', 15)
tz = timezone('America/Denver')

x = 'QQQ'
trades_threshold = 1000

print('Backtest Optimizer started at',datetime.now(tz).strftime('%H:%M:%S'))
checkpoint_01 = time.perf_counter()

filename = str('/content/'+x+'_summarized_trades.csv')
summarized_trades = pd.read_csv(filename)
print('Number of trades',len(summarized_trades))

##############################
list_of_indicators = ['rsi','std_pct','atr_5m/last_trade']
##############################

indicator_01 = list_of_indicators[0]
print('Indicator_01 is ',indicator_01)
indicator_02 = list_of_indicators1
print('Indicator_02 is ',indicator_02)
indicator_03 = list_of_indicators[2]
print('Indicator_03 is ',indicator_03)

summarized_trades['indicator_01_quantile'] = pd.qcut(summarized_trades[indicator_01].values, 10).codes
summarized_trades['indicator_02_quantile'] = pd.qcut(summarized_trades[indicator_02].values, 10).codes
summarized_trades['indicator_03_quantile'] = pd.qcut(summarized_trades[indicator_03].values, 10).codes
summarized_trades['indicator_01_quantile_min'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.left)
summarized_trades['indicator_01_quantile_max'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.right)
summarized_trades['indicator_02_quantile_min'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.left)
summarized_trades['indicator_02_quantile_max'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.right)
summarized_trades['indicator_03_quantile_min'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.left)
summarized_trades['indicator_03_quantile_max'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.right)

summarized_trades['number_of_trades'] = 1
summarized_trades['number_of_winners'] = summarized_trades.net_pnl.apply(lambda x: 1 if x >0 else 0)

Group into similar quantile buckets

grouped_trades = summarized_trades.groupby(by=['indicator_01_quantile','indicator_02_quantile','indicator_03_quantile']).agg({'net_pnl':'sum','number_of_trades':'sum','number_of_winners':'sum','indicator_01_quantile_min':'min','indicator_01_quantile_max':'max','indicator_02_quantile_min':'min','indicator_02_quantile_max':'max','indicator_03_quantile_min':'min','indicator_03_quantile_max':'max'})
grouped_trades = grouped_trades.reset_index() # Need to do this, otherwise unhashable
print('Length of grouped trades =',len(grouped_trades))
display(grouped_trades)

output = []
default_possibilities = [0,1,2,3,4,5,6,7,8,9] # Works for using 10 quantiles
possibilities = list(itertools.product(default_possibilities,default_possibilities,default_possibilities,default_possibilities,default_possibilities,default_possibilities)) # each of the three indicators can have 10x10 permutations
print('length of possibilities',len(possibilities))
possibilities = [item for item in possibilities if item1>=item[0] and item[3]>=item[2] and item[5]>=item[4]] #only keep combinations that make sense (e.g., tech indicator >2 and <4, but not >6 and <3)
print('length of revised possibilities',len(possibilities))

iterate through the combinations that make sense

for i in possibilities:
a = i[0]
b = i1
c = i[2]
d = i[3]
e = i[4]
f = i[5]

for each row, create new column to filter on, and if row is YES and has meets minimum trades_threshold, output to dataframe the results

UTILIZE NUMPY INSTEAD OF LAMBDA ROW????

grouped_trades['filter_01'] = grouped_trades.apply(lambda row: 'YES' if row['indicator_01_quantile'] >= a and row['indicator_01_quantile'] <= b and row['indicator_02_quantile'] >= c and row['indicator_02_quantile'] <= d and row['indicator_03_quantile'] >= e and row['indicator_03_quantile'] <= f else 'NO', axis=1)
grouped_trades_filtered = grouped_trades.loc[(grouped_trades['filter_01']=='YES')]
number_of_trades = grouped_trades_filtered.number_of_trades.sum()
if number_of_trades < trades_threshold:
continue
else:
number_of_winners = grouped_trades_filtered.number_of_winners.sum()
net_pnl = grouped_trades_filtered.net_pnl.sum()
output.append({'tech_indicator_01_min':a,'tech_indicator_01_max':b, 'tech_indicator_02_min':c,'tech_indicator_02_max':d, 'tech_indicator_03_min':e,'tech_indicator_03_max':f,'number_of_trades':number_of_trades,'number_of_winners':number_of_winners,'net_pnl':net_pnl})

output_df = pd.DataFrame(output)
print('Length of output',len(output_df))
output_df['win_rate'] = output_df.number_of_winners/output_df.number_of_trades
final_df = output_df.copy() ### if comparing with threshold above
final_df.sort_values(by='net_pnl', ascending=False, inplace=True)
display(final_df)
checkpoint_02 = time.perf_counter()
print('Total time',checkpoint_02-checkpoint_01)

Take row 1 of the final_df, and take each value and use quantile

tech_indicator_01_min_quantile = final_df.iloc[0]['tech_indicator_01_min']
print(tech_indicator_01_min_quantile)
tech_indicator_01_max_quantile = final_df.iloc[0]['tech_indicator_01_max']
print(tech_indicator_01_max_quantile)
indicator_01_min_df = grouped_trades.loc[(grouped_trades.indicator_01_quantile == tech_indicator_01_min_quantile)]
tech_indicator_01_min_value = indicator_01_min_df.iloc[0]['indicator_01_quantile_min']
print(tech_indicator_01_min_value)
indicator_01_max_df = grouped_trades.loc[(grouped_trades.indicator_01_quantile == tech_indicator_01_max_quantile)]
tech_indicator_01_max_value = indicator_01_max_df.iloc[0]['indicator_01_quantile_max']
print(tech_indicator_01_max_value)

tech_indicator_02_min_quantile = final_df.iloc[0]['tech_indicator_02_min']
print(tech_indicator_02_min_quantile)
tech_indicator_02_max_quantile = final_df.iloc[0]['tech_indicator_02_max']
print(tech_indicator_02_max_quantile)
indicator_02_min_df = grouped_trades.loc[(grouped_trades.indicator_02_quantile == tech_indicator_02_min_quantile)]
tech_indicator_02_min_value = indicator_02_min_df.iloc[0]['indicator_02_quantile_min']
print(tech_indicator_02_min_value)
indicator_02_max_df = grouped_trades.loc[(grouped_trades.indicator_02_quantile == tech_indicator_02_max_quantile)]
tech_indicator_02_max_value = indicator_02_max_df.iloc[0]['indicator_02_quantile_max']
print(tech_indicator_02_max_value)

tech_indicator_03_min_quantile = final_df.iloc[0]['tech_indicator_03_min']
print(tech_indicator_03_min_quantile)
tech_indicator_03_max_quantile = final_df.iloc[0]['tech_indicator_03_max']
print(tech_indicator_03_max_quantile)
indicator_03_min_df = grouped_trades.loc[(grouped_trades.indicator_03_quantile == tech_indicator_03_min_quantile)]
tech_indicator_03_min_value = indicator_03_min_df.iloc[0]['indicator_03_quantile_min']
print(tech_indicator_03_min_value)
indicator_03_max_df = grouped_trades.loc[(grouped_trades.indicator_03_quantile == tech_indicator_03_max_quantile)]
tech_indicator_03_max_value = indicator_03_max_df.iloc[0]['indicator_03_quantile_max']
print(tech_indicator_03_max_value)

print(indicator_01,'>=',tech_indicator_01_min_value,'and',indicator_01,'<=',tech_indicator_01_max_value,'and',indicator_02,'>=',tech_indicator_02_min_value,'and',indicator_02,'<=',tech_indicator_02_max_value,'and',indicator_03,'>=',tech_indicator_03_min_value,'and',indicator_03,'<=',tech_indicator_03_max_value)


Edit:
Here&#39;s an output using random data
[![enter image description here][1]][1]
[1]: https://i.stack.imgur.com/ajPLn.jpg
</details>
# 答案1
**得分**: 1
你可以使用 [numpy广播][1](计算两个数组的外积)来向量化你的操作。这是一个常见的模式,可以一次计算所有行的相同操作(如果内存足够的话...)。
你需要理解的是:
- `xxx[:, None]`:添加一个新的轴 `array([0, 1, 2])[:, None]` => `array([[0], [1], [2]])`
- `np.sum(..., axis=n)`:沿着一个轴计算和,而不是整个数组(降维)。
下面的代码可以简化,但会降低可读性。所以,用这段代码替换你的循环:
```python
# 转换为numpy
arr = np.array(possibilities)
lb, ub = arr[:, 0::2], arr[:, 1::2]
cols = ['indicator_01_quantile', 'indicator_02_quantile', 'indicator_03_quantile']
qtl = grouped_trades[cols].values
# 使用数字广播计算
m1 = np.all(qtl >= lb[:, None], axis=2)  # 下限
m2 = np.all(qtl <= ub[:, None], axis=2)  # 上限
m3 = np.all(m1 & m2, axis=1)
number_of_trades = np.sum(m3[:, None] * grouped_trades['number_of_trades'].values, axis=1)
m4 = number_of_trades >= trades_threshold
number_of_winners = np.sum(m4[:, None] * grouped_trades['number_of_winners'].values, axis=1)
net_pnl = np.sum(m4[:, None] * grouped_trades['net_pnl'].values, axis=1)
# 创建输出的DataFrame
cols = ['tech_indicator_01_min', 'tech_indicator_01_max', 'tech_indicator_02_min',
'tech_indicator_02_max', 'tech_indicator_03_min', 'tech_indicator_03_max']
df1 = pd.DataFrame(arr[m4], columns=cols)
df2 = pd.DataFrame({'number_of_trades': number_of_trades[m4],
'number_of_winners': number_of_winners[m4],
'net_pnl': net_pnl[m4]})
output_df = pd.concat([df1, df2], axis=1)
英文:

You can use numpy broadcasting (to compute the outer product of 2 arrays) to vectorize your operations. This is a common pattern to calculate the same operations for all rows at once (if you have enough memory...).

What you have to understand:

  • xxx[:, None]: add a new axis array([0, 1, 2])[:, None] => array([[0], [1], [2]])
  • np.sum(..., axis=n): compute the sum along an axis, not on whole array (dimension reduction).

The code below can be reduced at the expense of readability. So, replace your loop by this code:

# Convert to numpy
arr = np.array(possibilities)
lb, ub = arr[:, 0::2], arr[:, 1::2]
cols = [&#39;indicator_01_quantile&#39;, &#39;indicator_02_quantile&#39;, &#39;indicator_03_quantile&#39;]
qtl = grouped_trades[cols].values
# Compute with number broadcasting
m1 = np.all(qtl &gt;= lb[:, None], axis=2)  # lower bounds
m2 = np.all(qtl &lt;= ub[:, None], axis=2)  # upper bounds
m3 = np.all(m1 &amp; m2, axis=1)
number_of_trades = np.sum(m3[:, None] * grouped_trades[&#39;number_of_trades&#39;].values, axis=1)
m4 = number_of_trades &gt;= trades_threshold
number_of_winners = np.sum(m4[:, None] * grouped_trades[&#39;number_of_winners&#39;].values, axis=1)
net_pnl = np.sum(m4[:, None] * grouped_trades[&#39;net_pnl&#39;].values, axis=1)
# Create output dataframe
cols = [&#39;tech_indicator_01_min&#39;, &#39;tech_indicator_01_max&#39;, &#39;tech_indicator_02_min&#39;,
&#39;tech_indicator_02_max&#39;, &#39;tech_indicator_03_min&#39;, &#39;tech_indicator_03_max&#39;]
df1 = pd.DataFrame(arr[m4], columns=cols)
df2 = pd.DataFrame({&#39;number_of_trades&#39;: number_of_trades[m4],
&#39;number_of_winners&#39;: number_of_winners[m4],
&#39;net_pnl&#39;: net_pnl[m4]})
output_df = pd.concat([df1, df2], axis=1)

huangapple
  • 本文由 发表于 2023年6月5日 00:31:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76401392.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定