Speed up iteration over dataframe lambda row with numpy?

Question

Below is my code, which starts with 1,000,000 permutations and then eliminates the permutations that don't make sense (for example, a row where indicator_01 is greater than 6 and less than 4). In this example, that leaves 166,375 permutations.

The code iterates through each of the 166,375 permutations, filters the grouped trades, and sums up the PnL (profit $) to produce an output that shows the best overall profit. This process takes ~30 minutes.

I understand from reading many questions here that using numpy could potentially speed this up considerably. I've been working on this for a few days and getting nowhere, so I'm posting my original code (which iterates through each permutation with `lambda row:`) and looking for the community's help.
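As a quick sanity check on those counts (this snippet is my addition, not part of the original post), the number of valid (min, max) quantile pairs per indicator and the resulting permutation count can be reproduced directly:

```python
import itertools

levels = range(10)  # 10 quantile buckets, coded 0-9

# Ordered (min, max) pairs where max >= min: 55 of the 100 possible pairs survive
pairs_per_indicator = sum(1 for lo, hi in itertools.product(levels, repeat=2) if hi >= lo)

print(10 ** 6)                   # 1,000,000 unfiltered permutations across three indicators
print(pairs_per_indicator)       # 55
print(pairs_per_indicator ** 3)  # 166,375 permutations that "make sense"
```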
```python
###########################################################
### Backtest Optimizer - Brute Force - Three indicators ###
###########################################################
import pandas as pd
import itertools
import time
import numpy as np
from datetime import datetime   # needed for datetime.now() below
from pytz import timezone       # needed for timezone('America/Denver') below

pd.set_option('display.max_rows', 15)
tz = timezone('America/Denver')
x = 'QQQ'
trades_threshold = 1000
print('Backtest Optimizer started at', datetime.now(tz).strftime('%H:%M:%S'))
checkpoint_01 = time.perf_counter()
filename = '/content/' + x + '_summarized_trades.csv'
summarized_trades = pd.read_csv(filename)
print('Number of trades', len(summarized_trades))

##############################
list_of_indicators = ['rsi', 'std_pct', 'atr_5m/last_trade']
##############################
indicator_01 = list_of_indicators[0]
print('Indicator_01 is', indicator_01)
indicator_02 = list_of_indicators[1]
print('Indicator_02 is', indicator_02)
indicator_03 = list_of_indicators[2]
print('Indicator_03 is', indicator_03)

summarized_trades['indicator_01_quantile'] = pd.qcut(summarized_trades[indicator_01].values, 10).codes
summarized_trades['indicator_02_quantile'] = pd.qcut(summarized_trades[indicator_02].values, 10).codes
summarized_trades['indicator_03_quantile'] = pd.qcut(summarized_trades[indicator_03].values, 10).codes
summarized_trades['indicator_01_quantile_min'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.left)
summarized_trades['indicator_01_quantile_max'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.right)
summarized_trades['indicator_02_quantile_min'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.left)
summarized_trades['indicator_02_quantile_max'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.right)
summarized_trades['indicator_03_quantile_min'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.left)
summarized_trades['indicator_03_quantile_max'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.right)
summarized_trades['number_of_trades'] = 1
summarized_trades['number_of_winners'] = summarized_trades.net_pnl.apply(lambda x: 1 if x > 0 else 0)

### Group into similar quantile buckets
grouped_trades = summarized_trades.groupby(
    by=['indicator_01_quantile', 'indicator_02_quantile', 'indicator_03_quantile']
).agg({'net_pnl': 'sum', 'number_of_trades': 'sum', 'number_of_winners': 'sum',
       'indicator_01_quantile_min': 'min', 'indicator_01_quantile_max': 'max',
       'indicator_02_quantile_min': 'min', 'indicator_02_quantile_max': 'max',
       'indicator_03_quantile_min': 'min', 'indicator_03_quantile_max': 'max'})
grouped_trades = grouped_trades.reset_index()  # Need to do this, otherwise unhashable
print('Length of grouped trades =', len(grouped_trades))
display(grouped_trades)

output = []
default_possibilities = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Works when using 10 quantiles
possibilities = list(itertools.product(default_possibilities, default_possibilities, default_possibilities,
                                       default_possibilities, default_possibilities, default_possibilities))  # each of the three indicators can have 10x10 permutations
print('Length of possibilities', len(possibilities))
possibilities = [item for item in possibilities if item[1] >= item[0] and item[3] >= item[2] and item[5] >= item[4]]  # only keep combinations that make sense (e.g., tech indicator > 2 and < 4, but not > 6 and < 3)
print('Length of revised possibilities', len(possibilities))

### Iterate through the combinations that make sense
for i in possibilities:
    a = i[0]
    b = i[1]
    c = i[2]
    d = i[3]
    e = i[4]
    f = i[5]
    ### For each row, create a new column to filter on; if the row is YES and meets the minimum trades_threshold, output the results to a dataframe
    ### USE NUMPY INSTEAD OF LAMBDA ROW????
    grouped_trades['filter_01'] = grouped_trades.apply(
        lambda row: 'YES' if row['indicator_01_quantile'] >= a and row['indicator_01_quantile'] <= b
        and row['indicator_02_quantile'] >= c and row['indicator_02_quantile'] <= d
        and row['indicator_03_quantile'] >= e and row['indicator_03_quantile'] <= f
        else 'NO', axis=1)
    grouped_trades_filtered = grouped_trades.loc[(grouped_trades['filter_01'] == 'YES')]
    number_of_trades = grouped_trades_filtered.number_of_trades.sum()
    if number_of_trades < trades_threshold:
        continue
    else:
        number_of_winners = grouped_trades_filtered.number_of_winners.sum()
        net_pnl = grouped_trades_filtered.net_pnl.sum()
        output.append({'tech_indicator_01_min': a, 'tech_indicator_01_max': b,
                       'tech_indicator_02_min': c, 'tech_indicator_02_max': d,
                       'tech_indicator_03_min': e, 'tech_indicator_03_max': f,
                       'number_of_trades': number_of_trades, 'number_of_winners': number_of_winners,
                       'net_pnl': net_pnl})

output_df = pd.DataFrame(output)
print('Length of output', len(output_df))
output_df['win_rate'] = output_df.number_of_winners / output_df.number_of_trades
final_df = output_df.copy()  ### if comparing with the threshold above
final_df.sort_values(by='net_pnl', ascending=False, inplace=True)
display(final_df)
checkpoint_02 = time.perf_counter()
print('Total time', checkpoint_02 - checkpoint_01)

# Take row 1 of final_df, then take each value and use its quantile
tech_indicator_01_min_quantile = final_df.iloc[0]['tech_indicator_01_min']
print(tech_indicator_01_min_quantile)
tech_indicator_01_max_quantile = final_df.iloc[0]['tech_indicator_01_max']
print(tech_indicator_01_max_quantile)
indicator_01_min_df = grouped_trades.loc[(grouped_trades.indicator_01_quantile == tech_indicator_01_min_quantile)]
tech_indicator_01_min_value = indicator_01_min_df.iloc[0]['indicator_01_quantile_min']
print(tech_indicator_01_min_value)
indicator_01_max_df = grouped_trades.loc[(grouped_trades.indicator_01_quantile == tech_indicator_01_max_quantile)]
tech_indicator_01_max_value = indicator_01_max_df.iloc[0]['indicator_01_quantile_max']
print(tech_indicator_01_max_value)
tech_indicator_02_min_quantile = final_df.iloc[0]['tech_indicator_02_min']
print(tech_indicator_02_min_quantile)
tech_indicator_02_max_quantile = final_df.iloc[0]['tech_indicator_02_max']
print(tech_indicator_02_max_quantile)
indicator_02_min_df = grouped_trades.loc[(grouped_trades.indicator_02_quantile == tech_indicator_02_min_quantile)]
tech_indicator_02_min_value = indicator_02_min_df.iloc[0]['indicator_02_quantile_min']
print(tech_indicator_02_min_value)
indicator_02_max_df = grouped_trades.loc[(grouped_trades.indicator_02_quantile == tech_indicator_02_max_quantile)]
tech_indicator_02_max_value = indicator_02_max_df.iloc[0]['indicator_02_quantile_max']
print(tech_indicator_02_max_value)
tech_indicator_03_min_quantile = final_df.iloc[0]['tech_indicator_03_min']
print(tech_indicator_03_min_quantile)
tech_indicator_03_max_quantile = final_df.iloc[0]['tech_indicator_03_max']
print(tech_indicator_03_max_quantile)
indicator_03_min_df = grouped_trades.loc[(grouped_trades.indicator_03_quantile == tech_indicator_03_min_quantile)]
tech_indicator_03_min_value = indicator_03_min_df.iloc[0]['indicator_03_quantile_min']
print(tech_indicator_03_min_value)
indicator_03_max_df = grouped_trades.loc[(grouped_trades.indicator_03_quantile == tech_indicator_03_max_quantile)]
tech_indicator_03_max_value = indicator_03_max_df.iloc[0]['indicator_03_quantile_max']
print(tech_indicator_03_max_value)
print(indicator_01,'>=',tech_indicator_01_min_value,'and',indicator_01,'<=',tech_indicator_01_max_value,'and',indicator_02,'>=',tech_indicator_02_min_value,'and',indicator_02,'<=',tech_indicator_02_max_value,'and',indicator_03,'>=',tech_indicator_03_min_value,'and',indicator_03,'<=',tech_indicator_03_max_value)
```

Edit:

Here's an output using random data:

[![enter image description here][1]][1]

[1]: https://i.stack.imgur.com/ajPLn.jpg
# Answer 1
**Score**: 1
You can use numpy broadcasting (to compute the outer product of two arrays) to vectorize your operations. This is a common pattern for computing the same operation on all rows at once (provided you have enough memory...).

What you have to understand:

- `xxx[:, None]`: adds a new axis; `array([0, 1, 2])[:, None]` => `array([[0], [1], [2]])`
- `np.sum(..., axis=n)`: computes the sum along one axis rather than over the whole array (dimension reduction).
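For illustration (this small demo is mine, not part of the original answer), the `[:, None]` trick is what turns an elementwise comparison into an "every possibility against every row" comparison:

```python
import numpy as np

quantiles = np.array([0, 3, 7])     # e.g. one quantile column, shape (3,)
lower_bounds = np.array([2, 5])     # e.g. two candidate lower bounds, shape (2,)

# lower_bounds[:, None] has shape (2, 1); comparing against shape (3,) broadcasts to (2, 3)
mask = quantiles >= lower_bounds[:, None]
print(mask)
# [[False  True  True]
#  [False False  True]]
```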
The code below could be condensed further, but at the expense of readability. So, replace your loop with this code:
```python
# Convert to numpy
arr = np.array(possibilities)                      # shape (P, 6): one row per (a, b, c, d, e, f)
lb, ub = arr[:, 0::2], arr[:, 1::2]                # shape (P, 3): lower / upper quantile bounds
cols = ['indicator_01_quantile', 'indicator_02_quantile', 'indicator_03_quantile']
qtl = grouped_trades[cols].values                  # shape (G, 3): one row per grouped bucket

# Compute with numpy broadcasting: (P, 1, 3) vs (G, 3) -> (P, G, 3) -> (P, G)
m1 = np.all(qtl >= lb[:, None], axis=2)            # lower bounds satisfied
m2 = np.all(qtl <= ub[:, None], axis=2)            # upper bounds satisfied
m3 = m1 & m2                                       # (P, G): bucket g falls inside possibility p's ranges

# Per-possibility sums over the matching buckets
number_of_trades = np.sum(m3 * grouped_trades['number_of_trades'].values, axis=1)
m4 = number_of_trades >= trades_threshold          # keep only possibilities meeting the trades threshold
number_of_winners = np.sum(m3 * grouped_trades['number_of_winners'].values, axis=1)
net_pnl = np.sum(m3 * grouped_trades['net_pnl'].values, axis=1)

# Create the output dataframe
cols = ['tech_indicator_01_min', 'tech_indicator_01_max', 'tech_indicator_02_min',
        'tech_indicator_02_max', 'tech_indicator_03_min', 'tech_indicator_03_max']
df1 = pd.DataFrame(arr[m4], columns=cols)
df2 = pd.DataFrame({'number_of_trades': number_of_trades[m4],
                    'number_of_winners': number_of_winners[m4],
                    'net_pnl': net_pnl[m4]})
output_df = pd.concat([df1, df2], axis=1)
```
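A small follow-up (my note, not part of the original answer): `m3` is just a `(P, G)` match matrix, and the three totals are all row-wise sums over it, so they can also be computed in one shot as a matrix product. A minimal sketch reusing the variables from the block above (note that the counts come back as floats here):

```python
# One matrix product instead of three masked sums; m3 is the (P, G) boolean match matrix
vals = grouped_trades[['number_of_trades', 'number_of_winners', 'net_pnl']].values  # shape (G, 3)
sums = m3.astype(np.float64) @ vals          # shape (P, 3): totals over the buckets each possibility matches
number_of_trades, number_of_winners, net_pnl = sums[:, 0], sums[:, 1], sums[:, 2]
m4 = number_of_trades >= trades_threshold    # same threshold filter as before
```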