Speed up iteration over dataframe lambda row with numpy?

# Question
Below is my code, which starts with 1,000,000 permutations, then eliminates permutations that don't make sense (for example, a row where indicator_01 is greater than 6 and less than 4).

In this example, I am then working with 166,375 permutations.

The code iterates through each of the 166,375 rows, then does a groupby to sum up the pnl (profit $), and produces an output showing the best overall profit.

This process takes ~30 minutes.

I understand from reading many questions here that utilizing numpy could potentially speed this up considerably. I've been working on this for a few days and getting nowhere, so I'm posting my original code (iterating through each permutation with `lambda row:`) and looking for the community's help.

```python
###########################################################
### Backtest Optimizer - Brute Force - Three indicators ###
###########################################################
import pandas as pd
import itertools
import time
import numpy as np
from datetime import datetime
from pytz import timezone  # needed for timezone(); missing from the original imports

pd.set_option('display.max_rows', 15)
tz = timezone('America/Denver')

x = 'QQQ'
trades_threshold = 1000

print('Backtest Optimizer started at', datetime.now(tz).strftime('%H:%M:%S'))
checkpoint_01 = time.perf_counter()

filename = '/content/' + x + '_summarized_trades.csv'
summarized_trades = pd.read_csv(filename)
print('Number of trades', len(summarized_trades))

##############################
list_of_indicators = ['rsi', 'std_pct', 'atr_5m/last_trade']
##############################

indicator_01 = list_of_indicators[0]
print('Indicator_01 is', indicator_01)
indicator_02 = list_of_indicators[1]
print('Indicator_02 is', indicator_02)
indicator_03 = list_of_indicators[2]
print('Indicator_03 is', indicator_03)

summarized_trades['indicator_01_quantile'] = pd.qcut(summarized_trades[indicator_01].values, 10).codes
summarized_trades['indicator_02_quantile'] = pd.qcut(summarized_trades[indicator_02].values, 10).codes
summarized_trades['indicator_03_quantile'] = pd.qcut(summarized_trades[indicator_03].values, 10).codes
summarized_trades['indicator_01_quantile_min'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.left)
summarized_trades['indicator_01_quantile_max'] = pd.qcut(summarized_trades[indicator_01], 10).apply(lambda x: x.right)
summarized_trades['indicator_02_quantile_min'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.left)
summarized_trades['indicator_02_quantile_max'] = pd.qcut(summarized_trades[indicator_02], 10).apply(lambda x: x.right)
summarized_trades['indicator_03_quantile_min'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.left)
summarized_trades['indicator_03_quantile_max'] = pd.qcut(summarized_trades[indicator_03], 10).apply(lambda x: x.right)

summarized_trades['number_of_trades'] = 1
summarized_trades['number_of_winners'] = summarized_trades.net_pnl.apply(lambda x: 1 if x > 0 else 0)

### Group into similar quantile buckets
grouped_trades = summarized_trades.groupby(by=['indicator_01_quantile', 'indicator_02_quantile', 'indicator_03_quantile']).agg({'net_pnl': 'sum', 'number_of_trades': 'sum', 'number_of_winners': 'sum', 'indicator_01_quantile_min': 'min', 'indicator_01_quantile_max': 'max', 'indicator_02_quantile_min': 'min', 'indicator_02_quantile_max': 'max', 'indicator_03_quantile_min': 'min', 'indicator_03_quantile_max': 'max'})
grouped_trades = grouped_trades.reset_index()  # Need to do this, otherwise unhashable
print('Length of grouped trades =', len(grouped_trades))
display(grouped_trades)

output = []
default_possibilities = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Works for using 10 quantiles
possibilities = list(itertools.product(default_possibilities, repeat=6))  # a (min, max) pair of quantiles for each of the three indicators
print('length of possibilities', len(possibilities))
possibilities = [item for item in possibilities if item[1] >= item[0] and item[3] >= item[2] and item[5] >= item[4]]  # only keep combinations that make sense (e.g., tech indicator >2 and <4, but not >6 and <3)
print('length of revised possibilities', len(possibilities))

### iterate through the combinations that make sense
for i in possibilities:
    a = i[0]
    b = i[1]
    c = i[2]
    d = i[3]
    e = i[4]
    f = i[5]
    ### for each row, create a new column to filter on; if the row is YES and meets the minimum trades_threshold, output the results to the dataframe
    ### UTILIZE NUMPY INSTEAD OF LAMBDA ROW????
    grouped_trades['filter_01'] = grouped_trades.apply(lambda row: 'YES' if row['indicator_01_quantile'] >= a and row['indicator_01_quantile'] <= b and row['indicator_02_quantile'] >= c and row['indicator_02_quantile'] <= d and row['indicator_03_quantile'] >= e and row['indicator_03_quantile'] <= f else 'NO', axis=1)
    grouped_trades_filtered = grouped_trades.loc[grouped_trades['filter_01'] == 'YES']
    number_of_trades = grouped_trades_filtered.number_of_trades.sum()
    if number_of_trades < trades_threshold:
        continue
    else:
        number_of_winners = grouped_trades_filtered.number_of_winners.sum()
        net_pnl = grouped_trades_filtered.net_pnl.sum()
        output.append({'tech_indicator_01_min': a, 'tech_indicator_01_max': b, 'tech_indicator_02_min': c, 'tech_indicator_02_max': d, 'tech_indicator_03_min': e, 'tech_indicator_03_max': f, 'number_of_trades': number_of_trades, 'number_of_winners': number_of_winners, 'net_pnl': net_pnl})

output_df = pd.DataFrame(output)
print('Length of output', len(output_df))
output_df['win_rate'] = output_df.number_of_winners / output_df.number_of_trades
final_df = output_df.copy()  ### if comparing with threshold above
final_df.sort_values(by='net_pnl', ascending=False, inplace=True)
display(final_df)
checkpoint_02 = time.perf_counter()
print('Total time', checkpoint_02 - checkpoint_01)

### Take row 1 of the final_df, and take each value and look up its quantile bounds
tech_indicator_01_min_quantile = final_df.iloc[0]['tech_indicator_01_min']
print(tech_indicator_01_min_quantile)
tech_indicator_01_max_quantile = final_df.iloc[0]['tech_indicator_01_max']
print(tech_indicator_01_max_quantile)
indicator_01_min_df = grouped_trades.loc[grouped_trades.indicator_01_quantile == tech_indicator_01_min_quantile]
tech_indicator_01_min_value = indicator_01_min_df.iloc[0]['indicator_01_quantile_min']
print(tech_indicator_01_min_value)
indicator_01_max_df = grouped_trades.loc[grouped_trades.indicator_01_quantile == tech_indicator_01_max_quantile]
tech_indicator_01_max_value = indicator_01_max_df.iloc[0]['indicator_01_quantile_max']
print(tech_indicator_01_max_value)

tech_indicator_02_min_quantile = final_df.iloc[0]['tech_indicator_02_min']
print(tech_indicator_02_min_quantile)
tech_indicator_02_max_quantile = final_df.iloc[0]['tech_indicator_02_max']
print(tech_indicator_02_max_quantile)
indicator_02_min_df = grouped_trades.loc[grouped_trades.indicator_02_quantile == tech_indicator_02_min_quantile]
tech_indicator_02_min_value = indicator_02_min_df.iloc[0]['indicator_02_quantile_min']
print(tech_indicator_02_min_value)
indicator_02_max_df = grouped_trades.loc[grouped_trades.indicator_02_quantile == tech_indicator_02_max_quantile]
tech_indicator_02_max_value = indicator_02_max_df.iloc[0]['indicator_02_quantile_max']
print(tech_indicator_02_max_value)

tech_indicator_03_min_quantile = final_df.iloc[0]['tech_indicator_03_min']
print(tech_indicator_03_min_quantile)
tech_indicator_03_max_quantile = final_df.iloc[0]['tech_indicator_03_max']
print(tech_indicator_03_max_quantile)
indicator_03_min_df = grouped_trades.loc[grouped_trades.indicator_03_quantile == tech_indicator_03_min_quantile]
tech_indicator_03_min_value = indicator_03_min_df.iloc[0]['indicator_03_quantile_min']
print(tech_indicator_03_min_value)
indicator_03_max_df = grouped_trades.loc[grouped_trades.indicator_03_quantile == tech_indicator_03_max_quantile]
tech_indicator_03_max_value = indicator_03_max_df.iloc[0]['indicator_03_quantile_max']
print(tech_indicator_03_max_value)

print(indicator_01, '>=', tech_indicator_01_min_value, 'and', indicator_01, '<=', tech_indicator_01_max_value, 'and', indicator_02, '>=', tech_indicator_02_min_value, 'and', indicator_02, '<=', tech_indicator_02_max_value, 'and', indicator_03, '>=', tech_indicator_03_min_value, 'and', indicator_03, '<=', tech_indicator_03_max_value)
```
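As a quick illustration of the `UTILIZE NUMPY INSTEAD OF LAMBDA ROW????` idea above: even without restructuring the whole loop, the row-wise `apply` can be replaced by a single vectorized boolean mask. A minimal sketch of that swap, on a toy stand-in for `grouped_trades` (all values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for grouped_trades (hypothetical values, for illustration only)
grouped_trades = pd.DataFrame({
    'indicator_01_quantile': [0, 2, 5, 9],
    'indicator_02_quantile': [1, 3, 4, 8],
    'indicator_03_quantile': [2, 2, 7, 9],
    'number_of_trades':      [400, 700, 300, 100],
})

a, b, c, d, e, f = 0, 5, 0, 4, 0, 9  # one permutation: (min, max) per indicator

# Vectorized replacement for the row-wise apply: one boolean mask, no Python-level loop
mask = (grouped_trades['indicator_01_quantile'].between(a, b)
        & grouped_trades['indicator_02_quantile'].between(c, d)
        & grouped_trades['indicator_03_quantile'].between(e, f))

# Sum only the rows that fall inside all three quantile ranges
number_of_trades = grouped_trades.loc[mask, 'number_of_trades'].sum()
print(number_of_trades)  # 400 + 700 + 300 = 1400 (last row fails the first range)
```

`Series.between` is inclusive on both ends by default, matching the `>= a and <= b` logic of the original lambda.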

Edit: Here's an output using random data

[![enter image description here][1]][1]

[1]: https://i.stack.imgur.com/ajPLn.jpg
# Answer 1

**Score:** 1
You can use numpy broadcasting (to compute the outer product of two arrays) to vectorize your operations. This is a common pattern for applying the same operation to all rows at once (if you have enough memory...).

What you have to understand:

- `xxx[:, None]`: adds a new axis: `array([0, 1, 2])[:, None]` => `array([[0], [1], [2]])`
- `np.sum(..., axis=n)`: computes the sum along one axis rather than over the whole array (dimension reduction).
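A tiny, self-contained illustration of those two building blocks (the values are arbitrary):

```python
import numpy as np

bounds = np.array([0, 1, 2])   # shape (3,)
col = bounds[:, None]          # shape (3, 1): a new axis was added
print(col.shape)               # (3, 1)

vals = np.array([0, 1, 2, 3])  # shape (4,)
# Broadcasting (3, 1) against (4,) yields a (3, 4) "outer" comparison grid
grid = vals >= col
print(grid.astype(int))

# Summing along axis=1 reduces each row to a single count
print(np.sum(grid, axis=1))    # how many entries of vals clear each bound: [4 3 2]
```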

The code below could be condensed further, at the expense of readability. So, replace your loop with this code:

```python
# Convert to numpy
arr = np.array(possibilities)
lb, ub = arr[:, 0::2], arr[:, 1::2]  # lower/upper bounds, each of shape (P, 3)
cols = ['indicator_01_quantile', 'indicator_02_quantile', 'indicator_03_quantile']
qtl = grouped_trades[cols].values    # quantile codes per group, shape (G, 3)

# Compute with numpy broadcasting: (P, 1, 3) against (G, 3) -> (P, G, 3)
m1 = np.all(qtl >= lb[:, None], axis=2)  # lower bounds satisfied, shape (P, G)
m2 = np.all(qtl <= ub[:, None], axis=2)  # upper bounds satisfied, shape (P, G)
m3 = m1 & m2                             # groups selected by each permutation

number_of_trades = np.sum(m3 * grouped_trades['number_of_trades'].values, axis=1)
m4 = number_of_trades >= trades_threshold  # permutations meeting the minimum-trades threshold
number_of_winners = np.sum(m3 * grouped_trades['number_of_winners'].values, axis=1)
net_pnl = np.sum(m3 * grouped_trades['net_pnl'].values, axis=1)

# Create output dataframe, keeping only the permutations that pass the threshold
cols = ['tech_indicator_01_min', 'tech_indicator_01_max', 'tech_indicator_02_min',
        'tech_indicator_02_max', 'tech_indicator_03_min', 'tech_indicator_03_max']
df1 = pd.DataFrame(arr[m4], columns=cols)
df2 = pd.DataFrame({'number_of_trades': number_of_trades[m4],
                    'number_of_winners': number_of_winners[m4],
                    'net_pnl': net_pnl[m4]})
output_df = pd.concat([df1, df2], axis=1)
```
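A useful habit with a rewrite like this is to check the broadcast computation against the naive loop on small random data. A minimal, self-contained sketch of that check (array names and sizes here are illustrative, not taken from the code above):

```python
import numpy as np

rng = np.random.default_rng(0)
P, G = 20, 15                              # a few permutations, a few groups
lb = rng.integers(0, 5, size=(P, 3))       # lower bounds per permutation
ub = lb + rng.integers(0, 5, size=(P, 3))  # upper bounds (>= lower by construction)
qtl = rng.integers(0, 10, size=(G, 3))     # quantile codes per group
trades = rng.integers(1, 100, size=G)      # trades per group

# Broadcast version: (P, 1, 3) vs (G, 3) -> (P, G, 3) -> (P, G)
m = np.all((qtl >= lb[:, None]) & (qtl <= ub[:, None]), axis=2)
fast = (m * trades).sum(axis=1)            # trades captured by each permutation

# Naive loop for comparison
slow = np.array([
    trades[np.all((qtl >= lo) & (qtl <= hi), axis=1)].sum()
    for lo, hi in zip(lb, ub)
])

assert np.array_equal(fast, slow)
print('vectorized and loop results match')
```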

huangapple
- Posted on 2023-06-05 00:31:40
- Please keep this link when reposting: https://go.coder-hub.com/76401392.html