Fast data.table joins with time series

Question


I have two data sets with currency price data (High, Low, Open and Close) on two different timeframes (1 hour and 5 minutes).

I have price targets for each row of the 1 Hour timeframe data, and a target direction for the trade (either 1 or -1).

For trades with a direction of '1', I'm trying to find the first point on the 5 minute chart where the "low" is <= the target level. Conversely, for trades with a direction of '-1', I want to find the first point on the 5-minute chart where the "high" is >= the target trade level.

I've put the code below to demonstrate what I'm after.

The problem I'm having in practice is that I'm looking at time periods spanning 10 - 20 years, which makes the joins very slow.
I've set the example below to be from 2016 to 2022 so it's not too slow on my computer, but if you extend it out to more than 10 years it starts to really become a pain.

This may be something that I have to live with, but I'm looking for guidance on two things:

  1. Is there a faster / more efficient way of achieving what I've outlined below?
  2. You can see that I've separated the buy and sell trades into two joins at the end. Is there a way to combine them into one single join? It's not really that important as it doesn't take up too much 'space' in my code, but I'm interested for educational purposes. (A sketch of one possible combined join follows the example code below.)

Thanks heaps in advance,
Phil

# Load required packages
library(data.table)
library(dplyr)

# Define timeframes for the data
Start <- as.POSIXct("2016-01-01 00:00:00")
End <- as.POSIXct("2022-01-01 23:55:00")

Hours <- floor(as.numeric(difftime(End,Start,units = "hours"))) + 1
Minutes <- floor(as.numeric(difftime(End,Start,units = "mins")) / 5) + 1

# Create the Hourly data
set.seed(123)
hourly_prices <- data.table(
  datetime = seq(Start, End, by = "hour"),
  open = rnorm(Hours, mean = 100, sd = 1),
  high = rnorm(Hours, mean = 101, sd = 1),
  low = rnorm(Hours, mean = 99, sd = 1),
  close = rnorm(Hours, mean = 100, sd = 1),
  Direction = sample(c(1,-1),Hours,replace = T)) %>%
  .[,Target_price := ifelse(Direction == -1,rnorm(.N, mean = 104, sd = 1),rnorm(.N,mean = 97,sd = 1))]

# Create the 5-minute data
set.seed(456)
minute_prices <- data.table(
  datetime = seq(Start, End, by = "5 min"),
  open = rnorm(Minutes, mean = 100, sd = 1),
  high = rnorm(Minutes, mean = 101, sd = 1),
  low = rnorm(Minutes, mean = 99, sd = 1),
  close = rnorm(Minutes, mean = 100, sd = 1),
  Position = seq_len(Minutes))

# Join the two data.tables to find the first point at which price passes the target levels
hourly_prices[(Direction == 1),Location := minute_prices[.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
hourly_prices[(Direction == -1),Location := minute_prices[.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
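
For question 2 above, here is a minimal sketch of one way the buy and sell trades could be folded into a single non-equi join: reshape the 5-minute data into a long table with one candidate price per direction, flipping the sign on the "high" side so that both directions reduce to the same "<=" comparison. The names mlong, cmp_price, cmp_target and Location2 are invented for illustration; the long table has twice the rows, so this only combines the joins rather than speeding them up, and the result is worth verifying against the two-join Location with identical().

# Sketch only: one combined non-equi join covering both trade directions.
# Direction ==  1 needs  low  <= Target_price
# Direction == -1 needs  high >= Target_price, i.e. -high <= -Target_price,
# so negating the "high" side lets a single "<=" condition handle both cases.
mlong <- rbind(
  minute_prices[, .(Direction =  1, datetime, Position, cmp_price = low)],
  minute_prices[, .(Direction = -1, datetime, Position, cmp_price = -high)]
)

# Matching signed target on the hourly side (helper column for the sketch only)
hourly_prices[, cmp_target := ifelse(Direction == 1, Target_price, -Target_price)]

# One join: equality on Direction plus the same non-equi conditions as before
hourly_prices[, Location2 := mlong[.SD,
  on = .(Direction, datetime > datetime, cmp_price <= cmp_target),
  mult = "first", x.Position]]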

Answer 1

Score: 2

One simple ~3x speed-up would be to subset the minute_prices data.table to include only those rows that are the cumulative min/max for the current hour (the hour is taken as datetime minus 1 second, so that with the strict datetime > datetime join each hourly row sees only whole hour groups). Within a whole group, the first 5-minute bar whose low reaches a downward target must itself set a new running minimum for that hour, otherwise an earlier bar in the same hour would already have reached the target; the same holds for highs, so dropping the remaining rows cannot change which bar is matched first:

dtM <- copy(minute_prices)
dtH <- copy(hourly_prices)

system.time({
  dtH[(Direction == 1),Location := dtM[dtM[, low == cummin(low), as.integer(datetime - 1)%/%3600L][[2]]][.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
  dtH[(Direction == -1),Location := dtM[dtM[, high == cummax(high), as.integer(datetime - 1)%/%3600L][[2]]][.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
})
#>    user  system elapsed 
#>    6.77    0.01    6.81

system.time({
  hourly_prices[(Direction == 1),Location := minute_prices[.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
  hourly_prices[(Direction == -1),Location := minute_prices[.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
})
#>    user  system elapsed 
#>   22.97    0.00   23.04

identical(dtH, hourly_prices)
#> [1] TRUE
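
As a rough check of where the speed-up comes from, the snippet below (reusing dtM from above; kept_low and kept_high are names made up here) shows what share of the 5-minute rows survives the cumulative-min/max filter and therefore still takes part in each join; on simulated data like this most rows drop out:

# Share of 5-minute rows kept after filtering to per-hour running minima/maxima
kept_low  <- dtM[dtM[, low  == cummin(low),  as.integer(datetime - 1) %/% 3600L][[2]]]
kept_high <- dtM[dtM[, high == cummax(high), as.integer(datetime - 1) %/% 3600L][[2]]]
nrow(kept_low)  / nrow(dtM)   # candidates left for the Direction ==  1 join
nrow(kept_high) / nrow(dtM)   # candidates left for the Direction == -1 join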
