Improving performance while looping over data.table with set

Question

I wonder whether there is a better way to write the following code to improve its performance.

My real data set has 120k IDs with 25 rows each.

I would like to apply an exponential forecast row-wise, where each row's value depends on the previous row of the same ID:

    library(data.table)

    # dummy data set
    dt <- data.table(
      ID = rep(c("A", "B"), each = 5),
      Value = abs(round(rnorm(10) * 10))
    )

    # initialize the column; the value for the first row of each ID stays 0
    dt[, SES := 0]

    # split by ID into a list to loop over with lapply
    dt <- split(dt, dt$ID)

    # function to loop with
    alpha <- 0.3
    loop_function <- function(x) {
      for (i in 2L:5L) {
        set(x, i, "SES", round(x[i, alpha * Value] + x[i - 1L, (1L - alpha) * SES], 0))
      }
      return(x)
    }

    # apply the function to the list elements and bind the result
    dt <- lapply(dt, loop_function)
    dt <- rbindlist(dt)
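
For reference, the recurrence that the loop computes for a single ID can be traced in base R on a small vector. This is only an illustrative sketch: the `value` vector below is made up, and `ses` is a hypothetical name for the resulting column.

```r
alpha <- 0.3
value <- c(10, 4, 7, 12, 3)    # made-up values for one ID (assumption)

ses <- numeric(length(value))  # ses[1] stays 0, as in the question
for (i in seq_along(value)[-1L]) {
  # new value = alpha * current Value + (1 - alpha) * previous SES, rounded
  ses[i] <- round(alpha * value[i] + (1 - alpha) * ses[i - 1L], 0)
}
print(ses)
# [1] 0 1 3 6 5
```

Each step depends on the rounded result of the previous step, which is why the computation cannot simply be replaced by a single vectorized expression over all rows.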

Answer 1

Score: 2

This should be much faster:

    library(data.table)

    # dummy data set
    dt <- data.table(
      ID = rep(c("A", "B"), each = 5),
      Value = abs(round(rnorm(10) * 10))
    )

    # smoothing factor, as defined in the question
    alpha <- 0.3

    # initialize the column; the first row of each ID keeps the value 0
    dt[, SES := 0]

    # create a within-ID row index and iterate over it
    dt[, idx := rowid(ID)]
    for (i in 2:max(dt$idx)) {
      # SES of the previous row of every ID (assumes all IDs have the same number of rows)
      prev <- dt[idx == (i - 1L), SES]
      dt[idx == i, SES := round(alpha * Value + (1 - alpha) * prev, 0)]
    }

In the end this is pretty similar to your idea: it still iterates over the row index (2 to 5 in the example), but each iteration updates that row position for all IDs at once, in an optimised data.table way. Hope this helps.
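
One way to gain confidence in the row-index loop is to compare it with a per-ID computation on a larger dummy table. The sketch below is not part of the original answer: the grouped `Reduce()` variant, the helper `ses_step`, the column `SES_grouped`, and the sizes `n_ids` and `n_rows` are all assumptions introduced for illustration. It also assumes rows are stored contiguously by ID and that every ID has the same number of rows.

```r
library(data.table)

set.seed(1)      # fixed seed so the comparison is reproducible
alpha  <- 0.3
n_ids  <- 1000L  # hypothetical size, far smaller than the real 120k IDs
n_rows <- 25L

dt <- data.table(
  ID    = rep(seq_len(n_ids), each = n_rows),
  Value = abs(round(rnorm(n_ids * n_rows) * 10))
)

# Approach from the answer: one vectorized update per row position
dt[, SES := 0]
dt[, idx := rowid(ID)]
for (i in 2:max(dt$idx)) {
  prev <- dt[idx == (i - 1L), SES]
  dt[idx == i, SES := round(alpha * Value + (1 - alpha) * prev, 0)]
}

# Alternative sketch: run the same recurrence within each ID via Reduce()
ses_step <- function(prev, value) round(alpha * value + (1 - alpha) * prev, 0)
dt[, SES_grouped := Reduce(ses_step, Value[-1L], init = 0, accumulate = TRUE),
   by = ID]

# Errors if the two approaches disagree
stopifnot(isTRUE(all.equal(dt$SES, dt$SES_grouped)))
```

The grouped variant calls an R function once per row, so it is likely to be slower than the row-index loop on the full data set; it is meant here as a correctness check rather than a faster replacement.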

huangapple
  • Posted on May 28, 2023 at 13:13:10
  • If you repost this article, please keep the link to it: https://go.coder-hub.com/76350031.html