How to best optimize my R code and avoid looping

Question

I have a piece of code that lives inside an optimization routine. Sample objects for running it are at the bottom of this post.

When all elements of the matrix `X` are observed, the calculation is very efficient and can be written as `res1` below. The object `res2` produces the same result as `res1`, but it loops over rows, which is slow in R.

    ### If everything is observed
    res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
    res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5, ]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5, ]), na.rm = TRUE)) %*% wts)
    all.equal(res1[, 1], res2)

Now, the problem is that in my real-world scenario there will often be missing values in the matrix `X`. In that case the calculation for `res1` would yield an `NA` for its first element, as shown in this new example (for obvious reasons; this is not my question). The object created by `res2` gives exactly what I need in this instance, but it falls back to a loop: correct in terms of the result I want, but computationally undesirable.

    ### This would not work, as expected.
    res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
    res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5, ]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5, ]), na.rm = TRUE)) %*% wts)
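
The `NA` in `res1` comes from how `%*%` treats missing values: every entry of the product whose row contains an `NA` becomes `NA`, because the inner sum has no `na.rm` option. A minimal sketch with toy matrices (`A`, `B`, and `P` are illustrative names, not the sample objects from this post):

```r
# Toy matrices, assumed for illustration only
A <- matrix(c(1, NA, 0, 1), nrow = 2)  # row 1 = (1, 0), row 2 = (NA, 1)
B <- matrix(1:4, nrow = 2)

P <- A %*% B

# Row 2 of A contains an NA, so every entry in row 2 of P is NA;
# row 1 of A is fully observed, so row 1 of P is unaffected.
print(P)
```

This is why a single missing item in a row of `X` is enough to make the corresponding element of `res1` missing.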

My question is whether anyone knows a way to produce the same result as `res2` when `X` contains missing data, as the `sapply()` method does, while being as efficient as the single large matrix calculation.

I see two options, both of which I am exploring. One is to use parallel processing for the loop; the other is to use Rcpp. Both are decent options. However, before going down either path, I would like to know whether anyone sees a neat computational implementation that I am missing.

### Objects to run sample code

X <- structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L, 
5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4", 
"Item 5")))

pr.t <- structure(c(0.000389840525419771, 0.000389840525419771, 0.000389840525419771, 
0.000389840525419771, 0.000389840525419771, 0.00116782384335194, 
0.00116782384335194, 0.00116782384335194, 0.00116782384335194, 
0.00116782384335194, 0.00293127561410344, 0.00293127561410344, 
0.00293127561410344, 0.00293127561410344, 0.00293127561410344, 
0.00672641421586068, 0.00672641421586068, 0.00672641421586068, 
0.00672641421586068, 0.00672641421586068, 0.0145666908055583, 
0.0145666908055583, 0.0145666908055583, 0.0145666908055583, 0.0145666908055583, 
0.0301824687604691, 0.0301824687604691, 0.0301824687604691, 0.0301824687604691, 
0.0301824687604691, 0.0600531695657659, 0.0600531695657659, 0.0600531695657659, 
0.0600531695657659, 0.0600531695657659, 0.114143103288218, 0.114143103288218, 
0.114143103288218, 0.114143103288218, 0.114143103288218, 0.204278364784018, 
0.204278364784018, 0.204278364784018, 0.204278364784018, 0.204278364784018, 
0.336697623276164, 0.336697623276164, 0.336697623276164, 0.336697623276164, 
0.336697623276164, 0.5, 0.5, 0.5, 0.5, 0.5, 0.663302376723836, 
0.663302376723836, 0.663302376723836, 0.663302376723836, 0.663302376723836, 
0.795721635215982, 0.795721635215982, 0.795721635215982, 0.795721635215982, 
0.795721635215982, 0.885856896711782, 0.885856896711782, 0.885856896711782, 
0.885856896711782, 0.885856896711782, 0.939946830434234, 0.939946830434234, 
0.939946830434234, 0.939946830434234, 0.939946830434234, 0.969817531239531, 
0.969817531239531, 0.969817531239531, 0.969817531239531, 0.969817531239531, 
0.985433309194442, 0.985433309194442, 0.985433309194442, 0.985433309194442, 
0.985433309194442, 0.993273585784139, 0.993273585784139, 0.993273585784139, 
0.993273585784139, 0.993273585784139, 0.997068724385897, 0.997068724385897, 
0.997068724385897, 0.997068724385897, 0.997068724385897, 0.998832176156648, 
0.998832176156648, 0.998832176156648, 0.998832176156648, 0.998832176156648, 
0.99961015947458, 0.99961015947458, 0.99961015947458, 0.99961015947458, 
0.99961015947458), dim = c(5L, 21L))

wts <- c(2.09899121956567e-14, 4.97536860412164e-11, 1.45066128449311e-08, 
1.22535483614825e-06, 4.21923474255167e-05, 0.000708047795481538, 
0.00643969705140876, 0.033952729786543, 0.108392285626419, 0.21533371569506, 
0.270260183572876, 0.21533371569506, 0.10839228562642, 0.0339527297865429, 
0.00643969705140878, 0.000708047795481537, 4.21923474255168e-05, 
1.22535483614826e-06, 1.45066128449309e-08, 4.97536860412161e-11, 
2.09899121956567e-14)

mX <- 1 - X

Answer 1

Score: 3

Note that `res2` uses `colSums(., na.rm = TRUE)`, which in this case is equivalent to setting the missing values to 0. Therefore, we can do the same in `res1`:

    library(tidyr)

    # Use the following X with 1 missing value
    X <- structure(c(0L, 0L, NA_real_, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                     0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L),
                   dim = c(5L, 5L),
                   dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5")))
    mX <- 1 - X  # recompute so that mX is NA wherever X is NA

    res1 <- exp(replace_na(X, 0) %*% log(pr.t) + replace_na(mX, 0) %*% log(1 - pr.t)) %*% wts

    res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5, ]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5, ]), na.rm = TRUE)) %*% wts)

    all.equal(res1[, 1], res2)
    # [1] TRUE
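
The same zero-fill idea can also be sketched in base R, without tidyr. This is only an illustration under stated assumptions: the toy data (dimensions, seed, uniform weights) and the helper name `zero_na` are made up for the example and are not the sample objects from the question.

```r
# Toy data, assumed for illustration only
set.seed(1)
n_rows <- 6; n_items <- 5; n_nodes <- 7
X <- matrix(rbinom(n_rows * n_items, 1, 0.5), n_rows, n_items)
X[2, 3] <- NA
X[5, 1] <- NA
mX   <- 1 - X                                   # NA wherever X is NA
pr.t <- matrix(runif(n_items * n_nodes, 0.05, 0.95), n_items, n_nodes)
wts  <- rep(1 / n_nodes, n_nodes)

# Hypothetical helper: replace missing entries of a matrix with 0
zero_na <- function(m) {
  m[is.na(m)] <- 0
  m
}

# Vectorised form: a zero entry contributes nothing to the inner sums
res_vec <- (exp(zero_na(X) %*% log(pr.t) +
                zero_na(mX) %*% log(1 - pr.t)) %*% wts)[, 1]

# Row-by-row reference, mirroring the sapply() loop from the question
res_loop <- sapply(seq_len(nrow(X)), function(i) {
  exp(colSums(X[i, ] * log(pr.t), na.rm = TRUE) +
      colSums(mX[i, ] * log(1 - pr.t), na.rm = TRUE)) %*% wts
})

all.equal(res_vec, res_loop)
```

The equivalence relies on `log(pr.t)` and `log(1 - pr.t)` being finite. If any probability is exactly 0 or 1, `0 * -Inf` is `NaN` in the matrix form, while `na.rm = TRUE` drops it in the loop, so the two versions would differ.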

huangapple
  • Published on February 8, 2023, 23:03:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75387682.html