英文:
How to best optimize my R code and avoid looping
问题
我现在有一段代码,它存在于一个优化程序中。下面是用来演示这段代码如何工作的样本对象。
当矩阵 `X` 的所有元素都被观察到时,计算非常高效,可以使用以下方式用 ```res1``` 进行写作。在 ```res2``` 中创建的对象产生与 `res1` 相同的结果,但会遍历行,这在R中非常昂贵和低效。
### 如果所有都被观察到
res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
all.equal(res1[,1], res2)
现在,问题出在我的真实场景中,矩阵 `X` 中往往会有缺失值。因此,对于 `res1` 的计算会在其第一个元素中产生一个 `NA`,就像在这个新示例中所示的那样(出于显而易见的原因,这不是我的问题)。通过 `res2` 创建的对象在这种情况下会给出我所需要的结果,但会退回到一个循环,从理论上讲符合我想要的,但在计算上不可取。
### 这将无法正常工作,如预期的那样。
res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
我的问题是,是否有人知道一种方法可以在矩阵 `X` 中存在缺失数据时产生与 `res2` 相同的结果,就像我使用 `sapply()` 方法一样,但在大矩阵计算方面同样高效?
我看到两个选项,我正在探索这两个选项。一个选项可能是对循环使用并行处理,第二个选项可能是使用Rcpp。这两个都是不错的选择。然而,在选择这两条路径之前,我想要一些帮助,看看是否有人看到了我没有看到的一个非常好的计算实现?
英文:
I have a piece of code now that lives inside an optimization routine. Below at the bottom are sample objects to use to see how this code works.
When all elements of the matrix X
are observed, the calculation is very efficient and can be written as follows using res1
. The object created in res2
produces the same result as res1
but loops over rows and is very expensive and inefficient in R.
### If everything is observed
res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
all.equal(res1[,1], res2)
Now, the problem is in my real world scenario, there will often be missing values in the matrix X
. As such, the calculation for res1
would yield an NA
for its first element as shown in this new example (for obvious reasons, this is not my question). The object created by res2
gives exactly what I would need in this instance, but reverts to a loop and then becomes theoretically right in terms of what I want, but computationally not desirable.
### This would not work, as expected.
res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
My question is whether anyone is aware of a way to produce the same result as res2
when there is missing data in X
as I do with the sapply()
method but is equally as efficient as the big matrix calculation?
I see two options, both of which I am exploring. One option could be to use parallel processing for the loop and a second option could be to use Rcpp. Both decent options. However, before going down either of those two pathways, I'm asking for some help to learn if anyone sees a really nice computational implementation that I am not seeing?
### Objects to run sample code
X <- structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L,
5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4",
"Item 5")))
pr.t <- structure(c(0.000389840525419771, 0.000389840525419771, 0.000389840525419771,
0.000389840525419771, 0.000389840525419771, 0.00116782384335194,
0.00116782384335194, 0.00116782384335194, 0.00116782384335194,
0.00116782384335194, 0.00293127561410344, 0.00293127561410344,
0.00293127561410344, 0.00293127561410344, 0.00293127561410344,
0.00672641421586068, 0.00672641421586068, 0.00672641421586068,
0.00672641421586068, 0.00672641421586068, 0.0145666908055583,
0.0145666908055583, 0.0145666908055583, 0.0145666908055583, 0.0145666908055583,
0.0301824687604691, 0.0301824687604691, 0.0301824687604691, 0.0301824687604691,
0.0301824687604691, 0.0600531695657659, 0.0600531695657659, 0.0600531695657659,
0.0600531695657659, 0.0600531695657659, 0.114143103288218, 0.114143103288218,
0.114143103288218, 0.114143103288218, 0.114143103288218, 0.204278364784018,
0.204278364784018, 0.204278364784018, 0.204278364784018, 0.204278364784018,
0.336697623276164, 0.336697623276164, 0.336697623276164, 0.336697623276164,
0.336697623276164, 0.5, 0.5, 0.5, 0.5, 0.5, 0.663302376723836,
0.663302376723836, 0.663302376723836, 0.663302376723836, 0.663302376723836,
0.795721635215982, 0.795721635215982, 0.795721635215982, 0.795721635215982,
0.795721635215982, 0.885856896711782, 0.885856896711782, 0.885856896711782,
0.885856896711782, 0.885856896711782, 0.939946830434234, 0.939946830434234,
0.939946830434234, 0.939946830434234, 0.939946830434234, 0.969817531239531,
0.969817531239531, 0.969817531239531, 0.969817531239531, 0.969817531239531,
0.985433309194442, 0.985433309194442, 0.985433309194442, 0.985433309194442,
0.985433309194442, 0.993273585784139, 0.993273585784139, 0.993273585784139,
0.993273585784139, 0.993273585784139, 0.997068724385897, 0.997068724385897,
0.997068724385897, 0.997068724385897, 0.997068724385897, 0.998832176156648,
0.998832176156648, 0.998832176156648, 0.998832176156648, 0.998832176156648,
0.99961015947458, 0.99961015947458, 0.99961015947458, 0.99961015947458,
0.99961015947458), dim = c(5L, 21L))
wts <- c(2.09899121956567e-14, 4.97536860412164e-11, 1.45066128449311e-08,
1.22535483614825e-06, 4.21923474255167e-05, 0.000708047795481538,
0.00643969705140876, 0.033952729786543, 0.108392285626419, 0.21533371569506,
0.270260183572876, 0.21533371569506, 0.10839228562642, 0.0339527297865429,
0.00643969705140878, 0.000708047795481537, 4.21923474255168e-05,
1.22535483614826e-06, 1.45066128449309e-08, 4.97536860412161e-11,
2.09899121956567e-14)
mX <- 1 - X
答案1
得分: 3
请注意,在res2
中,您使用了colSums(.,na.rm=T)
,在这种情况下等同于将缺失值设为0。因此,我们可以对res1
采取相同的操作:
library(tidyr)
res1 <- exp(replace_na(X, 0) %*% log(pr.t) + replace_na(mX, 0) %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5,]), na.rm = TRUE)) %*% wts)
# 使用以下带有1个缺失值的X
X <- structure(c(0L, 0L, NA_real_, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L,
5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4",
"Item 5")))
> all.equal(res1[, 1], res2)
[1] TRUE
以上是您提供的代码的翻译。
英文:
Note that you use colSums(.,na.rm=T)
in res2
, which in this case is equivalent to setting missing value to 0. Therefore, we can do the same to res1
:
library(tidyr)
res1 <- exp(replace_na(X ,0)%*% log(pr.t) + replace_na(mX ,0)%*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
#Using the following X with 1 missing value
X <- structure(c(0L, 0L, NA_real_, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L,
5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4",
"Item 5")))
> all.equal(res1[,1], res2)
[1] TRUE
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论