February 8, 2023, 23:03:26

How to best optimize my R code and avoid looping

Question

I have a piece of code that lives inside an optimization routine. Sample objects for running it are at the bottom of this post.

When all elements of the matrix `X` are observed, the calculation is very efficient and can be written as `res1` below. The object `res2` produces the same result as `res1` but loops over rows, which is expensive and inefficient in R.

    ### If everything is observed
    res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
    res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5, ]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5, ]), na.rm = TRUE)) %*% wts)
    all.equal(res1[, 1], res2)

Now, the problem is that in my real-world scenario there will often be missing values in the matrix `X`. The calculation for `res1` would then yield an `NA` for its first element, as shown in this new example (for obvious reasons; that is not my question). The object created by `res2` gives exactly what I need in this case, but it falls back to a loop: correct in terms of what I want, but computationally undesirable.

    ### This would not work, as expected.
    res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
    res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5, ]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5, ]), na.rm = TRUE)) %*% wts)
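To see why the matrix form breaks, here is a minimal sketch with toy values (`A` and `B` are hypothetical and are not the objects from this post). A single `NA` in a row makes that row's `%*%` result `NA`, while a row-wise sum with `na.rm = TRUE` simply drops the missing term.

```r
# Toy data for illustration only (A and B are hypothetical, not from the post)
A <- matrix(c(1, NA, 0, 1), nrow = 2)  # rows are (1, 0) and (NA, 1)
B <- c(0.5, 0.25)

# Matrix product: the NA propagates to the second row of the result
mat_res <- A %*% B

# Row-wise sum with na.rm = TRUE: the missing term is skipped instead
loop_res <- sapply(seq_len(nrow(A)), function(i) sum(A[i, ] * B, na.rm = TRUE))

print(mat_res[, 1])
print(loop_res)
```

The first row agrees in both versions; only the row containing the missing value differs.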

My question: does anyone know a way to produce the same result as `res2` when `X` has missing data, as the `sapply()` method does, while being as efficient as the big matrix calculation?

I see two options, both of which I am exploring: parallel processing for the loop, or Rcpp. Both are decent options. However, before going down either path, I'd like to know whether anyone sees a really nice computational implementation that I am missing.

    ### Objects to run sample code
    X <- structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L,
    5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4",
    "Item 5")))
    pr.t <- structure(c(0.000389840525419771, 0.000389840525419771, 0.000389840525419771,
    0.000389840525419771, 0.000389840525419771, 0.00116782384335194,
    0.00116782384335194, 0.00116782384335194, 0.00116782384335194,
    0.00116782384335194, 0.00293127561410344, 0.00293127561410344,
    0.00293127561410344, 0.00293127561410344, 0.00293127561410344,
    0.00672641421586068, 0.00672641421586068, 0.00672641421586068,
    0.00672641421586068, 0.00672641421586068, 0.0145666908055583,
    0.0145666908055583, 0.0145666908055583, 0.0145666908055583, 0.0145666908055583,
    0.0301824687604691, 0.0301824687604691, 0.0301824687604691, 0.0301824687604691,
    0.0301824687604691, 0.0600531695657659, 0.0600531695657659, 0.0600531695657659,
    0.0600531695657659, 0.0600531695657659, 0.114143103288218, 0.114143103288218,
    0.114143103288218, 0.114143103288218, 0.114143103288218, 0.204278364784018,
    0.204278364784018, 0.204278364784018, 0.204278364784018, 0.204278364784018,
    0.336697623276164, 0.336697623276164, 0.336697623276164, 0.336697623276164,
    0.336697623276164, 0.5, 0.5, 0.5, 0.5, 0.5, 0.663302376723836,
    0.663302376723836, 0.663302376723836, 0.663302376723836, 0.663302376723836,
    0.795721635215982, 0.795721635215982, 0.795721635215982, 0.795721635215982,
    0.795721635215982, 0.885856896711782, 0.885856896711782, 0.885856896711782,
    0.885856896711782, 0.885856896711782, 0.939946830434234, 0.939946830434234,
    0.939946830434234, 0.939946830434234, 0.939946830434234, 0.969817531239531,
    0.969817531239531, 0.969817531239531, 0.969817531239531, 0.969817531239531,
    0.985433309194442, 0.985433309194442, 0.985433309194442, 0.985433309194442,
    0.985433309194442, 0.993273585784139, 0.993273585784139, 0.993273585784139,
    0.993273585784139, 0.993273585784139, 0.997068724385897, 0.997068724385897,
    0.997068724385897, 0.997068724385897, 0.997068724385897, 0.998832176156648,
    0.998832176156648, 0.998832176156648, 0.998832176156648, 0.998832176156648,
    0.99961015947458, 0.99961015947458, 0.99961015947458, 0.99961015947458,
    0.99961015947458), dim = c(5L, 21L))
    wts <- c(2.09899121956567e-14, 4.97536860412164e-11, 1.45066128449311e-08,
    1.22535483614825e-06, 4.21923474255167e-05, 0.000708047795481538,
    0.00643969705140876, 0.033952729786543, 0.108392285626419, 0.21533371569506,
    0.270260183572876, 0.21533371569506, 0.10839228562642, 0.0339527297865429,
    0.00643969705140878, 0.000708047795481537, 4.21923474255168e-05,
    1.22535483614826e-06, 1.45066128449309e-08, 4.97536860412161e-11,
    2.09899121956567e-14)
    mX <- 1 - X

Answer 1

Score: 3

Note that `res2` uses `colSums(., na.rm = TRUE)`, which in this case is equivalent to setting the missing values to 0, because a zero term adds nothing to the sum. We can therefore do the same for `res1`:

    library(tidyr)

    # Using the following X with 1 missing value (defined first so the code runs top to bottom)
    X <- structure(c(0L, 0L, NA_real_, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                     0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L),
                   dim = c(5L, 5L),
                   dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5")))
    mX <- 1 - X  # recomputed as in the question, so mX is NA wherever X is NA

    res1 <- exp(replace_na(X, 0) %*% log(pr.t) + replace_na(mX, 0) %*% log(1 - pr.t)) %*% wts
    res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i, 1:5] * log(pr.t[1:5, ]), na.rm = TRUE) + colSums(mX[i, 1:5] * log(1 - pr.t[1:5, ]), na.rm = TRUE)) %*% wts)
    all.equal(res1[, 1], res2)
    # [1] TRUE
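The same zero-filling idea can also be sketched in base R, without depending on tidyr. This is only an illustration under stated assumptions: `zero_fill()` is a hypothetical helper, and the small random `X`, `pr.t`, and `wts` below are made-up stand-ins for the objects in the question, sized so the block runs on its own.

```r
set.seed(1)

# Made-up stand-ins for the question's objects (not the real data)
n_rows <- 6; n_items <- 4; n_nodes <- 3
X <- matrix(rbinom(n_rows * n_items, 1, 0.5), n_rows, n_items)
X[cbind(c(1, 3, 3), c(2, 1, 4))] <- NA    # inject a few missing responses
mX <- 1 - X                                # NA wherever X is NA
pr.t <- matrix(runif(n_items * n_nodes, 0.05, 0.95), n_items, n_nodes)
wts <- c(0.25, 0.5, 0.25)

# Hypothetical helper: replace every NA in a matrix with 0
zero_fill <- function(m) {
  m[is.na(m)] <- 0
  m
}

# Vectorized version: a zero in both X and mX makes the item contribute nothing
res_mat <- exp(zero_fill(X) %*% log(pr.t) + zero_fill(mX) %*% log(1 - pr.t)) %*% wts

# Row-by-row reference, mirroring the sapply() approach with na.rm = TRUE
res_loop <- sapply(seq_len(nrow(X)), function(i) {
  exp(colSums(X[i, ] * log(pr.t), na.rm = TRUE) +
      colSums(mX[i, ] * log(1 - pr.t), na.rm = TRUE)) %*% wts
})

stopifnot(isTRUE(all.equal(res_mat[, 1], res_loop)))
```

Filling both `X` and `mX` with zeros matters: a missing item then adds neither `log(p)` nor `log(1 - p)` to the exponent, which is what `na.rm = TRUE` does in the loop.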
