2023-02-24 14:03:06

Replacing nested for loops with lapply in R

Question

I have a large dataset, and it takes forever to get the results using for loops. It seems I can use lapply instead, but I'm having trouble using it for my analysis.

Sample code is below. I am using a data.table instead of a data frame.

library(data.table)
allCountries = rep(rep(LETTERS[1:3],3),3)
allYears = rep(rep(1991:1993, each=3),3)
myData = data.table(allCountries,allYears)  
myData[,variable1 := rnorm(nrow(myData))]
myData[,variable2 := rnorm(nrow(myData))]
myData2 = myData[,.(variable3=mean(variable1)),by=.(allCountries,allYears)]
myData2[,variable4:=rnorm(nrow(myData2))]
myFunction = function(x,y){summary(lm(y~x))}
for(ii in unique(myData$allCountries)){
  for(jj in unique(myData$allYears)){
    xx=myData[allCountries==ii & allYears==jj,variable1]
    yy=myData[allCountries==ii & allYears==jj,variable2]
    test = myFunction(xx,yy)
    a=test$coefficients[2]
    myData2[allCountries==ii & allYears==jj,result:=a]
  }
}

I'm trying to fit the model to each subset of the data and record the result in another dataset. I understand the logic of lapply, but I'm struggling to implement it. Any help would be much appreciated!

Answer 1

Score: 0

The trick is to split the data by allCountries and allYears. This creates a list of data.tables, one per country-year combination, and lapply can then operate on each element.

library(data.table)
# original code
allCountries = rep(rep(LETTERS[1:3],3),3)
allYears = rep(rep(1991:1993, each=3),3)
# make the results reproducible
set.seed(2023)
myData = data.table(allCountries,allYears)  
myData[,variable1 := rnorm(nrow(myData))]
myData[,variable2 := rnorm(nrow(myData))]
myData2 = myData[,.(variable3=mean(variable1)),by=.(allCountries,allYears)]
myData2[,variable4:=rnorm(nrow(myData2))]
myFunction = function(x,y){summary(lm(y~x))}
for(ii in unique(myData$allCountries)){
  for(jj in unique(myData$allYears)){
    xx=myData[allCountries==ii & allYears==jj,variable1]
    yy=myData[allCountries==ii & allYears==jj,variable2]
    test = myFunction(xx, yy)
    a = test$coefficients[2]
    myData2[allCountries==ii & allYears==jj, result := a]
  }
}
# save to compare later
md2 <- myData2

# lapply code starts here
rm(list = ls(pattern = "^myData"))
# restart the pseudo-RNG and reproduce the data
set.seed(2023)
myData = data.table(allCountries,allYears)  
myData[,variable1 := rnorm(nrow(myData))]
myData[,variable2 := rnorm(nrow(myData))]
#
myData2 = myData[,.(variable3=mean(variable1)),by=.(allCountries,allYears)]
myData2[,variable4:=rnorm(nrow(myData2))]
sp <- split(myData, list(myData$allCountries, myData$allYears))
# data.table transforms the data in place so 'res' is
# not strictly needed but it avoids printing lapply's output
res <- lapply(sp, \(X) {
  xx <- X[, variable1]
  yy <- X[, variable2]
  test <- myFunction(xx, yy)
  a <- test$coefficients[2]
  myData2[allCountries == X$allCountries[1] & allYears == X$allYears[1], result := a]
})
identical(md2, myData2)
#> [1] TRUE
rm(sp, res)    # final clean-up

Created on 2023-02-24 with reprex v2.0.2
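
To make the split step more concrete, here is a minimal sketch on the same toy data. It shows what split() returns, and a variant in which each lapply call returns a one-row data.table that is then combined with rbindlist(), rather than updating myData2 by reference. The names fits and slopes are illustrative and do not appear in the answer above; coef() is used here in place of summary() to pull out the slope.

```r
library(data.table)

set.seed(2023)
# Same toy data as in the question
allCountries <- rep(rep(LETTERS[1:3], 3), 3)
allYears <- rep(rep(1991:1993, each = 3), 3)
myData <- data.table(allCountries, allYears)
myData[, variable1 := rnorm(.N)]
myData[, variable2 := rnorm(.N)]

# split() returns a named list with one data.table per country-year pair
sp <- split(myData, list(myData$allCountries, myData$allYears))
length(sp)   # number of groups
names(sp)    # each name combines the two grouping values

# Each call returns a one-row data.table instead of modifying another table
fits <- lapply(sp, function(X) {
  data.table(
    allCountries = X$allCountries[1],
    allYears     = X$allYears[1],
    result       = coef(lm(variable2 ~ variable1, data = X))[[2]]  # slope
  )
})

# Stack the one-row tables into a single result table
slopes <- rbindlist(fits)
```

Returning values and binding them afterwards keeps the function passed to lapply free of side effects, which can make it easier to test or to swap in a parallel apply later.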

Edit

Here is a simplification of the lapply loop above.

# this code is simpler than the lapply code above
# and their results (myData2) are identical()
# it reuses 'sp' and 'myData2', so run it before the final clean-up
res <- lapply(sp, \(X) {
  test <- with(X, myFunction(variable1, variable2))
  a <- test$coefficients[2]
  myData2[allCountries == X$allCountries[1] & allYears == X$allYears[1], result := a]
})
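
As an alternative to the answer's approach, and not part of it, data.table's own grouping with by can fit one model per country-year pair without split() or lapply at all. The sketch below assumes the same toy data as the question; slopes is an illustrative name, and the last line copies the slope into myData2 with an update join on the two key columns.

```r
library(data.table)

set.seed(2023)
# Same toy data as in the question
allCountries <- rep(rep(LETTERS[1:3], 3), 3)
allYears <- rep(rep(1991:1993, each = 3), 3)
myData <- data.table(allCountries, allYears)
myData[, variable1 := rnorm(.N)]
myData[, variable2 := rnorm(.N)]

# Summary table, as in the question
myData2 <- myData[, .(variable3 = mean(variable1)),
                  by = .(allCountries, allYears)]

# Fit one regression per group; [[2]] picks the slope coefficient
slopes <- myData[, .(result = coef(lm(variable2 ~ variable1))[[2]]),
                 by = .(allCountries, allYears)]

# Update join: add the slope to the matching rows of myData2 by reference
myData2[slopes, on = .(allCountries, allYears), result := i.result]
```

On a large dataset this avoids rescanning the whole table for every country-year pair, which is what the filters inside the nested loops do. How much time it saves will depend on the data, so it is worth timing both versions.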
