Is using the lapply function worthwhile in lieu of a for-loop when building complex lists with multiple conditionals?

Question

In the example code below I create a function createBucket that reads through a vector (dfVector) and a list (dfList) comprised of two sublist dataframes, "DFOne" and "DFTwo". The function creates another list of dummy dataframes for each dfList sublist dataframe where it finds the element "Boy". This example code works as intended.

This is a simplification of the code I am working on. In the actual code, the equivalents of dfVector and dfList are reactive, expanding and contracting depending on Shiny inputs. There are other lists that the function reads, and there are other conditionals imposed as the vectors and lists are read through by the function. There are also calculations that feed from one sublist to another, instead of filling the sublist dataframes with zeroes as this example does for the sake of simplicity.

Given how much is going on in this function, is using lapply() or another apply-family function advisable? Speed is important, but the final dataframe generated by this and related functions won't qualify as "big data" (120 rows by 100+ columns). How could I use lapply() in the code below? I could run speed tests comparing the for-loop with lapply().

Code:

dfVector <- function(){c("DF One","DF Two")}

dfList <- list(DFOne = c("Boy","Cat","Dog"),DFTwo = c("Boy","Rat","Bat"))

createBucket <- function(nbr_rows) {
  series <- gsub("\\s+", "", dfVector())
  buckets <- list()
  
  for (i in seq_along(series)) {
    series_name <- series[i]
    dfListOrder <- dfList[[series_name]]
    
    if ("Boy" %in% dfListOrder) {
      df_name <- paste0("bucket", gsub("\\s+", "", series_name))
      bucket <- data.frame(
        A = rep(0, nbr_rows),
        B = rep(0, nbr_rows),
        check.names = FALSE
      )
      buckets[[df_name]] <- bucket
    }
  }
  if (length(buckets) > 0) {return(buckets)} else {return(NULL)}
}

result <- createBucket(10)
result

Answer 1

Score: 3

One approach:

createBucket2 <- function(nbr_rows){
  series <- gsub("\\s+", "", dfVector())
  series |>
    lapply(FUN = \(series_name){
      if('Boy' %in% dfList[[series_name]]){
        ## here's the actual performance boost:
        as.data.frame(matrix(0, nbr_rows, 2)) |>
          setNames(nm = c('A', 'B'))
      }
    }) |>
    setNames(nm = paste0('bucket', series)) |>
    (\(.) list(NULL, .)[[1 + (length(.) > 0)]])()
}

> identical(createBucket(10), createBucket2(10))
[1] TRUE
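One difference to be aware of: in the example both series contain "Boy", so the loop and the lapply version agree. For a series without "Boy", the anonymous function returns NULL and lapply keeps that NULL as a list element, whereas the loop simply skips the series. Below is a minimal sketch of a variant that drops those entries. The name createBucket3 and the third series "DF Three" are hypothetical, added only for illustration; everything else follows the question's setup.

```r
# Example inputs from the question, plus a hypothetical third series without "Boy"
dfVector <- function() c("DF One", "DF Two", "DF Three")
dfList <- list(
  DFOne   = c("Boy", "Cat", "Dog"),
  DFTwo   = c("Boy", "Rat", "Bat"),
  DFThree = c("Cat", "Rat")
)

# createBucket3 is a hypothetical name, not part of the original answer
createBucket3 <- function(nbr_rows) {
  series <- gsub("\\s+", "", dfVector())
  names(series) <- paste0("bucket", series)  # lapply() carries these names over

  buckets <- lapply(series, function(series_name) {
    # An if without else returns NULL for series that lack "Boy"
    if ("Boy" %in% dfList[[series_name]]) {
      setNames(as.data.frame(matrix(0, nbr_rows, 2)), c("A", "B"))
    }
  })

  buckets <- Filter(Negate(is.null), buckets)  # drop the NULL entries
  if (length(buckets) > 0) buckets else NULL
}

result <- createBucket3(10)
names(result)
```

With these inputs only the two series containing "Boy" should remain in the result, which matches what the for-loop would build.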

Edit: as for speed differences, the lapply variant would be about 10% faster than the loop variant when both build the dataframe the same way (not shown), but the real boost in performance, about three times as fast, comes from creating the bucket dataframe via as.data.frame(matrix(...)) rather than via data.frame(...).

loop variant: 314.8 µs

lapply variant: 77.2 µs

(in microseconds, median of 5000 runs using {microbenchmark})
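A rough way to run a similar comparison yourself is sketched below. It is not the benchmark used above: it swaps in base R's system.time() for {microbenchmark} to avoid a package dependency, and the names bucketLoop, bucketApply and time_runs are hypothetical. Absolute numbers will differ by machine, so only the relative difference is meaningful.

```r
# Self-contained copy of the example inputs from the question
dfVector <- function() c("DF One", "DF Two")
dfList <- list(DFOne = c("Boy", "Cat", "Dog"), DFTwo = c("Boy", "Rat", "Bat"))

# Loop version, building each bucket with data.frame()
bucketLoop <- function(nbr_rows) {
  series <- gsub("\\s+", "", dfVector())
  buckets <- list()
  for (series_name in series) {
    if ("Boy" %in% dfList[[series_name]]) {
      buckets[[paste0("bucket", series_name)]] <-
        data.frame(A = rep(0, nbr_rows), B = rep(0, nbr_rows))
    }
  }
  if (length(buckets) > 0) buckets else NULL
}

# lapply version, building each bucket with as.data.frame(matrix())
bucketApply <- function(nbr_rows) {
  series <- gsub("\\s+", "", dfVector())
  names(series) <- paste0("bucket", series)
  buckets <- lapply(series, function(series_name) {
    if ("Boy" %in% dfList[[series_name]]) {
      setNames(as.data.frame(matrix(0, nbr_rows, 2)), c("A", "B"))
    }
  })
  buckets <- Filter(Negate(is.null), buckets)
  if (length(buckets) > 0) buckets else NULL
}

# Confirm both versions give the same result before timing them
same_result <- identical(bucketLoop(10), bucketApply(10))

# Total elapsed seconds for n_runs calls of f(10), base R only
time_runs <- function(f, n_runs = 1000) {
  unname(system.time(for (k in seq_len(n_runs)) f(10))["elapsed"])
}

timings <- c(loop = time_runs(bucketLoop), lapply = time_runs(bucketApply))
print(same_result)
print(timings)
```

For the reactive Shiny case described in the question, it is probably more informative to time the functions with realistic inputs, since the cost of the other conditionals and calculations may outweigh the choice between a loop and lapply.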

huangapple
  • Posted on June 9, 2023, 01:11:44
  • When reposting, please keep this link: https://go.coder-hub.com/76434238.html