What is the most memory efficient way to append items to an existing list in R?


Question

I have a list in R, my_list2 in the example below.

I want to add items to the list in a way that minimises the peak RAM usage.

Is there a more memory efficient way to do this than using the append function?

I'm aware that it's best practice to create an 'empty' list then fill it as per my_list2 in the example below, but this isn't an option as the list already exists.

# If I could create the list from scratch I'd do it like this:
my_list <- vector('list', 10)
for (i in 1:10) {
  my_list[[i]] <- i
}

# Is there a better way than the 'append' function?
my_list2 <- list(1)
for (i in 2:10) {
  my_list2 <- append(my_list2, i)
}

Answer 1

Score: 5


Rather than using append() in each iteration, you could create a temporary list and append it to my_list2 only once at the end. Would this do the job for you?

Here's an example with 5k iterations in the for loop:

my_list &lt;- list(1)
my_list2 &lt;- list(1)

bench::mark(
  orig = {
    for (i in 2:5000) {
      my_list &lt;- append(my_list, i)
    }
    my_list
  },
  mine = {
    tmp &lt;- vector(&quot;list&quot;, 4999)
    for (i in 1:4999) {
      tmp[[i]] &lt;- i + 1
    }
    append(my_list2, tmp)
  },
  iterations = 10
)
#&gt; Warning: Some expressions had a GC in every iteration; so filtering is
#&gt; disabled.
#&gt; # A tibble: 2 &#215; 6
#&gt;   expression      min   median `itr/sec` mem_alloc `gc/sec`
#&gt;   &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt;     &lt;dbl&gt; &lt;bch:byt&gt;    &lt;dbl&gt;
#&gt; 1 orig       420.01ms    1.69s     0.567    95.7MB     13.6
#&gt; 2 mine         1.52ms      2ms   406.       96.8KB      0

Note that bench::mark() automatically checks that both expressions give the same output.

Answer 2

Score: 1


A practical solution with low peak RAM usage can look like:

my_list <- list(1)
N <- length(my_list)
length(my_list) <- N + 9
for (i in 2:10) {
  my_list[[N + i - 1]] <- i
  #gc() #Optional
}

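As a usage sketch of the same idea (my own wrapper, not part of the answer; the name append_in_place is hypothetical), the resize-then-fill pattern can be packaged as a small helper that grows an existing list once and then fills the new slots with [[<-:

# Hypothetical helper: grow the existing list once, then fill the new slots.
append_in_place <- function(x, new_items) {
  n <- length(x)
  length(x) <- n + length(new_items)   # single resize instead of repeated reallocations
  for (i in seq_along(new_items)) {
    x[[n + i]] <- new_items[[i]]       # fill in place; no further copies of x
  }
  x
}

my_list <- list(1)
my_list <- append_in_place(my_list, as.list(2:10))

The list is lengthened once (one new allocation plus a copy of the existing element pointers), and the subsequent [[<- assignments fill the new slots without copying the list again.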
You can use `gc` to get the **peak RAM usage**. The figures, however, depend heavily on whether a garbage collection happened during execution. To see the minimum possible peak, `gctorture` can be turned on, but execution then typically becomes much slower. Since the results could also be influenced by the order in which the methods are called, I start a fresh vanilla session each time.

#Using append
n <- 1e5
gctorture(on=TRUE)

set.seed(0)
L <- list(sample(n))
gc(reset=TRUE)
#         used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 285638 15.3     664228 35.5   285638 15.3
#Vcells 633121  4.9    8388608 64.0   633121  4.9
for (i in 2:10) L <- append(L, list(sample(n)))
gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
#Ncells  344156 18.4     664228 35.5   345174 18.5
#Vcells 1215086  9.3    8388608 64.0  1265554  9.7

#Using [[<-
n <- 1e5
gctorture(on=TRUE)

set.seed(0)
L <- list(sample(n))
gc(reset=TRUE)
#         used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 285638 15.3     664228 35.5   285638 15.3
#Vcells 633121  4.9    8388608 64.0   633121  4.9
for (i in 2:10) L[[length(L)+1]] <- sample(n)
gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
#Ncells  346937 18.6     664228 35.5   347919 18.6
#Vcells 1221639  9.4    8388608 64.0  1272088  9.8

#Using [[<- but resizing the list before
n <- 1e5
gctorture(on=TRUE)

set.seed(0)
L <- list(sample(n))
gc(reset=TRUE)
#         used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 285638 15.3     664228 35.5   285638 15.3
#Vcells 633121  4.9    8388608 64.0   633121  4.9
N <- length(L)
length(L) <- N + 9
for (i in 2:10) L[[N - 1 + i]] <- sample(n)
gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
#Ncells  346564 18.6     664228 35.5   347498 18.6
#Vcells 1220761  9.4    8388608 64.0  1271479  9.8

Here `append` needs 8.0 Mb and `[[<-` needs 8.2 Mb, regardless of whether the list is resized in advance or not.

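For reference, these peak figures appear to be the growth of the "max used (Mb)" totals between gc(reset=TRUE) and the final gc(). A small hypothetical helper (my own sketch, not part of the answer) that packages this calculation:

# Hypothetical helper: peak RAM (Mb) used while evaluating `expr`, measured as
# the growth of gc()'s "max used (Mb)" totals since gc(reset = TRUE).
peak_mb <- function(expr) {
  before <- gc(reset = TRUE)            # reset the "max used" statistics
  force(expr)                           # evaluate the measured code (passed in lazily)
  after <- gc()
  # last column is "max used (Mb)"; column 2 is "used (Mb)" at the reset point
  sum(after[, ncol(after)]) - sum(before[, 2])
}

L <- list(sample(1e5))
peak_mb({ for (i in 2:10) L <- append(L, list(sample(1e5))) })
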
---
Doing the same without `gctorture`, but calling `gc` manually after each step, gives:

#Using append
n <- 1e5

set.seed(0)
L <- list(sample(n))
gc(reset=TRUE)
#         used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 285638 15.3     664228 35.5   285638 15.3
#Vcells 633121  4.9    8388608 64.0   633121  4.9
for (i in 2:10) {L <- append(L, list(sample(n))); gc()}
gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
#Ncells  344145 18.4     664228 35.5   372952 20.0
#Vcells 1215054  9.3    8388608 64.0  1319826 10.1

#Using [[<-
n <- 1e5

set.seed(0)
L <- list(sample(n))
gc(reset=TRUE)
#         used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 285638 15.3     664228 35.5   285638 15.3
#Vcells 633121  4.9    8388608 64.0   633121  4.9
for (i in 2:10) {L[[length(L)+1]] <- sample(n); gc()}
gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
#Ncells  346926 18.6     664228 35.5   377474 20.2
#Vcells 1221607  9.4    8388608 64.0  1352555 10.4

#Using [[<- but resizing the list before
n <- 1e5

set.seed(0)
L <- list(sample(n))
gc(reset=TRUE)
#         used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 285638 15.3     664228 35.5   285638 15.3
#Vcells 633121  4.9    8388608 64.0   633121  4.9
N <- length(L)
length(L) <- N + 9
for (i in 2:10) {L[[N - 1 + i]] <- sample(n); gc()}
gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
#Ncells  347659 18.6     664771 35.6   374526 20.1
#Vcells 1223042  9.4    8388608 64.0  1273592  9.8

Here `append` needs 9.9 Mb, `[[<-` without resizing the list in advance needs 10.4 Mb, and with the list resized beforehand 9.7 Mb.

---
If you want to know the total amount of memory that was allocated (including memory that may have been freed again in the meantime), or are interested in other options, have a look at [Monitor memory usage in R](https://stackoverflow.com/questions/7856306).

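One of those options, Rprofmem(), logs the allocations R's memory profiler records, so their total can be summed whether or not the memory was freed again later. A minimal sketch (my own, assuming an R build compiled with memory profiling support; the file name allocs.out is arbitrary):

Rprofmem("allocs.out")                          # start logging allocations to a file
L <- list(1)
for (i in 2:10) L <- append(L, list(sample(1e5)))
Rprofmem(NULL)                                  # stop logging ("" also works)
lines <- readLines("allocs.out")
lines <- grep("^[0-9]", lines, value = TRUE)    # keep lines that start with a byte count
sum(as.numeric(sub("[^0-9].*", "", lines))) / 2^20   # total Mb allocated, freed or not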