Why does it take longer to iterate over a column of an R data frame than over an equivalent vector?

Question

I work with a large data frame in R (containing 2,310,000 rows).

I found that a loop that iterates directly over the elements of a data frame column can be very slow. I compared this with iterating over the elements of a vector of the same size:

    df = data.frame(matrix(0, nrow = 2310000, ncol = 1))
    t0 = Sys.time()
    # iterating on data frame
    df$var = 0
    for (i in 1:100) {
      df$var[i] = 1
    }
    t1 = Sys.time()
    # iterating on vector
    df$var = 0
    v_var = df$var
    for (i in 1:100) {
      v_var[i] = 1
    }
    df$var = v_var
    t2 = Sys.time()
    print(t1 - t0) ; print(t2 - t1)

Output:

    Time difference of 0.1035166 secs
    Time difference of 0.0075109 secs

Can someone explain why iterating over the elements of a large data frame is slower? Thanks in advance.

Answer 1

Score: 1

First, to set the stage and show that more is happening here than extraction from a data.frame, let's compare the speed of this operation across different object types:

    library(microbenchmark)
    df = data.frame(matrix(0, nrow = 2310000, ncol = 1))
    df$var = 0
    lst = as.list(df)
    mat = data.matrix(df)
    microbenchmark(
      in_df = for (i in 1:100) {
        df$var[i] = 1
      },
      in_mat = for (i in 1:100) {
        mat[i, "var"] = 1
      },
      in_lst = for (i in 1:100) {
        lst$var[i] = 1
      }
    )

    Unit: milliseconds
       expr        min         lq       mean     median         uq        max neval
      in_df 223.218706 248.705949 293.525415 274.549170 336.379693 394.847852   100
     in_mat   1.304793   1.639721   1.932387   1.841773   2.086342   3.612089   100
     in_lst   1.571596   1.705783   2.076625   2.005105   2.316363   5.913198   100

As you can see, the matrix and the list have no such problem, even though a data.frame is also a list under the hood. So what is going on here? The answer can be found in the help page for the data.frame dollar replacement function:

    help(`$<-.data.frame`)

    [...]
    There is no data.frame method for ‘$’, so ‘x$name’ uses the
    default method which treats ‘x’ as a list (with partial matching
    of column names if the match is unique, see ‘Extract’). The
    replacement method (for ‘$’) checks ‘value’ for the correct number
    of rows, and replicates it if necessary.
    [...]

So this tells us that, unlike with a plain list, assigning to a column of a data.frame also has to make sure that the length of the assigned value matches the number of rows in the data.frame. This is because a data.frame is a special kind of list: one in which all elements have the same length, so that it can be arranged as a table.
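The row-count check described above can be seen on a small example. This is only an illustrative sketch: the objects `df`, `lst` and `res` and the column names `x`, `y`, `z` are made up for the demonstration and do not come from the question.

```r
# Hypothetical toy objects, used only to illustrate the check.
df  <- data.frame(x = 1:6)
lst <- as.list(df)

# A plain list stores whatever it is given, whatever its length.
lst$y <- c(1, 2)
lst$z <- 1:4

# The data.frame method recycles a value whose length divides nrow(df) ...
df$y <- c(1, 2)
print(length(df$y))

# ... and refuses a value whose length does not divide nrow(df).
res <- tryCatch({
  df$z <- 1:4
  "ok"
}, error = function(e) conditionMessage(e))
print(res)
```

In the loop from the question, this method is entered on every iteration, which the list version never has to do.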

You can see everything the data.frame method does by inspecting its source:

    > `$<-.data.frame`
    function (x, name, value)
    {
        cl <- oldClass(x)
        class(x) <- NULL
        nrows <- .row_names_info(x, 2L)
        if (!is.null(value)) {
            N <- NROW(value)
            if (N > nrows)
                stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
                    "replacement has %d rows, data has %d"), N, nrows),
                    domain = NA)
            if (N < nrows)
                if (N > 0L && (nrows%%N == 0L) && length(dim(value)) <=
                    1L)
                    value <- rep(value, length.out = nrows)
                else stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
                    "replacement has %d rows, data has %d"), N, nrows),
                    domain = NA)
            if (is.atomic(value) && !is.null(names(value)))
                names(value) <- NULL
        }
        x[[name]] <- value
        class(x) <- cl
        return(x)
    }
    <bytecode: 0x7fd0667fce80>
    <environment: namespace:base>

One way to bypass this is to temporarily turn your data.frame into a plain list, and then turn it back into a data.frame afterwards.

    microbenchmark(
      in_df = {
        df <- unclass(df)    # drop the class: df is now a plain list
        for (i in 1:100) {
          df$var[i] = 1
        }
        df <- list2DF(df)    # rebuild the data.frame
      }
    )

    Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval
     in_df 5.575034 5.938803 7.471679 5.988667 7.090439 16.38276   100

Or, if all your data is of the same type, store it in a matrix instead of a data.frame.
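A further option, not part of the answer above and offered only as a sketch: when the positions to change are known up front, a single vectorised assignment calls the data.frame replacement method once instead of once per element. The size `n` and the index vector `idx` below are illustrative assumptions.

```r
# Illustrative size and indices, not taken from the question.
n   <- 1000L
df  <- data.frame(var = rep(0, n))
idx <- 1:100

# One call to `$<-.data.frame` for all 100 elements.
df$var[idx] <- 1

print(sum(df$var))
```

This only applies when the new values do not depend on earlier iterations of the loop; otherwise the list or matrix approaches above are the ones to reach for.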

Answer 2

Score: 0

Your first for loop consists of two steps. For each i:

  • You first subset df (df$var)
  • You then change a value in this vector: (df$var)[i] = 1

These steps are performed 100 times.
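The two steps can be written out by hand. The sketch below follows the way R evaluates nested replacement calls; the temporary name `tmp` stands in for R's internal `*tmp*` variable, and the five-row data frame is an illustrative assumption.

```r
# Small illustrative data frame.
df <- data.frame(var = rep(0, 5))
i  <- 2

# Roughly what `df$var[i] <- 1` does on every iteration:
tmp <- df
new_col <- `[<-`(tmp$var, i, value = 1)      # extract the column, then modify one element
df <- `$<-`(tmp, "var", value = new_col)     # write the whole column back via `$<-.data.frame`
rm(tmp)

print(df$var)
```

So each pass through the loop extracts the column, modifies a copy of it, and assigns the full column back to the data frame.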

In your second for loop there is only one operation in the for loop:

  • You change a value in the vector v_var (v_var[i] = 1)

You also do some operations outside the for loop, but because these are done once rather than 100 times, they have a negligible impact on the total time.

I used the microbenchmark package to demonstrate this:

    library(microbenchmark)
    df = data.frame(matrix(0, nrow = 23100, ncol = 1))
    # iterating on data frame
    df$var = 0
    v_var = df$var
    microbenchmark(
      in_df = for (i in 1:100) {
        df$var[i] = 1
      },
      outside_df = {
        v_var = df$var
        for (i in 1:100) {
          v_var[i] = 1
        }
        df$var = v_var
      },
      access_df = for (i in 1:100) {
        df$var
      }
    )
    #> Unit: milliseconds
    #>        expr       min        lq     mean    median       uq       max neval
    #>       in_df 21.507375 23.080368 34.24766 26.196194 30.23929 239.06351   100
    #>  outside_df 11.200713 12.008956 14.52469 13.312613 14.74701  74.32721   100
    #>   access_df  8.769038  9.204913 10.35766  9.983931 10.78568  17.01080   100

We measure three things here:

  • in_df is your first for loop.
  • outside_df is your second for loop.
  • access_df is the time taken to access df$var 100 times.

Note that ignoring some outliers, the time taken by in_df is approximately the time

huangapple
  • Published on April 13, 2023, 15:52:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76002952.html