April 13, 2023, 15:52:20

Why does it take longer to iterate over a column of an R data frame than over an equivalent vector?

Question

I work with a large data frame in R (containing 2,310,000 rows).

I found that a loop that iterates directly over the elements of a data frame column can be very slow. I compared this to iterating over the elements of a vector of the same size:

df = data.frame(matrix(0, nrow = 2310000, ncol = 1))
t0 = Sys.time()
# iterating on data frame
df$var = 0
for (i in 1:100) {
  df$var[i] = 1
}
t1 = Sys.time()
# iterating on vector
df$var = 0
v_var = df$var
for (i in 1:100) {
  v_var[i] = 1
}
df$var = v_var
t2 = Sys.time()
print(t1 - t0) ; print(t2 - t1)

Output:

Time difference of 0.1035166 secs

Time difference of 0.0075109 secs

Can someone explain to me why iterating over the elements of a large data frame is slower? Thanks in advance.

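Answer 1

Score: 1

First, to set the stage and show that more than just extraction from a data.frame is going on, let's compare the speed of this operation across different object types: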

library(microbenchmark)
df  = data.frame(matrix(0, nrow = 2310000, ncol = 1))
df$var = 0
lst = as.list(df)
mat = data.matrix(df)
microbenchmark(
  in_df = for (i in 1:100) {
    df$var[i] = 1
  },
  in_mat = for (i in 1:100) {
    mat[i,"var"] = 1
  },
  in_lst = for (i in 1:100) {
    lst$var[i] = 1
  }
)

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
  in_df 223.218706 248.705949 293.525415 274.549170 336.379693 394.847852   100
 in_mat   1.304793   1.639721   1.932387   1.841773   2.086342   3.612089   100
 in_lst   1.571596   1.705783   2.076625   2.005105   2.316363   5.913198   100

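As you can see, the matrix and the list have no problem. A data.frame is also a list under the hood, so what is going on here? The answer can be found in the help page for the data.frame dollar replacement function: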

help(`$<-.data.frame`)

[...]

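There is no ‘data.frame’ method for ‘$’, so ‘x$name’ uses the default method which treats ‘x’ as a list (with partial matching of column names if the match is unique, see ‘Extract’). The replacement method (for ‘$’) checks ‘value’ for the correct number of rows, and replicates it if necessary.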

[...]

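So this tells us that, unlike with a plain list, assigning to a column of a data.frame also has to make sure that the length of the assigned value matches the number of rows in the data.frame. This is because a data.frame is a special kind of list: one whose elements all have the same length, so that it can be arranged as a table.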
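A small sketch of the difference described above, using a made-up four-row data frame (the names `df_small` and `lst_small` are only for illustration): a plain list accepts a value of any length, while the data.frame method recycles or rejects it.

```r
# Hypothetical toy objects, not the data from the question
df_small  <- data.frame(x = 1:4)
lst_small <- as.list(df_small)

# A plain list takes a value of any length without complaint
lst_small$var <- 1:3
length(lst_small$var)

# The data.frame method recycles a shorter value when the number of
# rows is a multiple of its length ...
df_small$var <- 1:2
df_small$var

# ... and stops when the value cannot be recycled evenly
msg <- tryCatch({
  df_small$bad <- 1:3
  "no error"
}, error = function(e) conditionMessage(e))
msg
```

Here `msg` holds the error message raised by the replacement method, since three values cannot fill four rows.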

You can see everything the data.frame method does by printing its source:

`$<-.data.frame`
function (x, name, value) 
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
                "replacement has %d rows, data has %d"), N, nrows), 
                domain = NA)
        if (N < nrows) 
            if (N > 0L && (nrows%%N == 0L) && length(dim(value)) <= 
                1L) 
                value <- rep(value, length.out = nrows)
            else stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
                "replacement has %d rows, data has %d"), N, nrows), 
                domain = NA)
        if (is.atomic(value) && !is.null(names(value))) 
            names(value) <- NULL
    }
    x[[name]] <- value
    class(x) <- cl
    return(x)
}
<bytecode: 0x7fd0667fce80>
<environment: namespace:base>
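To connect this method to the loop in the question: R evaluates a nested replacement such as `df$var[i] = 1` roughly as the two-step call sketched below, so `$<-.data.frame` runs on the whole data frame once per iteration. The expansion follows the description of subset assignment in the R Language Definition; `df_small`, `tmp`, `new_col`, and `i` are illustrative names only.

```r
df_small <- data.frame(var = rep(0, 5))  # toy data, not the question's
i <- 2

# df_small$var[i] <- 1 is evaluated approximately as:
tmp      <- df_small
new_col  <- `[<-`(tmp$var, i, value = 1)        # modify the extracted vector
df_small <- `$<-`(tmp, "var", value = new_col)  # dispatches to `$<-.data.frame`
rm(tmp)

df_small$var
```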

One way around this is to temporarily convert your data.frame into a list and convert it back afterwards.

microbenchmark(
  in_df = {
    df <- unclass(df)
    for (i in 1:100) {
      df$var[i] = 1
    }
    df <- list2DF(df)
  }
)

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 in_df 5.575034 5.938803 7.471679 5.988667 7.090439 16.38276   100

Or, if all your data is of the same type, store it in a matrix instead of a data.frame.
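As a rough illustration of the matrix option, assuming every column is numeric (`df_small`, `mat_small`, and `df_back` are toy names, and the size is reduced for brevity):

```r
df_small <- data.frame(x = rep(0, 1000), var = rep(0, 1000))

# data.matrix() converts an all-numeric data frame to a numeric matrix
mat_small <- data.matrix(df_small)

# Element-wise assignment on a matrix uses the default `[<-`,
# not a data.frame method
for (i in 1:100) {
  mat_small[i, "var"] <- 1
}

# Convert back only if a data frame is needed afterwards
df_back <- as.data.frame(mat_small)
sum(df_back$var)
```

This only makes sense when the columns share one type; a matrix would coerce mixed columns to a common type.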

Answer 2

Score: 0

Your first for loop consists of two steps. For each i:

You first subset df (df$var)
You then change a value in this vector (df$var)[i] = 1

These steps are performed 100 times.

In your second for loop there is only one operation in the for loop:

You change a value in the vector v_var (v_var[i] = 1)

You also do some operations outside the for loop, but because these are done once rather than 100 times, they have a negligible impact on the total time.
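Before timing anything, it may help to confirm that the two loops compute the same column, so that any difference is purely one of cost. A minimal sketch on a smaller, made-up data frame (`df_a` and `df_b` are illustrative names):

```r
df_a <- data.frame(x = rep(0, 1000))
df_a$var <- 0
df_b <- df_a

# First loop: extract and replace the column on every iteration
for (i in 1:100) {
  df_a$var[i] <- 1
}

# Second loop: extract once, modify the vector, assign back once
v_var <- df_b$var
for (i in 1:100) {
  v_var[i] <- 1
}
df_b$var <- v_var

# Compare the resulting columns
identical(df_a$var, df_b$var)
```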

I used the microbenchmark package to demonstrate this:

library(microbenchmark)
df = data.frame(matrix(0, nrow = 23100, ncol = 1))
# iterating on data frame
df$var = 0
v_var = df$var
microbenchmark(
  in_df = for (i in 1:100) {
    df$var[i] = 1
  },
  outside_df = {
    v_var = df$var
    for (i in 1:100) {
      v_var[i] = 1
    }
    df$var = v_var
  },
  access_df = for (i in 1:100) {
    df$var
  }
)
#> Unit: milliseconds
#>        expr       min        lq     mean    median       uq       max neval
#>       in_df 21.507375 23.080368 34.24766 26.196194 30.23929 239.06351   100
#>  outside_df 11.200713 12.008956 14.52469 13.312613 14.74701  74.32721   100
#>   access_df  8.769038  9.204913 10.35766  9.983931 10.78568  17.01080   100

We measure three things here:

in_df is your first for loop.
outside_df is your second for loop.
access_df is the time taken to access df$var 100 times.

Note that ignoring some outliers, the time taken by in_df is approximately the time
