Why does it take longer to iterate over the column of an R data frame than to iterate over an equivalent vector?

Question

I work with a large data frame in R (containing 2,310,000 rows).

I found that a loop that iterates directly over the elements of a data frame column can be very slow. I compared this with iterating over the elements of a vector of the same size:

df = data.frame(matrix(0, nrow = 2310000, ncol = 1))

t0 = Sys.time()

# iterating on data frame
df$var = 0
for (i in 1:100) {
  df$var[i] = 1
}

t1 = Sys.time()

# iterating on vector
df$var = 0
v_var = df$var
for (i in 1:100) {
  v_var[i] = 1
}
df$var = v_var

t2 = Sys.time()

print(t1 - t0) ; print(t2 - t1)

Output:

Time difference of 0.1035166 secs

Time difference of 0.0075109 secs

Can someone explain why iterating over the elements of a large data frame column is slower? Thanks in advance.

Answer 1

Score: 1

First, to set the stage and show that the slowdown is not just the cost of extracting a column from a data.frame, let's compare the speed of this operation across different object types:

library(microbenchmark)

df  = data.frame(matrix(0, nrow = 2310000, ncol = 1))
df$var = 0

lst = as.list(df)
mat = data.matrix(df)


microbenchmark(
  in_df = for (i in 1:100) {
    df$var[i] = 1
  },
  in_mat = for (i in 1:100) {
    mat[i, "var"] = 1
  },
  in_lst = for (i in 1:100) {
    lst$var[i] = 1
  }
)

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
  in_df 223.218706 248.705949 293.525415 274.549170 336.379693 394.847852   100
 in_mat   1.304793   1.639721   1.932387   1.841773   2.086342   3.612089   100
 in_lst   1.571596   1.705783   2.076625   2.005105   2.316363   5.913198   100

As you can see, the matrix and the list have no such problem, even though a data.frame is also a list under the hood. So what is going on here? The answer can be found in the help page for the data.frame dollar replacement function:

help(`$<-.data.frame`)

  [...]

  There is no ‘data.frame’ method for ‘$’, so ‘x$name’ uses the
  default method which treats ‘x’ as a list (with partial matching
  of column names if the match is unique, see ‘Extract’).  The
  replacement method (for ‘$’) checks ‘value’ for the correct number
  of rows, and replicates it if necessary.

  [...]

This tells us that, unlike with a plain list, assigning to a column of a data.frame also has to make sure that the length of the assigned value matches the number of rows in the data.frame. That is because a data.frame is a special kind of list: one where all elements have the same length, so that it can be arranged as a table. Every df$var[i] = 1 therefore goes through this R-level replacement method rather than the default list code, and that extra work is repeated on each iteration.
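The row-count check described in the help text can be seen in a small sketch. The objects df_small and lst_small are illustrative names that do not appear in the original answer, and the sizes are kept tiny on purpose:

```r
# Illustrative objects (hypothetical names, not from the answer)
df_small  <- data.frame(a = 1:4)
lst_small <- list(a = 1:4)

# data.frame: a length-2 value is recycled to fill all 4 rows
df_small$b <- c(10, 20)

# plain list: the value is stored as-is, with no length check
lst_small$b <- c(10, 20)

# data.frame: a length that does not divide the row count is an error
outcome <- tryCatch({
  df_small$c <- 1:3
  "assigned"
}, error = function(e) "error")

print(df_small$b)           # recycled column
print(length(lst_small$b))  # list element keeps its own length
print(outcome)              # whether the mismatched assignment succeeded
```

The list accepts any length, while the data.frame method either recycles the value or stops, which is the behaviour the help page describes.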

You can look at everything the method for data.frame does by inspecting this code:

 > `$<-.data.frame`
function (x, name, value) 
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
                "replacement has %d rows, data has %d"), N, nrows), 
                domain = NA)
        if (N < nrows) 
            if (N > 0L && (nrows%%N == 0L) && length(dim(value)) <= 
                1L) 
                value <- rep(value, length.out = nrows)
            else stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
                "replacement has %d rows, data has %d"), N, nrows), 
                domain = NA)
        if (is.atomic(value) && !is.null(names(value))) 
            names(value) <- NULL
    }
    x[[name]] <- value
    class(x) <- cl
    return(x)
}
<bytecode: 0x7fd0667fce80>
<environment: namespace:base>

One way to bypass this is to temporarily convert your data.frame into a list, and then convert it back afterwards.

microbenchmark(
  in_df = {
    df <- unclass(df)
    in_df = for (i in 1:100) {
      df$var[i] = 1
    }
    df <- list2DF(df)
  }
)

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 in_df 5.575034 5.938803 7.471679 5.988667 7.090439 16.38276   100
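As a sanity check on the workaround above, the following sketch confirms that the round trip through a plain list gives back a data.frame with the modified column. The names df_small, lst_small and df_back are illustrative, the frame is deliberately small, and list2DF() is assumed to be available (it was added in R 4.0.0):

```r
# Small illustrative data frame (hypothetical names, not from the answer)
df_small <- data.frame(x = numeric(1000))
df_small$var <- 0

# Drop the class: `$<-` on a plain list does not call the data.frame method
lst_small <- unclass(df_small)
for (i in 1:100) {
  lst_small$var[i] <- 1
}

# Rebuild the data.frame from the list of equal-length columns
df_back <- list2DF(lst_small)

print(class(df_back))
print(dim(df_back))
print(sum(df_back$var))
```

One thing to keep in mind: as far as I can tell, list2DF() generates default row names, so a data.frame with custom row names would need them restored after the round trip.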

Or, if all your data is of the same type, store it in a matrix instead of a data.frame.

Answer 2

Score: 0

Your first for loop consists of two steps. For each i:

  • You first subset df (df$var)
  • You then change a value in this vector (df$var)[i] = 1

These steps are performed 100 times.

In your second for loop there is only one operation in the for loop:

  • You change a value in the vector v_var (v_var[i] = 1)

You also do some operations outside the for loop, but because these are done once rather than 100 times, they have a negligible impact on the total time.
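One rough way to picture the steps in the first loop is to write them out by hand. This is only a sketch of the idea: R evaluates the nested replacement df$var[i] = 1 with its own internal temporary, and the names df_a, df_b and tmp are illustrative. Note that the hand-written version also has to write the column back into the data frame on every iteration:

```r
# Two identical small data frames (hypothetical names, for illustration only)
df_a <- data.frame(var = numeric(10))
df_b <- df_a

# Nested replacement, as in the first loop of the question
for (i in 1:3) {
  df_a$var[i] <- 1
}

# The same work spelled out step by step
for (i in 1:3) {
  tmp <- df_b$var    # extract the column from the data frame
  tmp[i] <- 1        # change one element of the extracted vector
  df_b$var <- tmp    # write the whole column back into the data frame
}

print(identical(df_a, df_b))
```

Both loops end with the same data frame; the second form just makes visible the per-iteration extraction and write-back that the vector-only loop avoids.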

I used the microbenchmark package to demonstrate this:

library(microbenchmark)

df = data.frame(matrix(0, nrow = 23100, ncol = 1))

# iterating on data frame
df$var = 0
v_var = df$var

microbenchmark(
  in_df = for (i in 1:100) {
    df$var[i] = 1
  },
  outside_df = {
    v_var = df$var
    for (i in 1:100) {
      v_var[i] = 1
    }
    df$var = v_var
  },
  access_df = for (i in 1:100) {
    df$var
  }
)
#> Unit: milliseconds
#>        expr       min        lq     mean    median       uq       max neval
#>       in_df 21.507375 23.080368 34.24766 26.196194 30.23929 239.06351   100
#>  outside_df 11.200713 12.008956 14.52469 13.312613 14.74701  74.32721   100
#>   access_df  8.769038  9.204913 10.35766  9.983931 10.78568  17.01080   100

We measure three things here:

  • in_df is your first for loop.
  • outside_df is your second for loop.
  • access_df is the time taken to access df$var 100 times.

Note that ignoring some outliers, the time taken by in_df is approximately the time

huangapple
  • Posted on April 13, 2023 at 15:52:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76002952.html