Why does it take longer to iterate over the column of an R data frame than to iterate over an equivalent vector?
Question
I work with a large data frame in R (containing 2,310,000 rows).
I found that a loop that iterates directly over the elements of a data frame column can be very slow. I compared this to iterating over the elements of a vector of equivalent size:
df = data.frame(matrix(0, nrow = 2310000, ncol = 1))
t0 = Sys.time()
# iterating on data frame
df$var = 0
for (i in 1:100) {
  df$var[i] = 1
}
t1 = Sys.time()
# iterating on vector
df$var = 0
v_var = df$var
for (i in 1:100) {
  v_var[i] = 1
}
df$var = v_var
t2 = Sys.time()
print(t1 - t0) ; print(t2 - t1)
Output:
> Time difference of 0.1035166 secs
> Time difference of 0.0075109 secs
Can someone explain to me why iterating over the elements of a large data frame is slower?
Thanks in advance.
Answer 1
Score: 1
First, to set the stage and show that more than extraction from a data.frame is happening here, let's compare the speed of this operation across different object types:
library(microbenchmark)
df = data.frame(matrix(0, nrow = 2310000, ncol = 1))
df$var = 0
lst = as.list(df)
mat = data.matrix(df)
microbenchmark(
  in_df = for (i in 1:100) {
    df$var[i] = 1
  },
  in_mat = for (i in 1:100) {
    mat[i, "var"] = 1
  },
  in_lst = for (i in 1:100) {
    lst$var[i] = 1
  }
)
Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
  in_df 223.218706 248.705949 293.525415 274.549170 336.379693 394.847852   100
 in_mat   1.304793   1.639721   1.932387   1.841773   2.086342   3.612089   100
 in_lst   1.571596   1.705783   2.076625   2.005105   2.316363   5.913198   100
As you can see, the matrix and the list had no problems, even though a data.frame is also a list under the hood. So what is going on here? The answer can be found in the help page for the data.frame dollar replacement function:
help(`$<-.data.frame`)
[...]
There is no ‘data.frame’ method for ‘$’, so ‘x$name’ uses the
default method which treats ‘x’ as a list (with partial matching
of column names if the match is unique, see ‘Extract’). The
replacement method (for ‘$’) checks ‘value’ for the correct number
of rows, and replicates it if necessary.
[...]
This tells us that, unlike with a plain list, assigning to a column of a data.frame also has to make sure that the length of the assigned value matches the number of rows in the data.frame. This is because a data.frame is a special kind of list: one in which all elements have the same length, so that it can be arranged in a table format.
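The row-count check can be seen in a small sketch. The toy objects below (`df`, `lst`, and their columns) are illustrative assumptions, not taken from the question. The behaviour shown follows from the method's documented contract: a shorter value is recycled only when its length divides the number of rows, otherwise an error is signalled, while a plain list accepts any length.

```r
# Toy data frame with 4 rows (hypothetical example data).
df <- data.frame(x = 1:4)

# Length 2 divides the 4 rows, so the value is recycled to fill the column.
df$y <- c(10, 20)

# Length 3 does not divide the 4 rows, so `$<-.data.frame` signals an error.
result <- tryCatch({
  df$z <- 1:3
  "assigned"
}, error = function(e) "error")

# A plain list performs no such check and accepts any length.
lst <- list(x = 1:4)
lst$z <- 1:3
```

Here `result` records whether the mismatched assignment went through, which makes the difference between the data.frame and the list easy to inspect.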
You can look at everything the method for data.frame does by inspecting this code:
> `$<-.data.frame`
function (x, name, value)
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows)
            stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
                "replacement has %d rows, data has %d"), N, nrows),
                domain = NA)
        if (N < nrows)
            if (N > 0L && (nrows%%N == 0L) && length(dim(value)) <=
                1L)
                value <- rep(value, length.out = nrows)
            else stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
                "replacement has %d rows, data has %d"), N, nrows),
                domain = NA)
        if (is.atomic(value) && !is.null(names(value)))
            names(value) <- NULL
    }
    x[[name]] <- value
    class(x) <- cl
    return(x)
}
<bytecode: 0x7fd0667fce80>
<environment: namespace:base>
One way to bypass this is to temporarily transform your data.frame into a list, and then transform it back:
microbenchmark(
  in_df = {
    df <- unclass(df)
    for (i in 1:100) {
      df$var[i] = 1
    }
    df <- list2DF(df)
  }
)
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 in_df 5.575034 5.938803 7.471679 5.988667 7.090439 16.38276   100
Or, if all your data is of the same type, store it in a matrix instead of a data.frame.
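As a quick sanity check of the round trip, the sketch below applies the same unclass/list2DF pattern to a toy data frame. The names `cols` and `df2` and the example data are illustrative assumptions, and `list2DF()` requires R 4.0.0 or later.

```r
# Toy data frame (hypothetical example data).
df <- data.frame(id = 1:5)
df$var <- 0

# Drop the data.frame class so `$<-` uses the plain list method.
cols <- unclass(df)
for (i in 1:3) {
  cols$var[i] <- 1
}

# Rebuild a data frame from the modified list of columns.
df2 <- list2DF(cols)
```

The result is again a data.frame with the same column names. As far as I can tell, `list2DF()` assigns default row names, so custom row names would have to be restored separately.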
Answer 2
Score: 0
Your first for loop consists of two steps. For each i:
- You first subset df (df$var)
- You then change a value in this vector ((df$var)[i] = 1)
These steps are performed 100 times.
In your second for loop there is only one operation in the for loop:
- You change a value in the vector v_var (v_var[i] = 1)
You also do some operations outside the for loop, but because these are done once rather than 100 times, they have a negligible impact on the total time.
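To make the steps in the first loop concrete, the sketch below writes out by hand roughly what a nested replacement such as df$var[i] = 1 does, following the description of subset assignment in the R Language Definition. The toy data and the names `df1`, `df2` and `col` are illustrative assumptions. Note that besides extracting the column and changing one element, the column also has to be written back into the data frame.

```r
# Two identical toy data frames (hypothetical example data).
df1 <- data.frame(var = rep(0, 5))
df2 <- df1

# The usual nested replacement.
df1$var[2] <- 1

# Roughly the same thing, spelled out step by step:
col <- df2$var                         # extract the column from the data frame
col <- `[<-`(col, 2, value = 1)        # change one element of that vector
df2 <- `$<-`(df2, "var", value = col)  # write the column back (dispatches to the data.frame method)
```

Both routes give the same data frame, but inside a loop the extraction and write-back are repeated on every iteration, whereas the vector version pays for them only once.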
I used the microbenchmark package to demonstrate this:
library(microbenchmark)
df = data.frame(matrix(0, nrow = 23100, ncol = 1))
# iterating on data frame
df$var = 0
v_var = df$var
microbenchmark(
  in_df = for (i in 1:100) {
    df$var[i] = 1
  },
  outside_df = {
    v_var = df$var
    for (i in 1:100) {
      v_var[i] = 1
    }
    df$var = v_var
  },
  access_df = for (i in 1:100) {
    df$var
  }
)
#> Unit: milliseconds
#>        expr       min        lq     mean    median       uq       max neval
#>       in_df 21.507375 23.080368 34.24766 26.196194 30.23929 239.06351   100
#>  outside_df 11.200713 12.008956 14.52469 13.312613 14.74701  74.32721   100
#>   access_df  8.769038  9.204913 10.35766  9.983931 10.78568  17.01080   100
We measure three things here:
- in_df is your first for loop.
- outside_df is your second for loop.
- access_df is the time taken to access df$var 100 times.
Note that ignoring some outliers, the time taken by in_df is approximately the time