I have two vectors of indices (rows, columns) for a data.frame, with corresponding values. What's the fastest way to build the new data.frame?
Question
I have a data frame with 3 columns (df1). Two of them hold the row and column indices of a new data.frame (df2), and the remaining column contains the values that should be placed in df2. I'm sorry if I have overlooked an existing answer.
df1 <- data.frame(
row = c(1,3,1,2,3),
col = c(2,2,1,2,1),
value = c(1:5)
)
As output I want the following:
| Column 1 | Column 2 |
| -------- | -------- |
| 3 | 1 |
| NA | 4 |
| 5 | 2 |
Is there a way of achieving this without having to iterate through the table (or apply)?
I want to speed up my functions so a fast way would be very helpful. Thanks in advance!
The following does not work as desired:
df2 <- data.frame(matrix( nrow = 3, ncol = 2 ))
df2[df1$row, df1$col] <- df1$value
Answer 1
Score: 6
You have to provide the row and column indices as a two-column matrix. cbind can build that index matrix from the two vectors.
df2 <- data.frame(matrix( nrow = 3, ncol = 2 ))
df2[cbind(df1$row, df1$col)] <- df1$value
df2
# X1 X2
#1 3 1
#2 NA 4
#3 5 2
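To see why the attempt in the question fails while the matrix index works, it can help to compare the two forms of indexing on a plain matrix. The following is only a minimal base-R sketch; the names rows, cols, vals, m_cross and m_pairs are illustrative and not part of the original answer.

```r
rows <- c(1, 3, 1, 2, 3)
cols <- c(2, 2, 1, 2, 1)
vals <- 1:5

# Two separate index vectors address every combination of rows x cols,
# so this selects a 5 x 5 block of (repeated) cells, not five single cells.
m_cross <- matrix(NA_integer_, 3, 2)
dim(m_cross[rows, cols])   # 5 5

# A two-column matrix is read row by row as (row, col) pairs,
# so exactly one cell is addressed per value.
m_pairs <- matrix(NA_integer_, 3, 2)
m_pairs[cbind(rows, cols)] <- vals
m_pairs
```

With the pair form, each value lands in the cell named by its own (row, col) pair, which is the behaviour the question asks for.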
Better (faster and with less memory consumption): fill a matrix first and convert it to a data.frame afterwards.
df2 <- matrix(NA_integer_, 3, 2)
df2[cbind(df1$row, df1$col)] <- df1$value
df2 <- as.data.frame(df2)
Or do it in one step.
df2 <- as.data.frame(`[<-`(matrix(NA, 3, 2), cbind(df1$row, df1$col), df1$value))
Or compute the linear (column-major) position of each cell in the matrix's underlying vector.
df2 <- matrix(NA_integer_, 3, 2)
df2[df1$row + (df1$col - 1L) * 3L] <- df1$value
df2 <- as.data.frame(df2)
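The linear index works because R stores matrices column by column, so cell (i, j) of a matrix with nr rows sits at position i + (j - 1) * nr. A small sketch, with illustrative variable names (nr, nc, lin, by_pairs, by_linear), that checks both ways of indexing fill the same cells:

```r
nr <- 3L
nc <- 2L
rows <- c(1L, 3L, 1L, 2L, 3L)
cols <- c(2L, 2L, 1L, 2L, 1L)
vals <- 1:5

# Column-major position of every (row, col) pair
lin <- rows + (cols - 1L) * nr   # 4 6 1 5 3

# Fill once via (row, col) pairs ...
by_pairs <- matrix(NA_integer_, nr, nc)
by_pairs[cbind(rows, cols)] <- vals

# ... and once via the precomputed linear positions
by_linear <- matrix(NA_integer_, nr, nc)
by_linear[lin] <- vals

identical(by_pairs, by_linear)   # TRUE
```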
Benchmark
set.seed(0)
NR <- 10000L
NC <- 100L
df1 <- cbind(expand.grid(row=seq_len(NR), col=seq_len(NC)), value=
  sample(0:9, NR*NC, TRUE))[sample(NR*NC, floor(0.9 * NR * NC)),]
library(tidyverse)
library(data.table)
library(Matrix)
bench::mark(check=FALSE,
  cbindTwoStep = {df2 <- data.frame(matrix( nrow = NR, ncol = NC ))
    df2[cbind(df1$row, df1$col)] <- df1$value},
  cbindMatTwoStep = {df2 <- matrix(NA_integer_, NR, NC)
    df2[cbind(df1$row, df1$col)] <- df1$value
    df2 <- as.data.frame(df2)},
  cbindMatDF = df2 <- data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
  cbindMatADF = df2 <- as.data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
  vecIndex = {df2 <- matrix(NA_integer_, NR, NC)
    df2[df1$row + (df1$col - 1L) * NR] <- df1$value
    df2 <- as.data.frame(df2)},
  pivot = {df2 <- df1 |>
    arrange(row, col) |>
    pivot_wider(names_from = col, values_from = value) |>
    select(!row)},
  data.table = df2 <- dcast(as.data.table(df1), row ~ paste("Column", col))[, -1],
  sparseMatrix = df2 <- as.matrix(sparseMatrix(i = df1[, 1], j = df1[, 2], x = df1[, 3])),
  xtabs = df2 <- as.data.frame.matrix(xtabs(value ~ ., df1)) )
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 cbindTwoStep 59.99ms 132.42ms 6.41 79.2MB 11.2 4 7
2 cbindMatTwoStep 8.6ms 9.48ms 92.5 21.8MB 21.6 47 11
3 cbindMatDF 8.63ms 9.51ms 61.2 21.8MB 13.8 31 7
4 cbindMatADF 8.24ms 9.4ms 95.7 21.8MB 19.9 48 10
5 vecIndex 8.12ms 9ms 97.7 14.9MB 19.9 49 10
6 pivot 97.59ms 99.75ms 9.99 92.6MB 12.0 5 6
7 data.table 171.63ms 172.34ms 5.41 88.4MB 5.41 3 3
8 sparseMatrix 23.73ms 30.18ms 17.6 38.5MB 7.82 9 4
9 xtabs 1.1s 1.1s 0.909 269.9MB 5.46 1 6
Filling the matrix using cbind and converting it to a data.frame afterwards performs better than filling a data.frame directly. Computing the linear index into the vector is slightly faster still and allocates the least additional memory.
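For reuse, the linear-index variant could be wrapped in a small function. The sketch below is only an assumption about how such a wrapper might look: triplets_to_df is a hypothetical helper name, and the defaults infer the dimensions from the largest row and column index, which may not match the size you actually want.

```r
# Hypothetical helper (not from the answer): builds the data.frame from
# (row, col, value) triplets using the linear-index approach.
triplets_to_df <- function(row, col, value,
                           nrow = max(row), ncol = max(col)) {
  stopifnot(length(row) == length(col), length(row) == length(value))
  # value[NA_integer_] yields a single NA of the same type as value
  m <- matrix(value[NA_integer_], nrow, ncol)
  # column-major position of each (row, col) pair
  m[row + (col - 1) * nrow] <- value
  as.data.frame(m)
}

out <- triplets_to_df(c(1, 3, 1, 2, 3), c(2, 2, 1, 2, 1), 1:5)
out
```

Passing nrow and ncol explicitly is the safer choice when trailing rows or columns may be completely empty.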
Answer 2
Score: 4
You can also use xtabs (note that cells without an entry are filled with 0 rather than NA), i.e.:
xtabs(value ~ row + col, data = df1)
col
row 1 2
1 3 1
2 0 4
3 5 2
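If the empty cells should be NA as in the desired output, one possible workaround is to count the entries per cell with a second xtabs call and blank out the cells that have none. This is a sketch under that assumption, not part of the original answer; sums, counts and res are illustrative names.

```r
df1 <- data.frame(row = c(1, 3, 1, 2, 3), col = c(2, 2, 1, 2, 1), value = 1:5)

sums   <- xtabs(value ~ row + col, data = df1)  # 0 where no entry exists
counts <- xtabs(~ row + col, data = df1)        # number of entries per cell

sums[counts == 0] <- NA                         # mark absent cells as missing
res <- as.data.frame.matrix(sums)
res
```

This keeps genuine zeros in value distinguishable from cells that were never set.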
Answer 3
Score: 3
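Probably not the fastest solution, but perhaps useful if you like Tidyverse functions. The idea is to pivot the col column into actual columns. Note that arrange(col) only fixes the column order; the rows appear in order of first occurrence (1, 3, 2 here), so sort with arrange(row, col) if the row order matters: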
library(tidyverse)
df1 <- tibble(
row = c(1,3,1,2,3),
col = c(2,2,1,2,1),
value = c(1:5)
)
df1 |>
arrange(col) |>
pivot_wider(names_from = col, values_from = value) |>
select(!row)
#> # A tibble: 3 × 2
#> `1` `2`
#> <int> <int>
#> 1 3 1
#> 2 5 2
#> 3 NA 4
<sup>Created on 2023-04-06 with reprex v2.0.2</sup>
Answer 4
Score: 2
Using data.table:
library(data.table)
setDT(df1)
dcast(df1, row ~ paste("Column", col))[, -1]
# Column 1 Column 2
# 1: 3 1
# 2: NA 4
# 3: 5 2
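Using the sparseMatrix approach:
Note that your data looks like a sparse matrix stored in data.frame format. Depending on your situation, you could consider this as an option too.
library(Matrix)
# Columns are accessed by name so this also works after setDT(df1),
# where df1[, 1] would return a data.table instead of a vector.
as.matrix(sparseMatrix(i = df1$row, j = df1$col, x = df1$value))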
[,1] [,2]
[1,] 3 1
[2,] 0 4
[3,] 5 2