I have two vectors of indices (rows, columns) for a data.frame with corresponding values. What's the fastest way to build the new data.frame?

Question

I have a data frame (df1) with three columns. Two of them hold the row and column indices into a new data.frame (df2), and the third holds the values to be placed in df2. I'm sorry if I have overlooked an existing answer.

df1 <- data.frame(
  row = c(1,3,1,2,3),
  col = c(2,2,1,2,1),
  value = c(1:5)
)

As output I want the following:

| Column 1 | Column 2 |
| -------- | -------- |
| 3        | 1        |
| NA       | 4        |
| 5        | 2        |

Is there a way of achieving this without having to iterate through the table (or apply)?
I want to speed up my functions so a fast way would be very helpful. Thanks in advance!

The following does not work as desired:

df2 <- data.frame(matrix( nrow = 3, ncol = 2 ))

df2[df1$row, df1$col] <- df1$value

Answer 1

Score: 6

You have to provide the row and column indices as a matrix. cbind can combine the two vectors into a two-column matrix in which each row is one (row, column) pair.

df2 <- data.frame(matrix( nrow = 3, ncol = 2 ))
df2[cbind(df1$row, df1$col)] <- df1$value
df2
#  X1 X2
#1  3  1
#2 NA  4
#3  5  2
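
The difference from the attempt in the question can be seen by reading cells instead of writing them. The following is a minimal sketch in base R; the names `idx`, `m`, and `paired` are only illustrative. With two separate index vectors R addresses the whole cross product of the listed rows and columns, whereas a two-column matrix names exactly one cell per row.

```r
df1 <- data.frame(
  row   = c(1, 3, 1, 2, 3),
  col   = c(2, 2, 1, 2, 1),
  value = 1:5
)

# Each row of idx is one (row, column) pair.
idx <- cbind(df1$row, df1$col)

m <- matrix(1:6, nrow = 3, ncol = 2)

# Two index vectors: every listed row is paired with every listed column.
dim(m[df1$row, df1$col])   # 5 5

# One index matrix: exactly one element per (row, column) pair.
length(m[idx])             # 5

# The same matrix index used for assignment fills only the five named cells.
paired <- matrix(NA_integer_, nrow = 3, ncol = 2)
paired[idx] <- df1$value
paired
```

Cell (2, 1) is never named in `idx`, so it keeps its initial NA.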

Or better (faster, with less memory consumption): fill a matrix and convert it to a data.frame afterwards.

df2 <- matrix(NA_integer_, 3, 2)
df2[cbind(df1$row, df1$col)] <- df1$value
df2 <- as.data.frame(df2)

Or do it in one step.

df2 <- as.data.frame(`[<-`(matrix(NA, 3, 2), cbind(df1$row, df1$col), df1$value))
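
The backquoted `[<-` in the one-liner is the replacement function that `x[i] <- value` calls behind the scenes. Called directly, it returns the modified copy as a value instead of assigning it. A small sketch with made-up names (`a`, `b`, `idx`) to show that the two forms agree:

```r
idx <- cbind(c(1, 3), c(2, 1))   # cells (1, 2) and (3, 1)

# Usual form: create the matrix, then assign into it.
a <- matrix(NA_integer_, 3, 2)
a[idx] <- c(10L, 20L)

# Functional form: same replacement, result returned as a value.
b <- `[<-`(matrix(NA_integer_, 3, 2), idx, c(10L, 20L))

identical(a, b)   # TRUE
```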

Or calculate each cell's linear (column-major) position in the underlying vector.

df2 <- matrix(NA_integer_, 3, 2)
df2[df1$row + (df1$col - 1L) * 3L] <- df1$value
df2 <- as.data.frame(df2)
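
The arithmetic works because R stores a matrix column by column, so in a matrix with `nr` rows the cell (i, j) sits at position `i + (j - 1) * nr` of the underlying vector. The sketch below wraps this in a small helper. The name `fill_by_index` and its defaults (dimensions taken from the largest index) are assumptions for illustration, not part of the answer.

```r
# Hypothetical helper: build a data.frame from row/column indices and values.
fill_by_index <- function(rows, cols, values,
                          nr = max(rows), nc = max(cols)) {
  # An NA of the same type as `values`, recycled to fill the matrix.
  out <- matrix(values[NA_integer_], nrow = nr, ncol = nc)
  # Column-major position of cell (i, j) is i + (j - 1) * nr.
  out[rows + (cols - 1) * nr] <- values
  as.data.frame(out)
}

df1 <- data.frame(
  row   = c(1, 3, 1, 2, 3),
  col   = c(2, 2, 1, 2, 1),
  value = 1:5
)
df2 <- fill_by_index(df1$row, df1$col, df1$value)
df2
```

Passing `nr` and `nc` explicitly is the safer choice when trailing rows or columns may be empty, because the defaults can only see indices that actually occur. A repeated (row, column) pair keeps only one of its values.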

Benchmark

set.seed(0)
NR <- 10000L
NC <- 100L
df1 <- cbind(expand.grid(row=seq_len(NR), col=seq_len(NC)), value=
            sample(0:9, NR*NC, TRUE))[sample(NR*NC, floor(0.9 * NR * NC)),]

library(tidyverse)
library(data.table)
library(Matrix)
bench::mark(check=FALSE,
cbindTwoStep = {df2 <- data.frame(matrix( nrow = NR, ncol = NC ))
    df2[cbind(df1$row, df1$col)] <- df1$value} ,
cbindMatTwoStep = {df2 <- matrix(NA_integer_, NR, NC)
  df2[cbind(df1$row, df1$col)] <- df1$value
  df2 <- as.data.frame(df2)} ,
cbindMatDF = df2 <- data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
cbindMatADF = df2 <- as.data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
vecIndex = {df2 <- matrix(NA_integer_, NR, NC)
  df2[df1$row + (df1$col - 1L) * NR] <- df1$value
  df2 <- as.data.frame(df2)},
pivot = {df2 <- df1 |>
  arrange(row, col) |>
  pivot_wider(names_from = col, values_from = value) |>
  select(!row)},
data.table = df2 <- dcast(as.data.table(df1), row ~ paste("Column", col))[, -1],
sparseMatrix = df2 <- as.matrix(sparseMatrix(i = df1[, 1], j = df1[, 2], x = df1[, 3])),
xtabs = df2 <- as.data.frame.matrix(xtabs(value ~ ., df1)) )

Result

  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 cbindTwoStep     59.99ms 132.42ms     6.41     79.2MB    11.2      4     7
2 cbindMatTwoStep    8.6ms   9.48ms    92.5      21.8MB    21.6     47    11
3 cbindMatDF        8.63ms   9.51ms    61.2      21.8MB    13.8     31     7
4 cbindMatADF       8.24ms    9.4ms    95.7      21.8MB    19.9     48    10
5 vecIndex          8.12ms      9ms    97.7      14.9MB    19.9     49    10
6 pivot            97.59ms  99.75ms     9.99     92.6MB    12.0      5     6
7 data.table      171.63ms 172.34ms     5.41     88.4MB     5.41     3     3
8 sparseMatrix     23.73ms  30.18ms    17.6      38.5MB     7.82     9     4
9 xtabs               1.1s     1.1s     0.909   269.9MB     5.46     1     6

Filling the values into a matrix via cbind indexing and converting it to a data.frame afterwards is far faster than filling a data.frame directly (a median of about 9.5 ms versus 132 ms in this run). Calculating the linear index is slightly faster still and allocates the least memory (14.9 MB).

Answer 2

Score: 4

You can also use xtabs. Note that the empty cell is reported as 0 rather than NA:

xtabs(value ~ row + col, data = df1)

   col
row 1 2
  1 3 1
  2 0 4
  3 5 2
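
If the result should be a data.frame with NA in the empty cell, as the question asks, one possible follow-up is sketched below. It is an illustration rather than part of the answer, and it assumes that no (row, col) pair is repeated, since xtabs sums the values of duplicates. The names `tab` and `present` are illustrative.

```r
df1 <- data.frame(
  row   = c(1, 3, 1, 2, 3),
  col   = c(2, 2, 1, 2, 1),
  value = 1:5
)

# Sum of `value` per (row, col) cell; cells without data come out as 0.
tab <- xtabs(value ~ row + col, data = df1)

# Count of observations per cell, flattened in the same column-major order.
present <- as.vector(xtabs(~ row + col, data = df1)) > 0

# Mark cells that had no observation, then convert the table to a data.frame.
tab[!present] <- NA
df2 <- as.data.frame.matrix(tab)
df2
```

Using the counts rather than testing `tab == 0` keeps genuine zero values distinct from cells that were never filled.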

Answer 3

Score: 3

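Probably not the fastest solution, but perhaps useful if you like tidyverse functions. The idea is to pivot the col column into actual columns. The rows come out in order of first appearance, so after arrange(col) the output lists row 3 before row 2; sorting with arrange(row, col) instead gives the row order requested in the question.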

library(tidyverse)

df1 <- tibble(
  row = c(1,3,1,2,3),
  col = c(2,2,1,2,1),
  value = c(1:5)
)

df1 |>
  arrange(col) |>
  pivot_wider(names_from = col, values_from = value) |>
  select(!row)
#> # A tibble: 3 × 2
#>     `1`   `2`
#>   <int> <int>
#> 1     3     1
#> 2     5     2
#> 3    NA     4

<sup>Created on 2023-04-06 with reprex v2.0.2</sup>

Answer 4

Score: 2

Using data.table:

library(data.table)
setDT(df1)

dcast(df1, row ~ paste("Column", col))[, -1]

#    Column 1 Column 2
# 1:        3        1
# 2:       NA        4
# 3:        5        2

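sparseMatrix approach

Your data looks like a sparse matrix stored in data.frame format, so depending on your situation you could consider this option too. Note that cells without an entry come back as 0 rather than NA.

library(Matrix)
# $ access returns plain vectors for a data.frame and for a data.table alike
# (after setDT above, df1[, 1] would be a one-column data.table).
as.matrix(sparseMatrix(i = df1$row, j = df1$col, x = df1$value))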

     [,1] [,2]
[1,]    3    1
[2,]    0    4
[3,]    5    2

Posted by huangapple on 2023-04-06 21:19:43
When reposting, please keep the link to this article: https://go.coder-hub.com/75950008.html