I have two vectors of indices (rows, columns) for a data.frame with corresponding values. What's the fastest way to build the new data.frame?


Question


I have a data frame with three columns (df1). Two of them hold the row and column indices of a new data.frame (df2), and the third holds the values to be placed in df2. I'm sorry if I have overlooked an existing answer.

    df1 <- data.frame(
      row = c(1, 3, 1, 2, 3),
      col = c(2, 2, 1, 2, 1),
      value = 1:5
    )

As output I want the following:

| Column 1 | Column 2 |
| -------- | -------- |
| 3 | 1 |
| NA | 4 |
| 5 | 2 |

Is there a way of achieving this without iterating over the table (or using apply)?
I want to speed up my functions, so a fast approach would be very helpful. Thanks in advance!

The following does not work as desired:

    df2 <- data.frame(matrix(nrow = 3, ncol = 2))
    df2[df1$row, df1$col] <- df1$value

Answer 1

Score: 6


You have to provide the row and column indices as a matrix. cbind can combine the two vectors into a two-column matrix of (row, column) pairs.

    df2 <- data.frame(matrix(nrow = 3, ncol = 2))
    df2[cbind(df1$row, df1$col)] <- df1$value
    df2
    #   X1 X2
    # 1  3  1
    # 2 NA  4
    # 3  5  2
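To illustrate why the matrix form matters, here is a minimal sketch in base R, shown on a plain matrix for simplicity. Two separate index vectors address the cross product of the given rows and columns, while a two-column matrix addresses exactly one cell per index row. The names `m_cross`, `m_pair` and `idx` are only illustrative.

```r
df1 <- data.frame(
  row   = c(1, 3, 1, 2, 3),
  col   = c(2, 2, 1, 2, 1),
  value = 1:5
)

# Two index vectors select every combination of the given rows and columns.
m_cross <- matrix(NA_integer_, nrow = 3, ncol = 2)
m_cross[c(1, 3), c(1, 2)] <- 0L   # touches (1,1), (1,2), (3,1) and (3,2)
sum(!is.na(m_cross))
# 4

# A two-column matrix selects one cell per row of the index matrix.
idx <- cbind(df1$row, df1$col)
m_pair <- matrix(NA_integer_, nrow = 3, ncol = 2)
m_pair[idx] <- df1$value
m_pair
#      [,1] [,2]
# [1,]    3    1
# [2,]   NA    4
# [3,]    5    2
```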

Or better (faster, with less memory consumption): fill a matrix first and convert it to a data.frame afterwards.

    df2 <- matrix(NA_integer_, 3, 2)
    df2[cbind(df1$row, df1$col)] <- df1$value
    df2 <- as.data.frame(df2)

Or do it in one step.

    df2 <- as.data.frame(`[<-`(matrix(NA, 3, 2), cbind(df1$row, df1$col), df1$value))

Or compute each cell's position in the underlying vector (a matrix is stored column by column, so cell (row, col) sits at position row + (col - 1) * nrow).

    df2 <- matrix(NA_integer_, 3, 2)
    df2[df1$row + (df1$col - 1L) * 3L] <- df1$value
    df2 <- as.data.frame(df2)
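As a small worked example of the vector-index arithmetic, the sketch below computes the positions for the five example pairs and checks that they address the same cells as the cbind form. The variable names (`nr`, `rows`, `cols`, `lin`) are illustrative only. One caveat to keep in mind: with integer inputs the product is computed in integer arithmetic, which in R overflows to NA (with a warning) past `.Machine$integer.max`, so very large matrices would need double arithmetic instead.

```r
nr   <- 3L                          # number of rows of the target matrix
rows <- c(1L, 3L, 1L, 2L, 3L)
cols <- c(2L, 2L, 1L, 2L, 1L)

# Column-major position of each (row, col) pair.
lin <- rows + (cols - 1L) * nr
lin
# 4 6 1 5 3

# Fill once by linear index and once by index matrix.
m_lin <- matrix(NA_integer_, nrow = nr, ncol = 2L)
m_lin[lin] <- 1:5

m_mat <- matrix(NA_integer_, nrow = nr, ncol = 2L)
m_mat[cbind(rows, cols)] <- 1:5

identical(m_lin, m_mat)
# TRUE
```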

Benchmark

    set.seed(0)
    NR <- 10000L
    NC <- 100L
    df1 <- cbind(expand.grid(row = seq_len(NR), col = seq_len(NC)),
                 value = sample(0:9, NR * NC, TRUE))[sample(NR * NC, floor(0.9 * NR * NC)), ]

    library(tidyverse)
    library(data.table)
    library(Matrix)

    bench::mark(check = FALSE,
      cbindTwoStep = {
        df2 <- data.frame(matrix(nrow = NR, ncol = NC))
        df2[cbind(df1$row, df1$col)] <- df1$value
      },
      cbindMatTwoStep = {
        df2 <- matrix(NA_integer_, NR, NC)
        df2[cbind(df1$row, df1$col)] <- df1$value
        df2 <- as.data.frame(df2)
      },
      cbindMatDF = df2 <- data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
      cbindMatADF = df2 <- as.data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
      vecIndex = {
        df2 <- matrix(NA_integer_, NR, NC)
        df2[df1$row + (df1$col - 1L) * NR] <- df1$value
        df2 <- as.data.frame(df2)
      },
      pivot = {
        df2 <- df1 |>
          arrange(row, col) |>
          pivot_wider(names_from = col, values_from = value) |>
          select(!row)
      },
      data.table = df2 <- dcast(as.data.table(df1), row ~ paste("Column", col))[, -1],
      sparseMatrix = df2 <- as.matrix(sparseMatrix(i = df1[, 1], j = df1[, 2], x = df1[, 3])),
      xtabs = df2 <- as.data.frame.matrix(xtabs(value ~ ., df1))
    )

Result

      expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
      <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
    1 cbindTwoStep     59.99ms 132.42ms     6.41     79.2MB    11.2      4     7
    2 cbindMatTwoStep    8.6ms   9.48ms    92.5      21.8MB    21.6     47    11
    3 cbindMatDF        8.63ms   9.51ms    61.2      21.8MB    13.8     31     7
    4 cbindMatADF       8.24ms    9.4ms    95.7      21.8MB    19.9     48    10
    5 vecIndex          8.12ms      9ms    97.7      14.9MB    19.9     49    10
    6 pivot            97.59ms  99.75ms     9.99     92.6MB    12.0      5     6
    7 data.table      171.63ms 172.34ms     5.41     88.4MB     5.41     3     3
    8 sparseMatrix     23.73ms  30.18ms    17.6      38.5MB     7.82     9     4
    9 xtabs               1.1s     1.1s     0.909   269.9MB     5.46     1     6

Filling a matrix via cbind indexing and converting it to a data.frame afterwards is much faster than filling a data.frame directly (a median of about 9.5 ms versus 132 ms in this run). Computing the vector index is slightly faster still and allocates the least memory (14.9 MB).
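For reuse, the vector-index variant could be wrapped in a small function. This is only a sketch under stated assumptions: `fill_from_triplets` is a hypothetical name, the indices are assumed to be positive whole numbers, and taking the dimensions from `max(row)` and `max(col)` by default is my assumption, not part of the answer.

```r
# Hypothetical helper: build a data.frame from (row, col, value) triplets.
# Assumes row and col are positive whole numbers.
fill_from_triplets <- function(row, col, value,
                               nrow = max(row), ncol = max(col)) {
  stopifnot(length(row) == length(col), length(row) == length(value))
  # value[NA_integer_] yields a single NA of the same type as value,
  # so the matrix starts out with the right storage mode.
  m <- matrix(value[NA_integer_], nrow = nrow, ncol = ncol)
  # (col - 1) uses a double constant, so the product is computed in
  # double arithmetic and does not overflow for large matrices.
  m[row + (col - 1) * nrow] <- value
  as.data.frame(m)
}

df1 <- data.frame(
  row   = c(1, 3, 1, 2, 3),
  col   = c(2, 2, 1, 2, 1),
  value = 1:5
)

df2 <- fill_from_triplets(df1$row, df1$col, df1$value)
df2
#   V1 V2
# 1  3  1
# 2 NA  4
# 3  5  2
```

Passing `nrow` and `ncol` explicitly is useful when trailing rows or columns have no entries at all and would otherwise be dropped.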

Answer 2

Score: 4


You can also use xtabs:

    xtabs(value ~ row + col, data = df1)
       col
    row 1 2
      1 3 1
      2 0 4
      3 5 2
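Two properties of xtabs are worth keeping in mind here: it sums the values that fall into the same cell, and a cell without any entry shows up as 0 rather than NA, as in row 2 of the output above. If NA is required, one possible sketch (the names `counts`, `empty` and `df2` are illustrative) is to count the entries per cell with a second xtabs call and blank out the cells that were never addressed:

```r
df1 <- data.frame(
  row   = c(1, 3, 1, 2, 3),
  col   = c(2, 2, 1, 2, 1),
  value = 1:5
)

# Sum of values per cell; cells without an entry become 0.
tab <- xtabs(value ~ row + col, data = df1)
df2 <- as.data.frame.matrix(tab)

# Number of entries per cell; 0 marks cells that were never addressed.
counts <- xtabs(~ row + col, data = df1)
empty  <- as.data.frame.matrix(counts) == 0

# Replace the never-addressed cells by NA.
df2[empty] <- NA
df2
```

This distinguishes a genuinely missing cell from one whose values happen to sum to zero, at the cost of a second tabulation.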

Answer 3

Score: 3


Probably not the fastest solution, but perhaps useful if you like Tidyverse functions. The idea is to pivot the col column into actual columns. Note that the rows below come out in the order 1, 3, 2, because arrange(col) sorts by column only; the benchmark above uses arrange(row, col), which keeps them in row order.

    library(tidyverse)
    df1 <- tibble(
      row = c(1, 3, 1, 2, 3),
      col = c(2, 2, 1, 2, 1),
      value = c(1:5)
    )
    df1 |>
      arrange(col) |>
      pivot_wider(names_from = col, values_from = value) |>
      select(!row)
    #> # A tibble: 3 × 2
    #>     `1`   `2`
    #>   <int> <int>
    #> 1     3     1
    #> 2     5     2
    #> 3    NA     4

<sup>Created on 2023-04-06 with reprex v2.0.2</sup>

Answer 4

Score: 2


Using data.table

    library(data.table)
    setDT(df1)
    dcast(df1, row ~ paste("Column", col))[, -1]
    #    Column 1 Column 2
    # 1:        3        1
    # 2:       NA        4
    # 3:        5        2

sparseMatrix approach

Note that your data looks like a sparse matrix in data.frame format. Depending on your situation, you could consider this as an option too. Empty cells then come back as 0 rather than NA.

    library(Matrix)
    # $ access works whether df1 is a data.frame or (after setDT) a data.table
    as.matrix(sparseMatrix(i = df1$row, j = df1$col, x = df1$value))
         [,1] [,2]
    [1,]    3    1
    [2,]    0    4
    [3,]    5    2

huangapple
  • Posted on 2023-04-06 21:19:43
  • Please keep this link when reposting: https://go.coder-hub.com/75950008.html