April 6, 2023, 21:19:43

I have two vectors of indices (rows, columns) for a data.frame with corresponding values. What's the fastest way to build the new data.frame?

Question

I have a data frame with three columns (df1). Two of them hold the row and column indices of a new data.frame (df2), and the remaining column contains the values that should be placed in df2. I'm sorry if I have overlooked the right answer.

df1 <- data.frame(
  row = c(1,3,1,2,3),
  col = c(2,2,1,2,1),
  value = c(1:5)
)

As output I want the following:

| Column 1 | Column 2 |
| -------- | -------- |
| 3        | 1        |
| NA       | 4        |
| 5        | 2        |

Is there a way of achieving this without having to iterate through the table (or apply)?
I want to speed up my functions so a fast way would be very helpful. Thanks in advance!

The following does not work as desired:

df2 <- data.frame(matrix( nrow = 3, ncol = 2 ))
df2[df1$row, df1$col] <- df1$value

Answer 1

Score: 6

You have to provide the row and column indices as a matrix. cbind can combine the two vectors into a two-column matrix of (row, column) pairs.

df2 <- data.frame(matrix( nrow = 3, ncol = 2 ))
df2[cbind(df1$row, df1$col)] <- df1$value
df2
#  X1 X2
#1  3  1
#2 NA  4
#3  5  2
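
The difference from the attempt in the question is worth spelling out: with two separate index vectors, `[` addresses every combination of the listed rows and columns, whereas a two-column matrix addresses exactly one cell per row of the index matrix. A minimal sketch on a plain matrix (the names `m_cross` and `m_pairs` are only for illustration and do not appear in the answer):

```r
df1 <- data.frame(
  row = c(1, 3, 1, 2, 3),
  col = c(2, 2, 1, 2, 1),
  value = 1:5
)

# Two index vectors select the full rows x columns block (5 x 5 cells here,
# with repeats), and the five values are recycled over that block.
m_cross <- matrix(NA_integer_, 3, 2)
m_cross[df1$row, df1$col] <- df1$value

# A two-column matrix selects exactly one cell per (row, col) pair.
m_pairs <- matrix(NA_integer_, 3, 2)
m_pairs[cbind(df1$row, df1$col)] <- df1$value

m_pairs
#      [,1] [,2]
# [1,]    3    1
# [2,]   NA    4
# [3,]    5    2
```

In `m_cross` every cell ends up filled, including cell (2, 1), which is why that form does not give the desired result.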

Or better (faster, less memory consumption): fill a matrix and convert it to a data.frame afterwards.

df2 <- matrix(NA_integer_, 3, 2)
df2[cbind(df1$row, df1$col)] <- df1$value
df2 <- as.data.frame(df2)

Or do it in one step.

df2 <- as.data.frame(`[<-`(matrix(NA, 3, 2), cbind(df1$row, df1$col), df1$value))

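Or calculate the position of each (row, column) pair in the underlying vector. A matrix is stored column by column, so the position is row + (col - 1) * number of rows.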

df2 <- matrix(NA_integer_, 3, 2)
df2[df1$row + (df1$col - 1L) * 3L] <- df1$value
df2 <- as.data.frame(df2)
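
To check that the computed positions address the same cells as the cbind index matrix, the two fills can be compared directly. This is only a sketch; `nr`, `nc`, `pos`, `by_pairs` and `by_pos` are illustrative names, and the target dimensions are assumed to be known in advance:

```r
df1 <- data.frame(row = c(1, 3, 1, 2, 3), col = c(2, 2, 1, 2, 1), value = 1:5)
nr <- 3L  # rows of the target matrix (assumed known)
nc <- 2L  # columns of the target matrix (assumed known)

# Linear position of each (row, col) pair in column-major order.
pos <- df1$row + (df1$col - 1L) * nr
pos
# [1] 4 6 1 5 3

by_pairs <- matrix(NA_integer_, nr, nc)
by_pairs[cbind(df1$row, df1$col)] <- df1$value

by_pos <- matrix(NA_integer_, nr, nc)
by_pos[pos] <- df1$value

identical(by_pairs, by_pos)
# [1] TRUE
```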

Benchmark

set.seed(0)
NR <- 10000L
NC <- 100L
df1 <- cbind(expand.grid(row=seq_len(NR), col=seq_len(NC)), value=
            sample(0:9, NR*NC, TRUE))[sample(NR*NC, floor(0.9 * NR * NC)),]
library(tidyverse)
library(data.table)
library(Matrix)
bench::mark(check=FALSE,
cbindTwoStep = {df2 <- data.frame(matrix( nrow = NR, ncol = NC ))
    df2[cbind(df1$row, df1$col)] <- df1$value} ,
cbindMatTwoStep = {df2 <- matrix(NA_integer_, NR, NC)
  df2[cbind(df1$row, df1$col)] <- df1$value
  df2 <- as.data.frame(df2)} ,
cbindMatDF = df2 <- data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
cbindMatADF = df2 <- as.data.frame(`[<-`(matrix(NA_integer_, NR, NC), cbind(df1$row, df1$col), df1$value)),
vecIndex = {df2 <- matrix(NA_integer_, NR, NC)
  df2[df1$row + (df1$col - 1L) * NR] <- df1$value
  df2 <- as.data.frame(df2)},
pivot = {df2 <- df1 |>
  arrange(row, col) |>
  pivot_wider(names_from = col, values_from = value) |>
  select(!row)},
data.table = df2 <- dcast(as.data.table(df1), row ~ paste("Column", col))[, -1],
sparseMatrix = df2 <- as.matrix(sparseMatrix(i = df1[, 1], j = df1[, 2], x = df1[, 3])),
xtabs = df2 <- as.data.frame.matrix(xtabs(value ~ ., df1)) )

Result

  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 cbindTwoStep     59.99ms 132.42ms     6.41     79.2MB    11.2      4     7
2 cbindMatTwoStep    8.6ms   9.48ms    92.5      21.8MB    21.6     47    11
3 cbindMatDF        8.63ms   9.51ms    61.2      21.8MB    13.8     31     7
4 cbindMatADF       8.24ms    9.4ms    95.7      21.8MB    19.9     48    10
5 vecIndex          8.12ms      9ms    97.7      14.9MB    19.9     49    10
6 pivot            97.59ms  99.75ms     9.99     92.6MB    12.0      5     6
7 data.table      171.63ms 172.34ms     5.41     88.4MB     5.41     3     3
8 sparseMatrix     23.73ms  30.18ms    17.6      38.5MB     7.82     9     4
9 xtabs               1.1s     1.1s     0.909   269.9MB     5.46     1     6

Filling a matrix through the cbind index and converting it to a data.frame afterwards is much faster than filling a data.frame directly (about 9.5 ms versus 132 ms median) and allocates far less memory (21.8 MB versus 79.2 MB). Calculating the position in the underlying vector is slightly faster still and allocates the least additional memory (14.9 MB).
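
If the fastest variant is needed in several places, it could be wrapped in a small function. The following is a sketch only: `index_to_df` is a hypothetical helper that is not part of the answer, and inferring the dimensions from the largest indices is an assumption that may not suit data with trailing empty rows or columns.

```r
# Hypothetical helper: wraps the linear-index approach from the benchmark.
# nr and nc default to the largest row and column index (an assumption).
index_to_df <- function(row, col, value, nr = max(row), nc = max(col)) {
  stopifnot(length(row) == length(col), length(row) == length(value))
  # value[NA_integer_] is a single NA of the same type as `value`.
  m <- matrix(value[NA_integer_], nr, nc)
  # Column-major position of each (row, col) pair.
  m[row + (col - 1L) * nr] <- value
  as.data.frame(m)
}

df1 <- data.frame(row = c(1, 3, 1, 2, 3), col = c(2, 2, 1, 2, 1), value = 1:5)
index_to_df(df1$row, df1$col, df1$value)
#   V1 V2
# 1  3  1
# 2 NA  4
# 3  5  2
```

If a (row, col) pair occurs more than once, the last value supplied for it overwrites the earlier ones.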

Answer 2

Score: 4

You can also use xtabs:

xtabs(value ~ row + col, data = df1)
#    col
# row 1 2
#   1 3 1
#   2 0 4
#   3 5 2
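
Note that the xtabs result differs from the output requested in the question: xtabs sums the values per cell, so a combination that never occurs shows up as 0 rather than NA. A short sketch of the conversion to a data.frame, using the same `as.data.frame.matrix` call as the benchmark in Answer 1 (`tab` and `df2` are illustrative names):

```r
df1 <- data.frame(row = c(1, 3, 1, 2, 3), col = c(2, 2, 1, 2, 1), value = 1:5)

tab <- xtabs(value ~ row + col, data = df1)

# Convert the two-way table to a plain data.frame.
df2 <- as.data.frame.matrix(tab)
df2

# Cell (2, 1) never occurs in df1. xtabs reports a sum of 0 there, not NA,
# so a missing cell cannot be told apart from a cell whose value is 0.
df2[2, 1] == 0
# [1] TRUE
```

For the same reason, repeated (row, col) pairs are added together instead of overwriting each other.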

Answer 3

Score: 3

Probably not the fastest solution but perhaps useful if you like Tidyverse functions. The idea is to pivot the col column into actual columns:

library(tidyverse)
df1 <- tibble(
  row = c(1,3,1,2,3),
  col = c(2,2,1,2,1),
  value = c(1:5)
)
df1 |>
  arrange(col) |>
  pivot_wider(names_from = col, values_from = value) |>
  select(!row)
#> # A tibble: 3 × 2
#>     `1`   `2`
#>   <int> <int>
#> 1     3     1
#> 2     5     2
#> 3    NA     4

Created on 2023-04-06 with reprex v2.0.2

Answer 4

Score: 2

Using data.table:

library(data.table)
setDT(df1)
dcast(df1, row ~ paste("Column", col))[, -1]
#    Column 1 Column 2
# 1:        3        1
# 2:       NA        4
# 3:        5        2

sparseMatrix approach:

Note that your data looks like a sparse matrix stored in data.frame format. Depending on your situation, you could consider this option too. Missing cells come out as 0 here, not NA.

library(Matrix)
# df1$row etc. work whether df1 is a data.frame or, after setDT(), a data.table
as.matrix(sparseMatrix(i = df1$row, j = df1$col, x = df1$value))
#      [,1] [,2]
# [1,]    3    1
# [2,]    0    4
# [3,]    5    2
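
When the target has rows or columns beyond the largest index, the shape can be passed through the `dims` argument of `sparseMatrix`. The sketch below assumes a 4 x 2 target purely for illustration; `sm` and `m` are illustrative names.

```r
library(Matrix)

df1 <- data.frame(row = c(1, 3, 1, 2, 3), col = c(2, 2, 1, 2, 1), value = 1:5)

# dims is given explicitly; the 4 x 2 shape is an assumption for this example.
# Without it, the shape is inferred from the largest indices (3 x 2 here).
sm <- sparseMatrix(i = df1$row, j = df1$col, x = df1$value, dims = c(4, 2))

# Only the five supplied cells are stored; every other cell is an implicit 0.
nnzero(sm)
# [1] 5

# Densify only when a regular matrix is really needed.
m <- as.matrix(sm)
dim(m)
# [1] 4 2
```

Keeping the object sparse avoids allocating the full matrix, which may matter when most cells are empty.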
