How to multiply each column in a data frame by a different value per column

Question

Consider the following data frame:

      x y z
    1 0 0 0
    2 1 0 0
    3 0 1 0
    4 1 1 0
    5 0 0 1
    6 1 0 1
    7 0 1 1
    8 1 1 1
    -------
    x 4 2 1 <--- vector to multiply by

I would like to multiply each column by a separate value, for example c(4, 2, 1), giving:

      x y z
    1 0 0 0
    2 4 0 0
    3 0 2 0
    4 4 2 0
    5 0 0 1
    6 4 0 1
    7 0 2 1
    8 4 2 1

Code:

    pw2 <- c(4, 2, 1)                  # one multiplier per column
    s01 <- seq_len(2) - 1              # the values 0 and 1
    df <- expand.grid(x = s01, y = s01, z = s01)
    df
    # Loop-based solution: scale column d by pw2[d]
    for (d in seq_len(3)) df[, d] <- df[, d] * pw2[d]
    df

Question: Is there a vectorized solution without a for loop (in base R)?

Note: the question https://stackoverflow.com/questions/36111444/multiply-columns-in-a-data-frame-by-a-vector is ambiguous because it covers two tasks:

  • multiplying each row of the data frame by a different value.
  • multiplying each column of the data frame by a different value.

Both tasks are easily solved with a for loop. Here, a vectorized solution is explicitly requested.

Answer 1

Score: 11


Use sweep() to apply a function along a margin of a data frame (here margin 2, the columns):

    sweep(df, 2, pw2, `*`)

or with col():

    df * pw2[col(df)]

Output:

      x y z
    1 0 0 0
    2 4 0 0
    3 0 2 0
    4 4 2 0
    5 0 0 1
    6 4 0 1
    7 0 2 1
    8 4 2 1
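A minimal sketch of why the col() indexing is needed, assuming the df and pw2 from the question. Multiplying by pw2 directly does not fail, but R recycles the vector down the rows of each column, so the multipliers end up misaligned. Indexing pw2 with col(df) expands it to one multiplier per cell instead. The names naive, per_cell and fixed are only illustrative.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

# Naive attempt: pw2 is recycled down the rows of each column,
# so the multipliers do not line up with the columns.
naive <- df * pw2
naive$x
#> [1] 0 2 0 4 0 1 0 2

# col(df) holds the column index of every cell, so pw2[col(df)]
# expands pw2 to one multiplier per cell, in column-major order.
per_cell <- pw2[col(df)]
head(per_cell, 10)
#> [1] 4 4 4 4 4 4 4 4 2 2

fixed <- df * per_cell
fixed$x
#> [1] 0 4 0 4 0 4 0 4
```

Because 24 cells is a multiple of length(pw2), the naive version does not even raise a recycling warning, which makes the mistake easy to miss.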

For large data frames, consider collapse::TRA, which is roughly 10x faster than the other answers (see the benchmark):

    collapse::TRA(df, pw2, "*")

Benchmark:

    bench::mark(sweep = sweep(df, 2, pw2, `*`),
                col = df * pw2[col(df)],
                '%*%' = setNames(
                  as.data.frame(as.matrix(df) %*% diag(pw2)),
                  names(df)
                ),
                TRA = collapse::TRA(df, pw2, "*"),
                mapply = data.frame(mapply(FUN = `*`, df, pw2)),
                apply = t(apply(df, 1, \(x) x*pw2)),
                t = t(t(df)*pw2),
                check = FALSE
    )
    # A tibble: 7 × 13
      expression      min   median  itr/s¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴
      <bch:expr> <bch:tm>  <bch:t>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t>
    1 sweep       346.7µs  382.1µs   2427.   1.23KB   10.6   1141     5 470.2ms
    2 col         303.1µs  330.4µs   2760.     784B    8.45  1307     4 473.5ms
    3 %*%          72.8µs   77.9µs  11861.     480B   10.6   5599     5 472.1ms
    4 TRA             5µs    5.5µs 167050.       0B   16.7   9999     1  59.9ms
    5 mapply      117.6µs  127.9µs   7309.     480B   10.6   3442     5 470.9ms
    6 apply       107.8µs  117.9µs   7887.   6.49KB   12.9   3658     6 463.8ms
    7 t            55.3µs   59.7µs  15238.     720B    8.13  5620     3 368.8ms
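The benchmark includes t(t(df) * pw2) without explaining it. The following is a small sketch of why that expression works, assuming the df and pw2 from the question; the names m, scaled and expected are illustrative only. After transposing, each column of the matrix has the same length as pw2, so recycling lines up each multiplier with the right original column.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

m <- t(df)            # 3 x 8 matrix: one row per original column
dim(m)
#> [1] 3 8

# pw2 (length 3) is recycled down each length-3 column of m,
# so pw2[i] always meets row i, i.e. original column i.
scaled <- t(m * pw2)  # transpose back to 8 x 3

# Compare with the explicit loop from the question.
expected <- df
for (d in seq_along(pw2)) expected[, d] <- df[, d] * pw2[d]
all.equal(unname(scaled), unname(as.matrix(expected)))
#> [1] TRUE
```

Note that the result is a matrix rather than a data frame, so it may need as.data.frame() afterwards.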

Answer 2

Score: 9


Convert df to a matrix and pw2 to a diagonal matrix with diag(), multiply them with the %*% matrix multiplication operator, then convert the result back to a data frame. This strips the column names, so wrap the call in setNames() to preserve them.

    setNames(
      as.data.frame(as.matrix(df) %*% diag(pw2)),
      names(df)
    )
      x y z
    1 0 0 0
    2 4 0 0
    3 0 2 0
    4 4 2 0
    5 0 0 1
    6 4 0 1
    7 0 2 1
    8 4 2 1
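To make the diagonal-matrix step concrete, here is a short sketch, assuming the df and pw2 from the question (D, m and res are illustrative names). It also shows one edge case worth knowing about: diag() called with a single number builds an identity matrix of that size, so a one-column data frame would need the nrow argument.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

D <- diag(pw2)   # 3 x 3 matrix with pw2 on the diagonal, zeros elsewhere
m <- as.matrix(df) %*% D
# Entry [i, j] of m is sum_k df[i, k] * D[k, j] = df[i, j] * pw2[j],
# because D[k, j] is zero unless k == j.

res <- setNames(as.data.frame(m), names(df))

# Edge case: with a single multiplier, diag(4) is a 4 x 4 identity matrix,
# so passing nrow explicitly is safer for one-column data frames.
dim(diag(4))
#> [1] 4 4
dim(diag(4, nrow = 1))
#> [1] 1 1
```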

Answer 3

Score: 6


Using mapply(), which returns a matrix:

    mapply(FUN = `*`, df, pw2)
         x y z
    [1,] 0 0 0
    [2,] 4 0 0
    [3,] 0 2 0
    [4,] 4 2 0
    [5,] 0 0 1
    [6,] 4 0 1
    [7,] 0 2 1
    [8,] 4 2 1

and as a data frame:

    data.frame(mapply(FUN = `*`, df, pw2))
      x y z
    1 0 0 0
    2 4 0 0
    3 0 2 0
    4 4 2 0
    5 0 0 1
    6 4 0 1
    7 0 2 1
    8 4 2 1
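As a rough illustration of what mapply() is doing here, assuming the df and pw2 from the question: a data frame is a list of columns, so mapply() pairs each column with the matching element of pw2. The variant below, which is one possible alternative and not part of the answer, assigns the result into out[] so the data frame structure is kept without calling data.frame(). The names res and out are illustrative.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

# A data frame is a list of columns, so mapply() walks df and pw2 in
# parallel: x with 4, y with 2, z with 1.
res <- mapply(FUN = `*`, df, pw2)
is.matrix(res)   # the equal-length results are simplified into a matrix
#> [1] TRUE

# With SIMPLIFY = FALSE the result stays a list of columns; assigning it
# into out[] keeps the data frame class and the column names.
out <- df
out[] <- mapply(FUN = `*`, df, pw2, SIMPLIFY = FALSE)
is.data.frame(out)
#> [1] TRUE
```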

Answer 4

Score: 6

Another option is to use `apply` over the rows and transpose the result:

``` r
pw2 <- c(4, 2, 1)
t(apply(df, 1, \(x) x*pw2))
#>   x y z
#> 1 0 0 0
#> 2 4 0 0
#> 3 0 2 0
#> 4 4 2 0
#> 5 0 0 1
#> 6 4 0 1
#> 7 0 2 1
#> 8 4 2 1
```

Created on 2023-04-10 with reprex v2.0.2
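The transpose is needed because of how apply() assembles its result. A brief sketch, assuming the df and pw2 from the question (raw and res are illustrative names, and function(x) is used in place of \(x) only so that the snippet also runs on R versions before 4.1):

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

# apply(df, 1, f) calls f once per row; each length-3 result becomes a
# *column* of the output, so the raw result is 3 x 8.
raw <- apply(df, 1, function(x) x * pw2)
dim(raw)
#> [1] 3 8

# t() restores the original 8 x 3 orientation. The result is a matrix,
# so wrap it in as.data.frame() if a data frame is needed.
res <- as.data.frame(t(raw))
names(res)
#> [1] "x" "y" "z"
```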

Answer 5

Score: 5

Here is another option: turn the vector into a matrix with the same dimensions as the data frame, then simply multiply the two:

    t(replicate(nrow(df), pw2)) * df

**Output**

      x y z
    1 0 0 0
    2 4 0 0
    3 0 2 0
    4 4 2 0
    5 0 0 1
    6 4 0 1
    7 0 2 1
    8 4 2 1
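A short sketch of the multiplier matrix this approach builds, assuming the df and pw2 from the question. The matrix(..., byrow = TRUE) form is an alternative suggested here, not part of the answer; mult1, mult2 and res are illustrative names.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

# replicate() binds the copies of pw2 as columns (3 x 8), hence the t().
mult1 <- t(replicate(nrow(df), pw2))
dim(mult1)
#> [1] 8 3

# Filling a matrix row by row gives the same 8 x 3 multiplier directly.
mult2 <- matrix(pw2, nrow = nrow(df), ncol = length(pw2), byrow = TRUE)
all.equal(mult1, mult2)
#> [1] TRUE

res <- df * mult2   # element-wise product, returned as a data frame
```

Either way the full multiplier matrix is materialized in memory, which may matter for very large data frames.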
Answer 6

Score: 5

The existing mapply approach looks great, but I believe we can be more efficient by using Map + list2DF instead (especially if you prefer to stay with base R).
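For readers who have not used this pair of functions, a minimal sketch of the two steps, assuming the df and pw2 from the question (cols and res are illustrative names). Note that list2DF() is only available from R 4.0.0 onwards.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

# Map() multiplies column i by pw2[i] and returns a named list of columns.
cols <- Map(`*`, df, pw2)
names(cols)
#> [1] "x" "y" "z"

# list2DF() (R >= 4.0.0) turns that list into a data frame.
res <- list2DF(cols)
is.data.frame(res)
#> [1] TRUE
```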


Below is a benchmark of the mapply and Map variants:

    library(microbenchmark)

    microbenchmark(
      "mapply1" = data.frame(mapply(FUN = `*`, df, pw2)),
      "mapply2" = as.data.frame(mapply(FUN = `*`, df, pw2)),
      "Map1" = list2DF(Map(`*`, df, pw2)),
      "Map2" = list2DF(Map(`*`, df, as.list(pw2)))
    )

which gives

    Unit: microseconds
        expr  min    lq    mean median     uq   max neval
     mapply1 74.6 78.60 112.163  97.05 140.50 342.6   100
     mapply2 34.6 38.20  55.513  42.70  67.40 313.5   100
        Map1 23.8 25.25  33.728  27.60  41.30 113.8   100
        Map2 25.9 28.75  40.866  32.95  47.65 238.6   100

We can also add the Map variants to the benchmark provided by @Maël:

    bc <- bench::mark(
      sweep = sweep(df, 2, pw2, `*`),
      col = df * pw2[col(df)],
      "%*%" = setNames(
        as.data.frame(as.matrix(df) %*% diag(pw2)),
        names(df)
      ),
      TRA = collapse::TRA(df, pw2, "*"),
      mapply1 = data.frame(mapply(FUN = `*`, df, pw2)),
      mapply2 = as.data.frame(mapply(FUN = `*`, df, pw2)),
      Map1 = list2DF(Map(`*`, df, pw2)),
      Map2 = list2DF(Map(`*`, df, as.list(pw2))),
      apply = t(apply(df, 1, \(x) x * pw2)),
      t = t(t(df) * pw2),
      check = FALSE
    )

Here Map comes in second place in terms of speed, behind collapse::TRA:

    # A tibble: 10 × 13
       expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
       <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
     1 sweep       201.7µs  249.2µs     3526.  101.24KB     12.6  1680     6
     2 col         174.9µs  225.6µs     3637.    9.02KB     10.4  1748     5
     3 %*%          45.4µs   52.9µs    17026.   36.95KB     12.5  8158     6
     4 TRA           3.4µs    3.8µs   226089.  905.09KB     22.6  9999     1
     5 mapply1      71.6µs   78.4µs    11958.      480B     14.7  5681     7
     6 mapply2      33.1µs   37.4µs    25339.      480B     17.7  9993     7
     7 Map1         22.5µs   26.1µs    35649.        0B     17.8  9995     5
     8 Map2         25.3µs   29.4µs    31785.        0B     19.1  9994     6
     9 apply        70.2µs   80.7µs    11684.   11.91KB     14.7  5562     7
    10 t            34.8µs   40.2µs    23608.    3.77KB     14.2  9994     6
    # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
    #   time <list>, gc <list>

and autoplot(bc) shows

[Figure: plot produced by autoplot(bc)]
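The benchmarks pass check = FALSE because the candidates return different kinds of objects (some data frames, some matrices). The following sketch, assuming the df and pw2 from the question, checks that the base R variants nonetheless agree on the numbers. The helper as_plain and the names results, reference and agree are hypothetical, and collapse::TRA is left out so the snippet needs no extra packages.

```r
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x = s01, y = s01, z = s01)

results <- list(
  sweep  = sweep(df, 2, pw2, `*`),
  col    = df * pw2[col(df)],
  matmul = as.matrix(df) %*% diag(pw2),
  mapply = mapply(FUN = `*`, df, pw2),
  Map    = list2DF(Map(`*`, df, pw2)),
  apply  = t(apply(df, 1, function(x) x * pw2)),
  t      = t(t(df) * pw2)
)

# Some entries are data frames and some are matrices. Stripped of class
# and dimnames, they can be compared as plain numeric matrices.
as_plain <- function(x) unname(as.matrix(x))
reference <- as_plain(results$sweep)
agree <- vapply(
  results,
  function(r) isTRUE(all.equal(as_plain(r), reference)),
  logical(1)
)
all(agree)
#> [1] TRUE
```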

huangapple
  • Published on 2023-04-10 20:07:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75976974.html