Posted 2023-03-09 15:06:08


Subtracting vectors by group from two dataframes

Question

I have two data frames in R. The first contains several feature columns, plus a group column (a factor) that says which group each sample (row) belongs to. The second has the same feature columns and one row per unique group. From each sample in the first data frame I want to subtract the vector from the second data frame that belongs to the same group, matching on the group column.

Here is an example of the main dataset:

df_repr <- structure(list(f1 = c(-3.9956064225704, 
-0.52380279948658, 0.61089389331505, -3.47273625634875, -4.486918671214, 
-6.1761970731672, -4.62305749757367, -4.42540643005429, -3.61613137597131, 
-3.29821425516253), f2 = c(-1.57918114753228, 
-4.10523012500727, -1.80270009366593, -0.00905317702835884, -0.899585192079915, 
-2.89341515186212, 0.0132542126386332, -3.32639898550135, -0.867793877742314, 
0.0911950321630834), f3 = c(-6.02532301769732, 
-4.90073348094302, -3.73159604513274, -3.55290209472808, -6.63194560195811, 
2.69409789701296, -4.17675978927128, -3.84141885970095, -1.20571283849034, 
1.54287440902102), group = structure(c(1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L))

Here is an example dataframe with vectors to be subtracted from each row of the corresponding group of the first dataframe:

to_subtract <- structure(list(group = structure(1:2, .Label = c("A", 
"B"), class = "factor"), f1 = c(-2.78048744402161, 
-2.33583431665818), f2 = c(-2.56086962108741, 
-0.689157827347865), f3 = c(-3.60224982918457, 
-0.782365376308658)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

# # A tibble: 2 × 4
#   group    f1     f2     f3
#   <fct> <dbl>  <dbl>  <dbl>
# 1 A     -2.78 -2.56  -3.60
# 2 B     -2.34 -0.689 -0.782

I tried to do it like this:

df_repr %>%
  group_by(group) %>%
  mutate(across(where(is.numeric),
         ~ . - to_subtract[to_subtract$group == unique(.$group), -1]))

But I get the following error:

Error in `mutate()`:
ℹ️ In argument: `across(...)`.
ℹ️ In group 1: `group = A`.
Caused by error in `across()`:
! Can't compute column `f1`.
Caused by error in `f1$group`:
! $ operator is invalid for atomic vectors

Expected output for this example:

       f1     f2      f3 group
    <dbl>  <dbl>   <dbl> <fct>
 1 -1.22   0.982 -2.42   A    
 2  2.26  -1.54  -1.30   A    
 3  3.39   0.758 -0.129  A    
 4 -0.692  2.55   0.0493 A    
 5 -1.71   1.66  -3.03   A    
 6 -3.84  -2.20   3.48   B    
 7 -2.29   0.702 -3.39   B    
 8 -2.09  -2.64  -3.06   B    
 9 -1.28  -0.179 -0.423  B    
10 -0.962  0.780  2.33   B
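
The error message points at the cause: inside `across()`, the lambda's `.` is one column at a time (an atomic vector such as `f1`), not the grouped data frame, so `.$group` cannot work. A minimal sketch of a repaired version of the attempt follows, assuming dplyr's `cur_group()` and `cur_column()` helpers are used to identify the current group and column. The toy tables `main` and `offsets` and the result name `fixed` are hypothetical stand-ins for `df_repr` and `to_subtract`.

```r
library(dplyr)

# Hypothetical toy data with the same shape as the question's data frames
main <- tibble(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
offsets <- tibble(
  group = factor(c("A", "B")),
  f1 = c(1, 2),
  f2 = c(10, 20)
)

fixed <- main %>%
  group_by(group) %>%
  mutate(across(where(is.numeric), \(col) {
    # Row of `offsets` that belongs to the group currently being processed
    row <- as.character(offsets$group) == as.character(cur_group()$group)
    # cur_column() gives the name of the column across() is working on
    col - offsets[[cur_column()]][row]
  })) %>%
  ungroup()
```

Comparing the group labels as character avoids depending on both factors having identical level sets.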

Answer 1

Score: 4

You can use powerjoin with (conflict = `-`):

library(powerjoin)

power_left_join(df_repr, to_subtract, by = "group", conflict = `-`)

# A tibble: 10 × 4
   group     f1     f2      f3
   <fct>  <dbl>  <dbl>   <dbl>
 1 A     -1.22   0.982 -2.42
 2 A      2.26  -1.54  -1.30  
 3 A      3.39   0.758 -0.129
 4 A     -0.692  2.55   0.0493
 5 A     -1.71   1.66  -3.03
 6 B     -3.84  -2.20   3.48
 7 B     -2.29   0.702 -3.39
 8 B     -2.09  -2.64  -3.06  
 9 B     -1.28  -0.179 -0.423
10 B     -0.962  0.780  2.33
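
Judging by the output above, conflict = `-` subtracts each same-named column of the right table from its counterpart in the left table after matching rows on `group`. For readers who would rather not add a dependency, the same lookup-and-subtract idea can be sketched in base R with `match()`. This is an illustration on hypothetical toy tables (`main`, `offsets`), not a description of how powerjoin is implemented.

```r
# Hypothetical toy data shaped like the question's data frames
main <- data.frame(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
offsets <- data.frame(
  group = factor(c("A", "B")),
  f1 = c(1, 2),
  f2 = c(10, 20)
)

# Columns present in both tables, apart from the key
cols <- setdiff(intersect(names(main), names(offsets)), "group")

# For every row of `main`, the index of its group's row in `offsets`
idx <- match(main$group, offsets$group)

# Expand `offsets` to one row per row of `main`, then subtract column-wise
result <- main
result[cols] <- main[cols] - offsets[idx, cols]
```

A group that is missing from `offsets` gets an `NA` index, so its rows come out as `NA` rather than raising an error.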

Another dplyr::group_modify approach:

df_repr %>%
  group_by(group) %>%
  group_modify(~ mutate(.x, across(f1:f3, \(val) {
    val - filter(to_subtract, group == .y$group)[[cur_column()]]
  }))) %>%
  ungroup()

Answer 2

Score: 2

<details>
<summary>English:</summary>

You can combine your target data frame together with `to_subtract`, and at the same time set a logical column to indicate which one to subtract from. Then do the subtraction in `mutate`, and re-shape to your desired format.

The `.by` argument of `mutate()` requires `dplyr` >= 1.1.0. With an older version, call `group_by(group)` before `mutate()` instead.

library(dplyr)

rbind(to_subtract %>% mutate(target = TRUE), df_repr %>% mutate(target = FALSE)) %>%
  mutate(across(where(is.numeric), ~ .x - .x[target]), .by = group) %>%
  filter(!target) %>%
  select(-target)

# A tibble: 10 × 4
   group     f1     f2      f3
   <fct>  <dbl>  <dbl>   <dbl>
 1 A     -1.22   0.982 -2.42
 2 A      2.26  -1.54  -1.30
 3 A      3.39   0.758 -0.129
 4 A     -0.692  2.55   0.0493
 5 A     -1.71   1.66  -3.03
 6 B     -3.84  -2.20   3.48
 7 B     -2.29   0.702 -3.39
 8 B     -2.09  -2.64  -3.06
 9 B     -1.28  -0.179 -0.423
10 B     -0.962  0.780  2.33


</details>
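
For dplyr versions older than 1.1.0, the `group_by()` fallback mentioned in this answer might look like the sketch below. It runs on hypothetical toy tables (`main`, `offsets`) rather than the question's data, and uses `bind_rows()` in place of `rbind()`; both match columns by name.

```r
library(dplyr)

# Hypothetical toy data shaped like the question's data frames
main <- tibble(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
offsets <- tibble(
  group = factor(c("A", "B")),
  f1 = c(1, 2),
  f2 = c(10, 20)
)

# Stack both tables and flag the rows that hold the values to subtract
combined <- bind_rows(
  mutate(offsets, target = TRUE),
  mutate(main, target = FALSE)
)

result <- combined %>%
  group_by(group) %>%
  # Within each group, .x[target] is the single flagged value of the column
  mutate(across(where(is.numeric), ~ .x - .x[target])) %>%
  ungroup() %>%
  filter(!target) %>%
  select(-target)
```

The trick relies on `offsets` having exactly one row per group. With duplicate rows for a group, `.x[target]` would have more than one element and the subtraction would recycle or fail.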



# Answer 3
**Score**: 2

<details>
<summary>English:</summary>

Another approach is to use `group_modify()` and do `data.frame` operations. For this, the number of rows taken from `to_subtract` has to match the number of rows in the current group of `df_repr`, which is why we repeat the matching row of `to_subtract` once for each row of that group:

``` r
library(dplyr)

df_repr %>%
  group_by(group) %>% 
  group_modify(\(df, grp) {
    # get the current group's row in `to_subtract` and drop the `group` column
    df2 <- to_subtract[to_subtract$group == grp$group, -1]
    # repeat that row to match the number of rows in `df`, then subtract
    df - df2[rep(1, nrow(df)), ]
  })
#> # A tibble: 10 × 4
#> # Groups:   group [2]
#>    group     f1     f2      f3
#>    <fct>  <dbl>  <dbl>   <dbl>
#>  1 A     -1.22   0.982 -2.42  
#>  2 A      2.26  -1.54  -1.30  
#>  3 A      3.39   0.758 -0.129 
#>  4 A     -0.692  2.55   0.0493
#>  5 A     -1.71   1.66  -3.03  
#>  6 B     -3.84  -2.20   3.48  
#>  7 B     -2.29   0.702 -3.39  
#>  8 B     -2.09  -2.64  -3.06  
#>  9 B     -1.28  -0.179 -0.423 
#> 10 B     -0.962  0.780  2.33
```

Data from OP

``` r
df_repr <- structure(list(f1 = c(-3.9956064225704, 
-0.52380279948658, 0.61089389331505, -3.47273625634875, -4.486918671214, 
-6.1761970731672, -4.62305749757367, -4.42540643005429, -3.61613137597131, 
-3.29821425516253), f2 = c(-1.57918114753228, 
-4.10523012500727, -1.80270009366593, -0.00905317702835884, -0.899585192079915, 
-2.89341515186212, 0.0132542126386332, -3.32639898550135, -0.867793877742314, 
0.0911950321630834), f3 = c(-6.02532301769732, 
-4.90073348094302, -3.73159604513274, -3.55290209472808, -6.63194560195811, 
2.69409789701296, -4.17675978927128, -3.84141885970095, -1.20571283849034, 
1.54287440902102), group = structure(c(1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L))

to_subtract <- structure(list(group = structure(1:2, .Label = c("A", 
"B"), class = "factor"), f1 = c(-2.78048744402161, 
-2.33583431665818), f2 = c(-2.56086962108741, 
-0.689157827347865), f3 = c(-3.60224982918457, 
-0.782365376308658)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))
```

<sup>Created on 2023-03-09 with reprex v2.0.2</sup>

</details>
