Subtracting vectors by group from two dataframes
Question
I have two data frames in R.
The first data frame contains several feature columns, as well as a group column (a factor variable) that tells which group each sample (row) belongs to. The second data frame contains the same columns, and its number of rows equals the number of unique groups. From each sample of the first data frame I want to subtract the corresponding vector from the second data frame, where the correspondence is given by the group column that both data frames share.
Here is an example of the main dataset:
df_repr <- structure(list(f1 = c(-3.9956064225704,
-0.52380279948658, 0.61089389331505, -3.47273625634875, -4.486918671214,
-6.1761970731672, -4.62305749757367, -4.42540643005429, -3.61613137597131,
-3.29821425516253), f2 = c(-1.57918114753228,
-4.10523012500727, -1.80270009366593, -0.00905317702835884, -0.899585192079915,
-2.89341515186212, 0.0132542126386332, -3.32639898550135, -0.867793877742314,
0.0911950321630834), f3 = c(-6.02532301769732,
-4.90073348094302, -3.73159604513274, -3.55290209472808, -6.63194560195811,
2.69409789701296, -4.17675978927128, -3.84141885970095, -1.20571283849034,
1.54287440902102), group = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L))
Here is an example dataframe with vectors to be subtracted from each row of the corresponding group of the first dataframe:
to_subtract <- structure(list(group = structure(1:2, .Label = c("A",
"B"), class = "factor"), f1 = c(-2.78048744402161,
-2.33583431665818), f2 = c(-2.56086962108741,
-0.689157827347865), f3 = c(-3.60224982918457,
-0.782365376308658)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
# # A tibble: 2 × 4
# group f1 f2 f3
# <fct> <dbl> <dbl> <dbl>
# 1 A -2.78 -2.56 -3.60
# 2 B -2.34 -0.689 -0.782
I tried to do it like this:
df_repr %>%
  group_by(group) %>%
  mutate(across(where(is.numeric),
                ~ . - to_subtract[to_subtract$group == unique(.$group), -1]))
But I get the following error:
Error in `mutate()`:
ℹ️ In argument: `across(...)`.
ℹ️ In group 1: `group = A`.
Caused by error in `across()`:
! Can't compute column `f1`.
Caused by error in `f1$group`:
! $ operator is invalid for atomic vectors
Expected output for this example:
f1 f2 f3 group
<dbl> <dbl> <dbl> <fct>
1 -1.22 0.982 -2.42 A
2 2.26 -1.54 -1.30 A
3 3.39 0.758 -0.129 A
4 -0.692 2.55 0.0493 A
5 -1.71 1.66 -3.03 A
6 -3.84 -2.20 3.48 B
7 -2.29 0.702 -3.39 B
8 -2.09 -2.64 -3.06 B
9 -1.28 -0.179 -0.423 B
10 -0.962 0.780 2.33 B
Answer 1
Score: 4
You can use powerjoin with conflict = `-`:
library(powerjoin)
power_left_join(df_repr, to_subtract, by = "group", conflict = `-`)
# A tibble: 10 × 4
group f1 f2 f3
<fct> <dbl> <dbl> <dbl>
1 A -1.22 0.982 -2.42
2 A 2.26 -1.54 -1.30
3 A 3.39 0.758 -0.129
4 A -0.692 2.55 0.0493
5 A -1.71 1.66 -3.03
6 B -3.84 -2.20 3.48
7 B -2.29 0.702 -3.39
8 B -2.09 -2.64 -3.06
9 B -1.28 -0.179 -0.423
10 B -0.962 0.780 2.33
Another approach, using dplyr::group_modify:
df_repr %>%
  group_by(group) %>%
  group_modify(~ mutate(.x, across(f1:f3, \(val) {
    val - filter(to_subtract, group == .y$group)[[cur_column()]]
  }))) %>%
  ungroup()
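For comparison, the same per-group lookup can be sketched in base R with match(), without any extra packages. This is only an illustration under assumptions: the data frames df_toy and sub_toy and their values are made up to mirror the layout of df_repr and to_subtract, and feature_cols and idx are hypothetical helper names.

```r
# Toy data with the same layout as the question (values are made up)
df_toy <- data.frame(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
sub_toy <- data.frame(
  group = factor(c("A", "B")),
  f1 = c(0.5, 1),
  f2 = c(5, 10)
)

# Every column of the lookup table except the key is a feature
feature_cols <- setdiff(names(sub_toy), "group")

# For each row of df_toy, the index of the sub_toy row with the same group
idx <- match(df_toy$group, sub_toy$group)

# Expand the lookup rows to the shape of df_toy and subtract element-wise
result <- df_toy
result[feature_cols] <- df_toy[feature_cols] - sub_toy[idx, feature_cols]
result
```

Because match() returns one lookup row per sample, both operands of the subtraction have the same dimensions, so no grouping step is needed.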
Answer 2
Score: 2
You can stack `to_subtract` on top of your target data frame and, at the same time, add a logical column `target` that marks the rows coming from `to_subtract`. Then do the subtraction in `mutate`, grouped by `group`, and drop the helper rows and the helper column to get your desired format.
To use the `.by` argument of `mutate()`, you need `dplyr` version >= 1.1.0. With an older version, use the traditional `group_by(group)` before `mutate` instead.
library(dplyr)
rbind(to_subtract %>% mutate(target = TRUE), df_repr %>% mutate(target = FALSE)) %>%
  mutate(across(where(is.numeric), ~ .x - .x[target]), .by = group) %>%
  filter(!target) %>%
  select(-target)
# A tibble: 10 × 4
group f1 f2 f3
<fct> <dbl> <dbl> <dbl>
1 A -1.22 0.982 -2.42
2 A 2.26 -1.54 -1.30
3 A 3.39 0.758 -0.129
4 A -0.692 2.55 0.0493
5 A -1.71 1.66 -3.03
6 B -3.84 -2.20 3.48
7 B -2.29 0.702 -3.39
8 B -2.09 -2.64 -3.06
9 B -1.28 -0.179 -0.423
10 B -0.962 0.780 2.33
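The `group_by(group)` fallback mentioned for older `dplyr` versions might look roughly like the sketch below. It is an illustration rather than the answerer's code: `df_toy` and `sub_toy` are made-up stand-ins for `df_repr` and `to_subtract`, and `bind_rows()` is swapped in for `rbind()` as an assumption of convenience.

```r
library(dplyr)

# Toy stand-ins for df_repr and to_subtract (values are made up)
df_toy <- tibble(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
sub_toy <- tibble(
  group = factor(c("A", "B")),
  f1 = c(0.5, 1),
  f2 = c(5, 10)
)

result <- bind_rows(
  mutate(sub_toy, target = TRUE),   # rows holding the values to subtract
  mutate(df_toy, target = FALSE)    # rows to be adjusted
) %>%
  group_by(group) %>%
  # within each group, .x[target] is the single value taken from sub_toy
  mutate(across(where(is.numeric), ~ .x - .x[target])) %>%
  ungroup() %>%
  filter(!target) %>%
  select(-target)

result
```

The only difference from the `.by` version is the explicit `group_by()` / `ungroup()` pair around `mutate()`.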
# Answer 3
**Score**: 2
Another approach is to use `group_modify()` and do `data.frame` operations. For this, the number of rows of the two operands has to match within each group, which is why we replicate the current group's row of `to_subtract` once for every row of that group in `df_repr`:
``` r
library(dplyr)
df_repr %>%
  group_by(group) %>%
  group_modify(\(df, grp) {
    # get current group in `to_subtract` and drop `group` column
    df2 <- to_subtract[to_subtract$group == grp$group, -1]
    # replicate the single row of `df2` to match the rows of `df`, then subtract
    df - df2[rep(1, nrow(df)), ]
  })
#> # A tibble: 10 × 4
#> # Groups: group [2]
#> group f1 f2 f3
#> <fct> <dbl> <dbl> <dbl>
#> 1 A -1.22 0.982 -2.42
#> 2 A 2.26 -1.54 -1.30
#> 3 A 3.39 0.758 -0.129
#> 4 A -0.692 2.55 0.0493
#> 5 A -1.71 1.66 -3.03
#> 6 B -3.84 -2.20 3.48
#> 7 B -2.29 0.702 -3.39
#> 8 B -2.09 -2.64 -3.06
#> 9 B -1.28 -0.179 -0.423
#> 10 B -0.962 0.780 2.33
```

Data from OP

``` r
df_repr <- structure(list(f1 = c(-3.9956064225704,
-0.52380279948658, 0.61089389331505, -3.47273625634875, -4.486918671214,
-6.1761970731672, -4.62305749757367, -4.42540643005429, -3.61613137597131,
-3.29821425516253), f2 = c(-1.57918114753228,
-4.10523012500727, -1.80270009366593, -0.00905317702835884, -0.899585192079915,
-2.89341515186212, 0.0132542126386332, -3.32639898550135, -0.867793877742314,
0.0911950321630834), f3 = c(-6.02532301769732,
-4.90073348094302, -3.73159604513274, -3.55290209472808, -6.63194560195811,
2.69409789701296, -4.17675978927128, -3.84141885970095, -1.20571283849034,
1.54287440902102), group = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L))
to_subtract <- structure(list(group = structure(1:2, .Label = c("A",
"B"), class = "factor"), f1 = c(-2.78048744402161,
-2.33583431665818), f2 = c(-2.56086962108741,
-0.689157827347865), f3 = c(-3.60224982918457,
-0.782365376308658)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
```

<sup>Created on 2023-03-09 with reprex v2.0.2</sup>
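
As a possible alternative to replicating rows by hand, base R's `sweep()` subtracts one value per column from every row, which is the operation performed inside each group. The sketch below is an assumption-laden illustration, not part of the answer: `group_rows` and `offsets` are made-up toy objects standing in for one group's feature columns and its matching row of `to_subtract`.

```r
# Feature columns of a single group (toy values)
group_rows <- data.frame(f1 = c(1, 2), f2 = c(10, 20))

# The values to subtract for that group, one per column (toy values)
offsets <- c(f1 = 0.5, f2 = 5)

# MARGIN = 2 applies the statistics column-wise; the default FUN is "-"
adjusted <- sweep(group_rows, 2, offsets)
adjusted
```

Inside `group_modify()`, the same call would take the group's data frame as the first argument and the unlisted lookup row as the third.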