Performing a mutate function incorporating character strings
Question
This is my data frame:
data.frame(
condition = as.factor(c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM", "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM")),
time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
value = as.numeric(c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)))
I have many more conditions. Each one names a bacterium (e.g. ecoli or staph) followed by the medium it was grown in, so the conditions are written like "ecoli_RPMI", "staph_RPMI", "ecoli_DMEM", "staph_DMEM". I have multiple time points (50 or so) and multiple bacteria and media.
I also have conditions for just the media controls, i.e. "RPMI" and "DMEM", which also have corresponding values across multiple time points.
I am trying to subtract the "value" on the media control's row, i.e. the value for "RPMI", from all bacteria with the RPMI suffix (x_RPMI, i.e. "ecoli_RPMI", "staph_RPMI") and assign the results to a new column named "corrected.values". For example: value for ecoli_RPMI - value for RPMI.
The desired result would look like this:
data.frame(
condition = as.factor(c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM", "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM")),
time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
value = as.numeric(c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)),
corrected_value = as.numeric(c(0.1, 0.1, 0, 0.3, 0.3, 0, 0.8, 0.7, 0, 0.5, 0.6, 0)))
I have tried all sorts:
- A group_by statement with mutate and case_when, a bit like this:
df %>%
  group_by(time) %>%
  mutate(corrected_value = case_when(
    condition == "ecoli_RPMI" ~ value - value[condition == "RPMI"]
    # ... plus one branch for every other condition
  ))
  spelling out all the possibilities, but this doesn't seem to work. I wondered whether it is possible to use a string argument to simplify this, as the strings are consistent across all of the conditions, i.e. always "bacteria"_"media".
- I also tried pivot_wider but didn't have any luck.
Many thanks for your help!
Answer 1
Score: 1
I think the simplest solution would be to split your condition variable into two variables (bacterium, medium) and then perform the subtraction on grouped data. You could do this:
library(dplyr)
library(stringr)

data.frame(
  condition = as.factor(c(
    "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM",
    "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM"
  )),
  time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
  value = as.numeric(c(
    0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2
  ))
) %>%
  mutate(
    # Label each row with its bacterium; media controls get "None"
    bacterium = case_when(
      str_detect(condition, "ecoli") ~ "Ecoli",
      str_detect(condition, "staph") ~ "Staph",
      TRUE ~ "None"
    ),
    # Label each row with its medium
    medium = case_when(
      str_detect(condition, "RPMI") ~ "RPMI",
      str_detect(condition, "DMEM") ~ "DMEM",
      TRUE ~ "None"
    )
  ) %>%
  group_by(medium, time) %>%
  # Within each medium and time point, subtract the control value
  mutate(corrected.values = value - value[bacterium == "None"]) %>%
  ungroup()
I am using str_detect() to extract bacterium and medium from condition. Then, for each time point and medium combination, the value measured without bacteria is subtracted from every row in the group.
This produces the following result, which seems to be what you are looking for:
# A tibble: 12 × 6
condition time value bacterium medium corrected.values
<fct> <dbl> <dbl> <chr> <chr> <dbl>
1 ecoli_RPMI 1 0.3 Ecoli RPMI 0.1
2 staph_RPMI 1 0.3 Staph RPMI 0.1
3 RPMI 1 0.2 None RPMI 0
4 ecoli_DMEM 1 0.4 Ecoli DMEM 0.3
5 staph_DMEM 1 0.4 Staph DMEM 0.3
6 DMEM 1 0.1 None DMEM 0
7 ecoli_RPMI 2 0.9 Ecoli RPMI 0.8
8 staph_RPMI 2 0.8 Staph RPMI 0.7
9 RPMI 2 0.1 None RPMI 0
10 ecoli_DMEM 2 0.7 Ecoli DMEM 0.5
11 staph_DMEM 2 0.8 Staph DMEM 0.6
12 DMEM 2 0.2 None DMEM 0
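Because the question mentions many more bacteria and media, writing one case_when() branch per name may become tedious. A minimal sketch of a pattern-based variant, assuming every condition is either a bare medium name or bacterium_medium with the medium after the last underscore; the helper column is_control and the shortened example data are introduced here for illustration only:

```r
library(dplyr)
library(stringr)

# Shortened example data: a single time point (assumption for brevity)
df <- data.frame(
  condition = c("ecoli_RPMI", "staph_RPMI", "RPMI",
                "ecoli_DMEM", "staph_DMEM", "DMEM"),
  time = 1,
  value = c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1)
)

corrected <- df %>%
  mutate(
    # Drop everything up to the last underscore; controls stay unchanged
    medium = str_remove(condition, "^.*_"),
    # Controls are assumed to be the rows without an underscore
    is_control = !str_detect(condition, "_")
  ) %>%
  group_by(medium, time) %>%
  mutate(corrected.values = value - value[is_control]) %>%
  ungroup()
```

This only works if each medium and time point has exactly one control row; with zero or several control rows the subtraction would fail or recycle unexpectedly.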
Answer 2
Score: 1
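Below are two approaches.
The first uses mutate() with .by and cur_group():
We first create a variable bacteria, which holds the medium suffix of each condition (control rows keep their own name). Then, within each time and bacteria group, we subtract the value where the condition equals the current group's key, cur_group()$bacteria, from the whole value column.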
library(tidyverse)
dat |>
  # Keep only the part after the last underscore, e.g. "ecoli_RPMI" -> "RPMI"
  mutate(bacteria = gsub("^.*_(\\w*$)", "\\1", condition),
         .after = "condition") |>
  mutate(corrected_values = value - value[condition == cur_group()$bacteria],
         .by = c("time", "bacteria")) |>
  ungroup()
#> condition bacteria time value corrected_values
#> 1 ecoli_RPMI RPMI 1 0.3 0.1
#> 2 staph_RPMI RPMI 1 0.3 0.1
#> 3 RPMI RPMI 1 0.2 0.0
#> 4 ecoli_DMEM DMEM 1 0.4 0.3
#> 5 staph_DMEM DMEM 1 0.4 0.3
#> 6 DMEM DMEM 1 0.1 0.0
#> 7 ecoli_RPMI RPMI 2 0.9 0.8
#> 8 staph_RPMI RPMI 2 0.8 0.7
#> 9 RPMI RPMI 2 0.1 0.0
#> 10 ecoli_DMEM DMEM 2 0.7 0.5
#> 11 staph_DMEM DMEM 2 0.8 0.6
#> 12 DMEM DMEM 2 0.2 0.0
<sup>Created on 2023-07-28 with reprex v2.0.2</sup>
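The gsub() call is what turns a condition into its medium suffix. A small base R sketch of that step on its own, assuming the replacement string is the backreference "\\1"; the vector conditions is made up for illustration:

```r
conditions <- c("ecoli_RPMI", "RPMI", "staph_DMEM")

# "^.*_" consumes everything up to the last underscore and the capture
# group keeps the rest; strings without an underscore do not match and
# are returned unchanged.
suffix <- gsub("^.*_(\\w*$)", "\\1", conditions)
suffix
#> [1] "RPMI" "RPMI" "DMEM"
```

Since control rows such as "RPMI" pass through unchanged, they end up in the same group as the bacteria grown in that medium.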
The second uses pivot_wider() |> mutate(across()) |> pivot_longer() |> left_join().
The trick lies in the across() statement, where we iterate over each column (except time), get() the column with the corresponding suffix, and subtract it from the current column col.
library(tidyverse)
dat |>
  pivot_wider(names_from = condition,
              values_from = value) |>
  # For each column, look up the control column named by its suffix
  mutate(across(!time,
                \(col) col - get(gsub("^.*_(\\w*$)", "\\1", cur_column()))
                )
         ) |>
  pivot_longer(cols = !time,
               names_to = "condition",
               values_to = "corrected_values") |>
  left_join(dat, by = c("time", "condition"))
#> # A tibble: 12 x 4
#> time condition corrected_values value
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 ecoli_RPMI 0.1 0.3
#> 2 1 staph_RPMI 0.1 0.3
#> 3 1 RPMI 0 0.2
#> 4 1 ecoli_DMEM 0.3 0.4
#> 5 1 staph_DMEM 0.3 0.4
#> 6 1 DMEM 0 0.1
#> 7 2 ecoli_RPMI 0.8 0.9
#> 8 2 staph_RPMI 0.7 0.8
#> 9 2 RPMI 0 0.1
#> 10 2 ecoli_DMEM 0.5 0.7
#> 11 2 staph_DMEM 0.6 0.8
#> 12 2 DMEM 0 0.2
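One caveat, stated tentatively: the across()/get() step appears to depend on each control column still holding its original values when the columns that use it are processed, which holds here because the controls come after their bacteria. A sketch of a variant that does not depend on column order, swapping in a join against a table of controls (this is not the approach above; controls, medium and baseline are names introduced here, and the data is shortened to one time point):

```r
library(dplyr)

dat_small <- data.frame(
  condition = c("ecoli_RPMI", "staph_RPMI", "RPMI",
                "ecoli_DMEM", "staph_DMEM", "DMEM"),
  time = 1,
  value = c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1)
)

# Control rows are assumed to be the ones without an underscore
controls <- dat_small |>
  filter(!grepl("_", condition)) |>
  select(medium = condition, time, baseline = value)

joined <- dat_small |>
  mutate(medium = sub("^.*_", "", condition)) |>
  left_join(controls, by = c("medium", "time")) |>
  mutate(corrected_values = value - baseline)
```

Rows whose medium has no control at that time point would get NA in baseline, which makes missing controls easy to spot.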
Data from the OP:
dat <- data.frame( condition = as.factor(c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM", "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM")), time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)), value = as.numeric(c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)))
<sup>Created on 2023-07-27 by the reprex package (v2.0.1)</sup>
Answer 3
Score: 1
You can use the following, which splits the condition into its constituent strain and medium, and then (per time and medium) subtracts the value of the row which contains no strain from all rows in the group:
library(dplyr)
library(tidyr)

# `data` is the data frame from the question
data |>
  separate(condition, c("strain", "medium"), "_", remove = FALSE, fill = "left") |>
  group_by(time, medium) |>
  mutate(corrected_value = value - value[is.na(strain)]) |>
  ungroup()
Yielding:
```none
# A tibble: 12 × 6
condition strain medium time value corrected_value
<fct> <chr> <chr> <dbl> <dbl> <dbl>
1 ecoli_RPMI ecoli RPMI 1 0.3 0.1
2 staph_RPMI staph RPMI 1 0.3 0.1
3 RPMI NA RPMI 1 0.2 0
4 ecoli_DMEM ecoli DMEM 1 0.4 0.3
5 staph_DMEM staph DMEM 1 0.4 0.3
6 DMEM NA DMEM 1 0.1 0
7 ecoli_RPMI ecoli RPMI 2 0.9 0.8
8 staph_RPMI staph RPMI 2 0.8 0.7
9 RPMI NA RPMI 2 0.1 0
10 ecoli_DMEM ecoli DMEM 2 0.7 0.5
11 staph_DMEM staph DMEM 2 0.8 0.6
12 DMEM NA DMEM 2 0.2 0
```
Alternatively, consider reshaping your data into a more semantically meaningful format:
data |>
  separate(condition, c("strain", "medium"), "_", fill = "left") |>
  pivot_wider(names_from = strain) |>
  rename(baseline = `NA`) |>
  pivot_longer(!c(medium, time, baseline), names_to = "strain") |>
  mutate(corrected_value = value - baseline)
# A tibble: 8 × 6
medium time baseline strain value corrected_value
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 RPMI 1 0.2 ecoli 0.3 0.1
2 RPMI 1 0.2 staph 0.3 0.1
3 DMEM 1 0.1 ecoli 0.4 0.3
4 DMEM 1 0.1 staph 0.4 0.3
5 RPMI 2 0.1 ecoli 0.9 0.8
6 RPMI 2 0.1 staph 0.8 0.7
7 DMEM 2 0.2 ecoli 0.7 0.5
8 DMEM 2 0.2 staph 0.8 0.6
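Both snippets rely on fill = "left" to handle the control rows, which have only one piece after splitting on "_". A minimal sketch of that step in isolation; the two-row data frame demo is made up for illustration:

```r
library(tidyr)

demo <- data.frame(condition = c("ecoli_RPMI", "RPMI"))

# With fill = "left", a condition that splits into a single piece gets NA
# in the leftmost new column, so the piece lands in `medium`.
parts <- separate(demo, condition, c("strain", "medium"),
                  sep = "_", remove = FALSE, fill = "left")
parts
```

Here the second row should come back with strain NA and medium "RPMI", which is what lets is.na(strain) pick out the control row in each group.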