Performing a mutate function incorporating character strings

Question

This is my data frame:

df <- data.frame(
  condition = as.factor(c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM", "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM")),
  time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
  value = as.numeric(c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)))

I have many more conditions, each naming a bacterium (e.g. ecoli or staph) followed by the medium it was grown in, so the conditions are written like "ecoli_RPMI", "staph_RPMI", "ecoli_DMEM", "staph_DMEM". I have multiple time points (50 or so) and multiple bacteria and media.

I also have conditions for just the media controls, i.e. "RPMI" and "DMEM", which also have corresponding values across multiple time points.

I am trying to subtract the "value" of the media control (i.e. the value on the "RPMI" row) from the values of all bacteria with the matching suffix (x_RPMI, i.e. "ecoli_RPMI" and "staph_RPMI") at the same time point, and assign the results to a new column named "corrected.values". For example: value for ecoli_RPMI - value for RPMI.

The desired result would look like this:

data.frame(
  condition = as.factor(c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM", "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM")),
  time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
  value = as.numeric(c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)),
  corrected_value = as.numeric(c(0.1, 0.1, 0, 0.3, 0.3, 0, 0.8, 0.7, 0, 0.5, 0.6, 0)))

I have tried all sorts:

  1. Grouping with group_by() and then using mutate() with case_when(), a bit like this:

    df %>%
      group_by(time) %>%
      mutate(corrected_value = case_when(
        condition == "ecoli_RPMI" ~ value - value[condition == "RPMI"],
        # ... and so on, one branch per condition
      ))

    writing out all the possibilities, but this doesn't seem to work. I wondered whether it is possible to use a string argument to simplify this, as the strings are consistent across all of the conditions, i.e. always "bacteria"_"media".

  2. I also tried pivot_wider() but didn't have any luck.

Many thanks for your help!

Answer 1

Score: 1

I think the simplest solution would be to split your condition variable into two variables (bacterium, medium) and then perform the subtraction on grouped data. You could do this:

library(dplyr)
library(stringr)

data.frame(condition = as.factor(
  c(
    "ecoli_RPMI",
    "staph_RPMI",
    "RPMI",
    "ecoli_DMEM",
    "staph_DMEM",
    "DMEM",
    "ecoli_RPMI",
    "staph_RPMI",
    "RPMI",
    "ecoli_DMEM",
    "staph_DMEM",
    "DMEM"
  )
),
time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
value = as.numeric(c(
  0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2
))) %>%
  mutate(
    # Label each row with its bacterium; control rows get "None"
    bacterium = case_when(
      str_detect(condition, "ecoli") ~ "Ecoli",
      str_detect(condition, "staph") ~ "Staph",
      TRUE ~ "None"
    ),
    # Label each row with its medium
    medium = case_when(
      str_detect(condition, "RPMI") ~ "RPMI",
      str_detect(condition, "DMEM") ~ "DMEM",
      TRUE ~ "None"
    )
  ) %>%
  group_by(medium, time) %>%
  # Within each medium and time point, subtract the control value
  mutate(corrected.values = value - value[bacterium == "None"]) %>%
  ungroup()

I am using str_detect() to extract bacterium and medium from condition. Then, for each combination of time point and medium, you subtract the control value (the row without bacteria) from every value in the group.

This produces the following result, which seems to be what you are looking for:

# A tibble: 12 × 6
   condition   time value bacterium medium corrected.values
   <fct>      <dbl> <dbl> <chr>     <chr>             <dbl>
 1 ecoli_RPMI     1   0.3 Ecoli     RPMI                0.1
 2 staph_RPMI     1   0.3 Staph     RPMI                0.1
 3 RPMI           1   0.2 None      RPMI                0  
 4 ecoli_DMEM     1   0.4 Ecoli     DMEM                0.3
 5 staph_DMEM     1   0.4 Staph     DMEM                0.3
 6 DMEM           1   0.1 None      DMEM                0  
 7 ecoli_RPMI     2   0.9 Ecoli     RPMI                0.8
 8 staph_RPMI     2   0.8 Staph     RPMI                0.7
 9 RPMI           2   0.1 None      RPMI                0  
10 ecoli_DMEM     2   0.7 Ecoli     DMEM                0.5
11 staph_DMEM     2   0.8 Staph     DMEM                0.6
12 DMEM           2   0.2 None      DMEM                0  
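
The case_when() branches above name each bacterium and medium explicitly, which gets long with many conditions. Below is a minimal sketch of a more general variant. It assumes every condition is either "bacterium_medium" or a bare medium name, and that each medium has exactly one control row per time point. The helper column is_control is a hypothetical name, not part of the original answer.

```r
library(dplyr)

dat <- data.frame(
  condition = c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM",
                "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM"),
  time  = rep(c(1, 2), each = 6),
  value = c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)
)

corrected <- dat %>%
  mutate(
    # Strip everything up to the last underscore; bare media names stay unchanged
    medium = sub("^.*_", "", condition),
    # Assumption: control rows are the ones without an underscore
    is_control = !grepl("_", condition)
  ) %>%
  group_by(medium, time) %>%
  # Subtract the single control value of each medium/time group
  mutate(corrected.values = value - value[is_control]) %>%
  ungroup()

print(corrected)
```

No bacterium or medium is hard-coded here, so new strains or media need no code changes as long as the naming scheme holds.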

Answer 2

Score: 1

Below are two approaches:

The first uses mutate() with the .by argument:

We first create a variable bacteria, which despite its name holds the medium suffix of each condition (e.g. "RPMI"). Then, within each group, we subtract the value of the row whose condition equals the current group key cur_group()$bacteria (the control row) from the whole value column.

library(tidyverse)

dat |> 
  mutate(bacteria = gsub("^.*_(\\w*$)", "\\1", condition),
         .after = "condition") |> 
  mutate(corrected_values = value - value[condition == cur_group()$bacteria],
         .by = c("time", "bacteria")) |>
  ungroup()

#>     condition bacteria time value corrected_values
#> 1  ecoli_RPMI     RPMI    1   0.3              0.1
#> 2  staph_RPMI     RPMI    1   0.3              0.1
#> 3        RPMI     RPMI    1   0.2              0.0
#> 4  ecoli_DMEM     DMEM    1   0.4              0.3
#> 5  staph_DMEM     DMEM    1   0.4              0.3
#> 6        DMEM     DMEM    1   0.1              0.0
#> 7  ecoli_RPMI     RPMI    2   0.9              0.8
#> 8  staph_RPMI     RPMI    2   0.8              0.7
#> 9        RPMI     RPMI    2   0.1              0.0
#> 10 ecoli_DMEM     DMEM    2   0.7              0.5
#> 11 staph_DMEM     DMEM    2   0.8              0.6
#> 12       DMEM     DMEM    2   0.2              0.0

Created on 2023-07-28 with reprex v2.0.2
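
The gsub() pattern in the first approach can be tried on its own. Below is a small base R sketch, assuming the replacement is the first capture group (\\1) and using a made-up vector named conditions for illustration:

```r
conditions <- c("ecoli_RPMI", "staph_DMEM", "RPMI", "DMEM")

# Capture everything after the last underscore and keep only that part.
# Strings without an underscore do not match and are returned unchanged.
suffix <- gsub("^.*_(\\w*$)", "\\1", conditions)
print(suffix)
#> [1] "RPMI" "DMEM" "RPMI" "DMEM"

# Control rows are exactly those whose condition equals its own suffix
is_control <- conditions == suffix
print(is_control)
#> [1] FALSE FALSE  TRUE  TRUE
```

This is why comparing condition with the group key picks out the control row: only the bare medium name survives the substitution unchanged.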

The second uses pivot_wider() |> mutate(across()) |> pivot_longer() |> left_join().

The trick lies in the across() statement, where we iterate over each column (except time), look up the column named after the current column's suffix with get(), and subtract it from the current column col.

library(tidyverse)

dat |> 
  pivot_wider(names_from = condition,
              values_from = value) |> 
  mutate(across(! time,
                \(col) col - get(gsub("^.*_(\\w*$)", "\\1", cur_column()))
                )
         ) |> 
  pivot_longer(cols = !time,
               names_to = "condition",
               values_to = "corrected_values") |> 
  left_join(dat, by = c("time", "condition"))

#> # A tibble: 12 x 4
#>     time condition  corrected_values value
#>    <dbl> <chr>                 <dbl> <dbl>
#>  1     1 ecoli_RPMI              0.1   0.3
#>  2     1 staph_RPMI              0.1   0.3
#>  3     1 RPMI                    0     0.2
#>  4     1 ecoli_DMEM              0.3   0.4
#>  5     1 staph_DMEM              0.3   0.4
#>  6     1 DMEM                    0     0.1
#>  7     2 ecoli_RPMI              0.8   0.9
#>  8     2 staph_RPMI              0.7   0.8
#>  9     2 RPMI                    0     0.1
#> 10     2 ecoli_DMEM              0.5   0.7
#> 11     2 staph_DMEM              0.6   0.8
#> 12     2 DMEM                    0     0.2

Data from OP

dat <- data.frame(
  condition = as.factor(c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM", "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM")),
  time = as.numeric(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)),
  value = as.numeric(c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2))
)

Created on 2023-07-27 by the reprex package (v2.0.1)

Answer 3

Score: 1

You can use the following, which splits the condition into its constituent strain and medium, and then (per time point and medium) subtracts the value of the row that contains no strain from those that do:

library(tidyverse)

# `data` is the data frame from the question
data |>
  separate(condition, c("strain", "medium"), "_", remove = FALSE, fill = "left") |>
  group_by(time, medium) |>
  mutate(corrected_value = value - value[is.na(strain)]) |>
  ungroup()

Yielding:

# A tibble: 12 × 6
   condition  strain medium  time value corrected_value
   <fct>      <chr>  <chr>  <dbl> <dbl>           <dbl>
 1 ecoli_RPMI ecoli  RPMI       1   0.3             0.1
 2 staph_RPMI staph  RPMI       1   0.3             0.1
 3 RPMI       NA     RPMI       1   0.2             0
 4 ecoli_DMEM ecoli  DMEM       1   0.4             0.3
 5 staph_DMEM staph  DMEM       1   0.4             0.3
 6 DMEM       NA     DMEM       1   0.1             0
 7 ecoli_RPMI ecoli  RPMI       2   0.9             0.8
 8 staph_RPMI staph  RPMI       2   0.8             0.7
 9 RPMI       NA     RPMI       2   0.1             0
10 ecoli_DMEM ecoli  DMEM       2   0.7             0.5
11 staph_DMEM staph  DMEM       2   0.8             0.6
12 DMEM       NA     DMEM       2   0.2             0
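
The role of fill = "left" is easiest to see on a tiny input. Below is a minimal sketch using a two-row toy data frame (toy and split_toy are illustrative names, not from the answer): a condition with only one piece is treated as the medium, and the missing strain is padded with NA on the left.

```r
library(tidyr)

toy <- data.frame(condition = c("ecoli_RPMI", "RPMI"), value = c(0.3, 0.2))

# "RPMI" has a single piece, so fill = "left" puts NA into `strain`
# and keeps "RPMI" in `medium`
split_toy <- separate(toy, condition, c("strain", "medium"), sep = "_",
                      remove = FALSE, fill = "left")
print(split_toy)
```

The NA in strain is what is.na(strain) later uses to pick out the control row.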

Alternatively, consider reshaping your data into a more semantically meaningful format:

data |>
  separate(condition, c("strain", "medium"), "_", fill = "left") |>
  pivot_wider(names_from = strain) |>
  rename(baseline = `NA`) |>
  pivot_longer(! c(medium, time, baseline), names_to = "strain") |>
  mutate(corrected_value = value - baseline)

# A tibble: 8 × 6
  medium  time baseline strain value corrected_value
  <chr>  <dbl>    <dbl> <chr>  <dbl>           <dbl>
1 RPMI       1      0.2 ecoli    0.3             0.1
2 RPMI       1      0.2 staph    0.3             0.1
3 DMEM       1      0.1 ecoli    0.4             0.3
4 DMEM       1      0.1 staph    0.4             0.3
5 RPMI       2      0.1 ecoli    0.9             0.8
6 RPMI       2      0.1 staph    0.8             0.7
7 DMEM       2      0.2 ecoli    0.7             0.5
8 DMEM       2      0.2 staph    0.8             0.6
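
Indexing the baseline inside each group, as in value[is.na(strain)], assumes exactly one control row per medium and time point. If a control is missing, the subset has length zero and mutate() should stop with a size error. Below is a hedged sketch of a join-based variant that yields NA instead. It assumes the same "strain_medium" naming scheme; dat_missing, with_medium, controls and baseline are illustrative names.

```r
library(dplyr)

dat <- data.frame(
  condition = c("ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM",
                "ecoli_RPMI", "staph_RPMI", "RPMI", "ecoli_DMEM", "staph_DMEM", "DMEM"),
  time  = rep(c(1, 2), each = 6),
  value = c(0.3, 0.3, 0.2, 0.4, 0.4, 0.1, 0.9, 0.8, 0.1, 0.7, 0.8, 0.2)
)

# Drop one control row on purpose to show what happens without a baseline
dat_missing <- dat[!(dat$condition == "DMEM" & dat$time == 2), ]

# Medium = text after the last underscore (bare media names stay unchanged)
with_medium <- dat_missing |>
  mutate(medium = sub("^.*_", "", condition))

# One baseline per medium and time point, taken from the control rows
controls <- with_medium |>
  filter(condition == medium) |>
  select(medium, time, baseline = value)

# Rows without a matching control get baseline = NA, hence corrected_value = NA
corrected <- with_medium |>
  left_join(controls, by = c("medium", "time")) |>
  mutate(corrected_value = value - baseline)

print(corrected)
```

Whether NA or an error is preferable depends on the data; with around 50 time points, an NA at least shows which time points lack a control.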

Posted by huangapple on 2023-07-27 18:18:30
Link to this article: https://go.coder-hub.com/76778754.html