How to replace an entire row between two rows based on a column

Question

I am dealing with a very large mRNA splicing dataset. Here is a toy dataset to exemplify the problem:

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>

For every unique value in the gene_id column, I would like to replace an entire row if it lies between the "Upstream" and "Downstream" rows in the exon_identity column, i.e. replace row 3 with row 5. What makes it difficult for me is that some genes in the gene_id column do not have a row that needs to be replaced, e.g. "B".

This question goes in the direction of previously asked questions here and here.

Based on those and other resources, I have tried:

library(tidyverse)

test_replace <- test_df %>%
  group_by(gene_id) %>%
  mutate(start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
         end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
         exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
         )

Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>
1     2     8 A       NA
2     9    12 A       Upstream
3    NA    NA A       Event
4    19    24 A       Downstream
5    13    16 A       Event
6    20    24 B       NA
7    25    30 B       Upstream
8    35    38 B       Downstream
9    39    45 B       NA

Desired output:

> desired_outcome
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>

A solution, preferably using the tidyverse package, would be highly appreciated.

Thank you!

Answer 1

Score: 2

In the toy example, reordering your data set gives you almost all of what you want. Will that work in the real data set? E.g.

library(tidyverse)
test_df |>
  mutate(
    # flag a row sitting directly between an Upstream and a Downstream row
    sandwich = lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream')
  ) |>
  replace_na(list(sandwich = FALSE)) |>
  group_by(gene_id) |>
  # arrange() ignores groups unless .by_group = TRUE is set
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  filter(!sandwich) |>
  select(-sandwich)

(In the toy example, group_by and ungroup are not needed. I added them in case it was needed/useful in the real data set.)
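
A per-gene variant of the same idea, as a minimal sketch rather than a tested solution for the real data. It assumes each gene has at most one "Upstream" and one "Downstream" row, and that the "Event" row's start falls between them so that sorting puts it in the right place. The flag is computed after group_by() so that lag() and lead() never look across gene boundaries; coalesce() is used here in place of replace_na() to turn NA flags into FALSE.

```r
library(dplyr)

# Toy data from the question
test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event",
                    NA, "Upstream", "Downstream", NA)
)

result <- test_df |>
  group_by(gene_id) |>
  mutate(
    # TRUE only for a row whose neighbours within the same gene are
    # Upstream (previous row) and Downstream (next row); NA becomes FALSE
    sandwich = coalesce(
      lag(exon_identity) == "Upstream" & lead(exon_identity) == "Downstream",
      FALSE
    )
  ) |>
  # drop the sandwiched row, then let the Event row sort into its place
  filter(!sandwich) |>
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  select(-sandwich)

print(result)
```

If the Event row's coordinates do not sort between the Upstream and Downstream rows in the real data, this approach would misplace it, and an explicit insertion as in the next answer is the safer choice.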

Answer 2

Score: 0

If @MelissaKey is right about the structure of your actual data, their solution will work nicely. Otherwise, here’s a function that does the job along with group_modify():

library(dplyr)
library(tidyr)

replace_rows <- function(x, ...) {
  # rows sitting directly between an Upstream and a Downstream row
  is_bad <- replace_na(
    lag(x$exon_identity) == "Upstream" & lead(x$exon_identity) == "Downstream",
    FALSE
  )
  if (any(is_bad)) {
    is_event <- replace_na(x$exon_identity == "Event", FALSE)
    # drop the sandwiched row and the Event row, then re-insert the
    # Event row at the position the sandwiched row occupied
    x <- x %>%
      filter(!is_bad, !is_event) %>%
      add_row(
        filter(x, is_event),
        .before = which(is_bad)
      )
  }
  x
}

test_df %>%
  group_by(gene_id) %>%
  group_modify(replace_rows) %>%
  ungroup()
# A tibble: 8 × 4
  gene_id start   end exon_identity
  <chr>   <dbl> <dbl> <chr>
1 A           2     8 <NA>
2 A           9    12 Upstream
3 A          13    16 Event
4 A          19    24 Downstream
5 B          20    24 <NA>
6 B          25    30 Upstream
7 B          35    38 Downstream
8 B          39    45 <NA>
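
As a side note on the warning in the question, here is a likely explanation, sketched in base R on gene A's values only. Subsetting with a logical index that contains NA returns an NA element for each NA, so start[exon_identity == "Event"] has three elements instead of one. replace() then uses the first of them, which is NA, matching the NA seen in row 3 of the attempted output. Wrapping the comparison in which() drops the NAs. The variable names below are hypothetical and chosen only for illustration.

```r
# Values for gene "A" from the toy data
start <- c(2, 9, 13, 19, 13)
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")

# The comparison is NA wherever exon_identity is NA
is_event <- exon_identity == "Event"

# An NA in a logical index yields an NA element in the result, so this
# vector is longer than the single value that was intended
na_polluted <- start[is_event]

# which() keeps only the positions that are TRUE, dropping the NAs
event_start <- start[which(is_event)]

# Positions strictly between the Upstream and Downstream rows
pos <- seq_along(start)
between <- pos > which(exon_identity == "Upstream") &
  pos < which(exon_identity == "Downstream")

# With a length-one replacement value, replace() has nothing to warn about
fixed <- replace(start, between, event_start)
print(fixed)
```

This only removes the warning for start; the same change would be needed for end, and the original Event row would still have to be dropped afterwards. A gene that has a row between Upstream and Downstream but no Event row would also need separate handling, since the replacement value would then be empty.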

huangapple
  • Posted on June 13, 2023, 04:35:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460150.html