How to replace an entire row between two rows based on a column

Question

I am dealing with a very large mRNA splicing dataset. Here is a toy dataset that illustrates the problem:

    test_df <- data.frame(
      start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
      end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
      gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
      exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
    )

    > test_df
      start end gene_id exon_identity
    1     2   8       A          <NA>
    2     9  12       A      Upstream
    3    13  18       A          <NA>
    4    19  24       A    Downstream
    5    13  16       A         Event
    6    20  24       B          <NA>
    7    25  30       B      Upstream
    8    35  38       B    Downstream
    9    39  45       B          <NA>

For every unique value in the gene_id column, I would like to replace an entire row if it sits between the "Upstream" and "Downstream" rows in the exon_identity column, i.e. replace row 3 with row 5 (the "Event" row). What makes this difficult for me is that some genes have no row that needs to be replaced, e.g. gene "B".

This question goes in the direction of previously asked questions here and here.

Based on those and other resources, I have tried:

    library(tidyverse)

    test_replace <- test_df %>%
      group_by(gene_id) %>%
      mutate(
        start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
        end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
        exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
      )

    Warning message:
    There were 2 warnings in `mutate()`.
    The first warning was:
    In argument: `start = replace(...)`.
    In group 1: `gene_id = "A"`.
    Caused by warning in `x[list] <- values`:
    ! number of items to replace is not a multiple of replacement length
    Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

    > test_replace
    # A tibble: 9 × 4
    # Groups:   gene_id [2]
      start   end gene_id exon_identity
      <dbl> <dbl> <chr>   <chr>
    1     2     8 A       NA
    2     9    12 A       Upstream
    3    NA    NA A       Event
    4    19    24 A       Downstream
    5    13    16 A       Event
    6    20    24 B       NA
    7    25    30 B       Upstream
    8    35    38 B       Downstream
    9    39    45 B       NA
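
A likely cause of the warning and of the NA values in row 3 can be reproduced in base R. This is a minimal sketch that uses only the gene "A" values from the toy data. The variable names `by_logical`, `by_which`, and `attempt` are introduced here for illustration. Indexing with a logical vector that contains NA returns an NA element for each NA in the index, so `start[exon_identity == "Event"]` has three elements rather than one.

```r
# Gene "A" rows from the toy data set.
start <- c(2, 9, 13, 19, 13)
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")

# The comparison yields NA wherever exon_identity is NA, and a logical
# index containing NA returns NA for those positions.
by_logical <- start[exon_identity == "Event"]

# which() drops NA, so only the position of the "Event" row is kept.
by_which <- start[which(exon_identity == "Event")]

print(by_logical)  # NA NA 13
print(by_which)    # 13

# Replacing one position with three values triggers the warning seen above,
# and only the first value (NA) is used.
attempt <- suppressWarnings(replace(start, 3, by_logical))
print(attempt)     # 2 9 NA 19 13
```

Under that reading, wrapping the "Event" lookup in `which()` would remove the stray NA values, though groups with no "Event" row would still need separate handling.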

Desired output:

    > desired_outcome
      start end gene_id exon_identity
    1     2   8       A          <NA>
    2     9  12       A      Upstream
    3    13  16       A         Event
    4    19  24       A    Downstream
    5    20  24       B          <NA>
    6    25  30       B      Upstream
    7    35  38       B    Downstream
    8    39  45       B          <NA>

A solution, preferably using the tidyverse, would be highly appreciated.

Thank you!

Answer 1

Score: 2

In the toy example, reordering your data set gives you almost all of what you want. Will that work in the real data set? E.g.

    library(tidyverse)

    test_df |>
      mutate(
        sandwich = lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream')
      ) |>
      replace_na(list(sandwich = FALSE)) |>
      group_by(gene_id) |>
      arrange(start) |>
      ungroup() |>
      filter(!sandwich) |>
      select(-sandwich)

(In the toy example, group_by and ungroup are not needed. I added them in case they are needed or useful in the real data set.)
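
One caveat for real data: `arrange()` ignores grouping unless `.by_group = TRUE` is set, so a global sort by start only works when genes do not overlap in their coordinates. The variant below is a sketch under that concern, not the answerer's code. It computes the flag within each gene and sorts within each gene. The name `result` and the use of `coalesce()` in place of `replace_na()` are choices made here, and it still assumes the "Event" row's start falls between the "Upstream" and "Downstream" starts.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event",
                    NA, "Upstream", "Downstream", NA)
)

result <- test_df |>
  group_by(gene_id) |>
  mutate(
    # TRUE only for a row directly between "Upstream" and "Downstream"
    # within the same gene; NA comparisons are treated as FALSE.
    sandwich = coalesce(
      lag(exon_identity == "Upstream") & lead(exon_identity == "Downstream"),
      FALSE
    )
  ) |>
  filter(!sandwich) |>
  # Sort by start inside each gene rather than across the whole table.
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  select(-sandwich)

print(result)
```

Note that this drops a sandwiched row even when the gene has no "Event" row to take its place, so it is worth checking whether that case occurs in the real data.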

Answer 2

Score: 0

If @MelissaKey is right about the structure of your actual data, their solution will work nicely. Otherwise, here’s a function that does the job along with group_modify():

    library(dplyr)
    library(tidyr)

    replace_rows <- function(x, ...) {
      is_bad <- replace_na(
        lag(x$exon_identity) == "Upstream" & lead(x$exon_identity) == "Downstream",
        FALSE
      )
      if (any(is_bad)) {
        is_event <- replace_na(x$exon_identity == "Event", FALSE)
        x <- x %>%
          filter(!is_bad, !is_event) %>%
          add_row(
            filter(x, is_event),
            .before = which(is_bad)
          )
      }
      x
    }

    test_df %>%
      group_by(gene_id) %>%
      group_modify(replace_rows) %>%
      ungroup()

    # A tibble: 8 × 4
      gene_id start   end exon_identity
      <chr>   <dbl> <dbl> <chr>
    1 A           2     8 <NA>
    2 A           9    12 Upstream
    3 A          13    16 Event
    4 A          19    24 Downstream
    5 B          20    24 <NA>
    6 B          25    30 Upstream
    7 B          35    38 Downstream
    8 B          39    45 <NA>
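
To see what a function such as replace_rows receives, the sketch below passes a small reporting function to `group_modify()`. The helper `describe_group` and the name `summary_df` are hypothetical and exist only for this illustration. By default, each call gets the rows of one group without the grouping column, and `group_modify()` puts that column back in front of the result, which would explain why gene_id appears first in the output above.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event",
                    NA, "Upstream", "Downstream", NA)
)

# Hypothetical helper, for illustration only: report the number of rows
# and the column names that each per-group call receives.
describe_group <- function(x, key) {
  tibble(
    n_rows = nrow(x),
    columns = paste(names(x), collapse = ", ")
  )
}

summary_df <- test_df %>%
  group_by(gene_id) %>%
  group_modify(describe_group) %>%
  ungroup()

print(summary_df)
```

Each gene yields one summary row here. gene_id is absent from the reported column names but present in the final result.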

  • Posted on June 13, 2023, 04:35:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460150.html