How to replace an entire row between two rows based on a column
Question
I am dealing with a very large mRNA splicing dataset. Here is a toy dataset to exemplify the problem:
test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)
> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>
For every unique value in the gene_id column, I would like to replace an entire row if it lies between the "Upstream" and "Downstream" values in the exon_identity column, i.e. replace row 3 with row 5. What makes it difficult for me is that some genes in the gene_id column, e.g. "B", do not have a row that needs to be replaced.
This question goes in the direction of previously asked questions here and here.
Based on those and other resources, I have tried:
library(tidyverse)
test_replace <- test_df %>%
  group_by(gene_id) %>%
  mutate(
    start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
    end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
    exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
  )
Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>
1     2     8 A       NA
2     9    12 A       Upstream
3    NA    NA A       Event
4    19    24 A       Downstream
5    13    16 A       Event
6    20    24 B       NA
7    25    30 B       Upstream
8    35    38 B       Downstream
9    39    45 B       NA
Desired output:
> desired_outcome
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>
A solution, preferably using the tidyverse, would be highly appreciated.
Thank you!
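A note on the warning in the attempt above: it most likely comes from subsetting with a logical vector that contains NA. The sketch below uses only base R and the group "A" values from the toy data; it is an illustration of the mechanism, not a full fix.

```r
# Values of group "A" from the toy dataset
start <- c(2, 9, 13, 19, 13)
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")

# Comparing NA with a string gives NA, and an NA index returns an NA element,
# so this subset has one element per NA plus the real match.
values_with_na <- start[exon_identity == "Event"]
length(values_with_na)

# replace() then has one position to fill but several values to choose from,
# which triggers the "not a multiple of replacement length" warning and
# uses the first value (an NA here).

# which() drops the NAs, leaving only the matching row.
values_clean <- start[which(exon_identity == "Event")]
values_clean
```

Wrapping the condition in which() (or testing with %in% instead of ==) would remove the NA entries from the replacement vector, although the duplicated "Event" row would still need to be dropped afterwards.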
Answer 1
Score: 2
In the toy example, reordering your data set gives you almost all of what you want. Will that work in the real data set? E.g.
library(tidyverse)
test_df |>
  mutate(
    sandwich = lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream')
  ) |>
  replace_na(list(sandwich = FALSE)) |>
  group_by(gene_id) |>
  arrange(start) |>
  ungroup() |>
  filter(!sandwich) |>
  select(-sandwich)
(In the toy example, group_by and ungroup are not needed. I added them in case they are needed/useful in the real data set.)
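One caveat for real data: arrange() ignores grouping unless .by_group = TRUE is set, and the lag/lead flags above are computed across gene boundaries. A minimal sketch of a per-gene variant follows. It keeps the same assumption as this answer, namely that the "Event" row's start falls between the upstream and downstream exons; the name result is only for illustration.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

result <- test_df |>
  group_by(gene_id) |>
  mutate(
    # Flag rows whose neighbours within the same gene are Upstream and Downstream;
    # coalesce() turns the NA comparisons into FALSE.
    sandwich = coalesce(
      lag(exon_identity) == "Upstream" & lead(exon_identity) == "Downstream",
      FALSE
    )
  ) |>
  filter(!sandwich) |>
  # Sort by start within each gene, so the Event row moves into the gap
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  select(-sandwich)

result
```

On the toy data this leaves eight rows, with the "Event" row in third position for gene "A" and gene "B" unchanged.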
Answer 2
Score: 0
If @MelissaKey is right about the structure of your actual data, their solution will work nicely. Otherwise, here's a function that does the job along with group_modify():
library(dplyr)
library(tidyr)
replace_rows <- function(x, ...) {
  is_bad <- replace_na(
    lag(x$exon_identity) == "Upstream" & lead(x$exon_identity) == "Downstream",
    FALSE
  )
  if (any(is_bad)) {
    is_event <- replace_na(x$exon_identity == "Event", FALSE)
    x <- x %>%
      filter(!is_bad, !is_event) %>%
      add_row(
        filter(x, is_event),
        .before = which(is_bad)
      )
  }
  x
}
test_df %>%
  group_by(gene_id) %>%
  group_modify(replace_rows) %>%
  ungroup()
# A tibble: 8 × 4
  gene_id start   end exon_identity
  <chr>   <dbl> <dbl> <chr>
1 A           2     8 <NA>
2 A           9    12 Upstream
3 A          13    16 Event
4 A          19    24 Downstream
5 B          20    24 <NA>
6 B          25    30 Upstream
7 B          35    38 Downstream
8 B          39    45 <NA>
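As written, replace_rows() appears to assume at most one sandwiched row and at most one "Event" row per gene, since .before is given a single position. Before running it on a large dataset, it may be worth counting both per gene. A minimal sketch, where counts, n_event and n_sandwiched are illustrative names:

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

counts <- test_df %>%
  group_by(gene_id) %>%
  summarise(
    # Number of Event rows in the gene (NA comparisons are ignored)
    n_event = sum(exon_identity == "Event", na.rm = TRUE),
    # Number of rows directly between an Upstream and a Downstream row
    n_sandwiched = sum(
      lag(exon_identity) == "Upstream" & lead(exon_identity) == "Downstream",
      na.rm = TRUE
    )
  )

counts
```

Genes where either count exceeds 1, or where the two counts differ, would need separate handling.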