How to filter a dataframe in R but also get the row before and after the filter
Question
My problem is that the values do not always match, so I also need to see the row immediately before and the row immediately after the filtered rows (basically one row of context around the filter), to make sure I'm not missing something. I tried lag and lead, and data.table's shift, but there is always some unnecessary output that I can't get rid of.
I have a file indicating genomic distance that looks like this:
chr | snp | pos | cM |
---|---|---|---|
chr8 | rs2003497 | 9432942 | 0.0 |
chr8 | rs1241241 | 9437099 | 0.0 |
chr8 | rs5262363 | 9440613 | 0.0 |
chr8 | rs5152525 | 94355216 | 0.0 |
chr8 | rs5135151 | 94371918 | 0.1 |
chr8 | rs5253252 | 94374354 | 0.1 |
chr8 | rs5135151 | 94392948 | 0.0 |
I want to filter based on chr and pos and output the slice.
filter(data, chr == chromosome & pos >= start_pos & pos <= end_pos)
Answer 1
Score: 1
Given that your data seems insufficient to reflect the issue adequately (as per the comments), here's some toy data and a generic solution to your problem (which, BTW, is a frequent task for which there is, surprisingly, not yet a neat function available):
Data:
df <- data.frame(
x = c("B","A","A","A","B","C","B", "A", "C", "A"),
y = c(1,2,1,3,1,2,1,5,1,2)
)
Task:
Suppose I need to filter all rows where x == "A" & y > 2, PLUS the immediately surrounding rows (above and below).
Solution:
The solution I propose involves writing a function that:
- gets the indices of the filtered rows PLUS those of the surrounding rows:
Function:
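row_sequence <- function(value1, value2) {
  # indices of the rows that satisfy the filter condition
  inds <- which(value1 == "A" & value2 > 2)
  # add the neighbouring rows, then drop any index outside 1..n
  out <- sort(unique(c(inds - 1, inds, inds + 1)))
  out[out >= 1 & out <= length(value1)]
}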
Now just pass the result of row_sequence to a call to slice:
Implementation:
library(dplyr)
df %>%
  slice(row_sequence(x, y))
This returns:
x y
1 A 1
2 A 3 # <- filtered
3 B 1
4 B 1
5 A 5 # <- filtered
6 C 1
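The same idea can be sketched more generally, so that the condition is not hard-coded inside the function. The helper name rows_with_context, its parameter n, and the example window (start_pos, end_pos) are assumptions made for illustration and do not come from the question or from dplyr; the data frame reuses the first four rows of the question's table. It uses base R indexing only, so it runs without any packages:

```r
# Hypothetical helper: indices of rows where `cond` is TRUE, plus `n` rows
# of context on either side, clipped to the valid row range.
rows_with_context <- function(cond, n = 1) {
  hits    <- which(cond)          # which() ignores NA, so NA counts as "no match"
  offsets <- seq(-n, n)
  # every hit combined with every offset
  idx <- rep(hits, each = length(offsets)) + rep(offsets, times = length(hits))
  idx <- sort(unique(idx))
  idx[idx >= 1 & idx <= length(cond)]
}

# First rows of the question's sample data
data <- data.frame(
  chr = "chr8",
  snp = c("rs2003497", "rs1241241", "rs5262363", "rs5152525"),
  pos = c(9432942, 9437099, 9440613, 94355216),
  cM  = 0
)

chromosome <- "chr8"
start_pos  <- 9437000   # assumed example window
end_pos    <- 9438000

keep <- rows_with_context(
  data$chr == chromosome & data$pos >= start_pos & data$pos <= end_pos
)
data[keep, ]   # the matching row plus its two neighbours
```

With dplyr loaded, data %>% slice(keep) should give the same rows. If the file contains several chromosomes, the neighbouring row may belong to a different chromosome, so it may be worth subsetting by chr first.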
Answer 2
Score: 0
You could first add a couple of filtering markers to your dataset: here filter_mark identifies the filtered records and window_mark the leading/lagging rows. During the actual subsetting you can then either include or exclude those extra rows:
library(dplyr)
chromosome <- "chr8"
start_pos <- 166818 + 1
end_pos <- 181076 - 1
df_ <- df_ %>%
  mutate(filter_mark = chr == chromosome & pos >= start_pos & pos <= end_pos,
         window_mark = lag(filter_mark) | lead(filter_mark))
df_ %>% filter(filter_mark | window_mark)
#> # A tibble: 3 × 6
#> chr snp pos cM filter_mark window_mark
#> <chr> <chr> <dbl> <lgl> <lgl> <lgl>
#> 1 chr8 rs2003497 166818 NA FALSE TRUE
#> 2 chr8 rs10488368 180568 NA TRUE FALSE
#> 3 chr8 rs10488369 181076 NA FALSE TRUE
Input data:
df_ <- readr::read_table("
chr snp pos cM
chr8 rs2003497 166818 NA
chr8 rs10488368 180568 NA
chr8 rs10488369 181076 NA")
Created on 2023-05-28 with reprex v2.0.2
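One detail of the marker approach is that lag() and lead() return NA at the first and last row, so window_mark can be NA there (filter() treats NA as FALSE, so the result is still usable). A minimal variation, assuming you would rather have plain TRUE/FALSE markers, passes default = FALSE and also excludes the matched rows themselves from window_mark. The object name marked is an assumption for illustration; the data mirrors the input above without the cM column:

```r
library(dplyr)

df_ <- data.frame(
  chr = "chr8",
  snp = c("rs2003497", "rs10488368", "rs10488369"),
  pos = c(166818, 180568, 181076)
)

marked <- df_ %>%
  mutate(
    filter_mark = chr == "chr8" & pos >= 166819 & pos <= 181075,
    # context rows only: next to a match, but not a match themselves
    window_mark = !filter_mark &
      (lag(filter_mark, default = FALSE) | lead(filter_mark, default = FALSE))
  )

marked %>% filter(filter_mark)                 # matches only
marked %>% filter(filter_mark | window_mark)   # matches plus one row of context
```

Keeping both columns makes it easy to tell afterwards which rows matched and which were added only as context.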