How to filter a dataframe in R but also get the row before and after the filter

Question

I have a file indicating genomic distance that looks like this:

chr   snp        pos       cM
chr8  rs2003497  9432942   0.0
chr8  rs1241241  9437099   0.0
chr8  rs5262363  9440613   0.0
chr8  rs5152525  94355216  0.0
chr8  rs5135151  94371918  0.1
chr8  rs5253252  94374354  0.1
chr8  rs5135151  94392948  0.0

I want to filter based on chr and pos and output the slice:

filter(data, chr == chromosome & pos >= start_pos & pos <= end_pos)

Now my problem is that the values do not always match, so I also need to see the rows immediately before and after the filtered rows (basically one row on either side of the filter) to make sure I'm not missing anything. I tried lag and lead and data.table's shift, but there is always some unnecessary output that I can't get rid of.

Answer 1

Score: 1

Given that your data seems insufficient to reflect the issue adequately (as per the comments), here's some toy data and a generic solution to your problem (which, BTW, is a frequent task for which there is, surprisingly, not yet a neat function available):

Data:

df <- data.frame(
  x = c("B", "A", "A", "A", "B", "C", "B", "A", "C", "A"),
  y = c(1, 2, 1, 3, 1, 2, 1, 5, 1, 2)
)

Task:

Suppose I need to filter all rows where x == "A" & y > 2 PLUS the immediately surrounding rows (above and below).

Solution:

The solution I propose involves writing a function that:

  • gets the indices of the filtered rows PLUS those of the surrounding rows:

Function:

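row_sequence <- function(value1, value2) {
  # positions of the rows that satisfy the filter condition
  inds <- which(value1 == "A" & value2 > 2)
  # add the row above and the row below each hit, without duplicates
  sort(unique(c(inds - 1, inds, inds + 1)))
}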

Now just pass the function row_sequence to a call to slice:

Implementation:

library(dplyr)
df %>%
  slice(row_sequence(x, y))

This returns the following result:

  x y
1 A 1
2 A 3   # <- filtered
3 B 1
4 B 1
5 A 5   # <- filtered
6 C 1
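
The condition inside row_sequence is hard-coded and the window is fixed at one row. A more general variant might look like the minimal base R sketch below. The names rows_with_context and n are hypothetical and do not come from dplyr or any other package. The helper takes any logical vector, widens each hit by n rows on either side, and drops indices that fall outside the data.

```r
# Hypothetical helper: indices of the rows where `cond` is TRUE,
# plus `n` rows before and after each of them.
rows_with_context <- function(cond, n = 1) {
  hits <- which(cond)  # which() ignores NA values in `cond`
  # every hit combined with every offset from -n to n
  idx <- rep(hits, each = 2 * n + 1) + rep(-n:n, times = length(hits))
  # keep only indices that exist in the data
  idx <- idx[idx >= 1 & idx <= length(cond)]
  sort(unique(idx))
}

df <- data.frame(
  x = c("B", "A", "A", "A", "B", "C", "B", "A", "C", "A"),
  y = c(1, 2, 1, 3, 1, 2, 1, 5, 1, 2)
)

idx <- rows_with_context(df$x == "A" & df$y > 2)
df[idx, ]  # rows 3, 4, 5, 7, 8, 9 of the toy data
```

The same index vector can also be passed to slice, for example slice(df, idx).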

Answer 2

Score: 0

You could first add a couple of filtering markers to your dataset: here filter_mark identifies the filtered records and window_mark the rows immediately before and after them. During the actual subsetting you can then either include or exclude those extra rows:

library(dplyr)
chromosome <- "chr8"
start_pos  <- 166818 + 1
end_pos    <- 181076 - 1

df_ <- df_ %>% 
  mutate(filter_mark = chr == chromosome & pos >= start_pos & pos <= end_pos,
         window_mark = lag(filter_mark) | lead(filter_mark))

df_ %>% filter(filter_mark | window_mark)
#> # A tibble: 3 × 6
#>   chr   snp           pos cM    filter_mark window_mark
#>   <chr> <chr>       <dbl> <lgl> <lgl>       <lgl>      
#> 1 chr8  rs2003497  166818 NA    FALSE       TRUE       
#> 2 chr8  rs10488368 180568 NA    TRUE        FALSE      
#> 3 chr8  rs10488369 181076 NA    FALSE       TRUE

Input data:

df_ <- readr::read_table("
chr snp pos cM
chr8    rs2003497   166818  NA
chr8    rs10488368  180568  NA
chr8    rs10488369  181076  NA")

Created on 2023-05-28 with reprex v2.0.2
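
For the genomic data shown in the question, the same marker idea can be sketched in base R without dplyr. The start_pos and end_pos values below are made up for illustration, and the sketch assumes the rows are already sorted by position within a single chromosome. The shifted vectors are padded with FALSE rather than NA (the default padding of dplyr's lag and lead), which keeps the markers free of missing values.

```r
# The sample data from the question
data <- read.table(text = "
chr snp pos cM
chr8 rs2003497 9432942 0.0
chr8 rs1241241 9437099 0.0
chr8 rs5262363 9440613 0.0
chr8 rs5152525 94355216 0.0
chr8 rs5135151 94371918 0.1
chr8 rs5253252 94374354 0.1
chr8 rs5135151 94392948 0.0
", header = TRUE, stringsAsFactors = FALSE)

chromosome <- "chr8"
start_pos  <- 94360000  # hypothetical window, chosen for illustration
end_pos    <- 94380000

# rows inside the requested window
hit <- data$chr == chromosome & data$pos >= start_pos & data$pos <= end_pos

# shift the marker down and up by one row, padding with FALSE
prev_hit <- c(FALSE, head(hit, -1))  # TRUE where the previous row is a hit
next_hit <- c(tail(hit, -1), FALSE)  # TRUE where the next row is a hit

# flanking rows: not hits themselves, but adjacent to one
flank <- !hit & (prev_hit | next_hit)

keep   <- hit | flank
result <- cbind(data[keep, ], in_range = hit[keep])
result  # rows 4 to 7: one flanking row, two hits, one flanking row
```

The in_range column makes it easy to drop the flanking rows again later, for example with result[result$in_range, ].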

huangapple
  • Posted on May 28, 2023, 20:46:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76351575.html