May 28, 2023, 20:46:25

How to filter a dataframe in R but also get the row before and after the filter

Question

My problem is that the values do not always match exactly, so I also need to see the row immediately before and the row immediately after the filtered range (one row of context on each side) to make sure I'm not missing anything. I tried lag and lead, and data.table's shift, but there is always some unwanted output that I can't get rid of.

I have a file of genomic distances that looks like this:

chr	snp	pos	cM
chr8	rs2003497	9432942	0.0
chr8	rs1241241	9437099	0.0
chr8	rs5262363	9440613	0.0
chr8	rs5152525	94355216	0.0
chr8	rs5135151	94371918	0.1
chr8	rs5253252	94374354	0.1
chr8	rs5135151	94392948	0.0

I want to filter on chr and pos and output the resulting slice:

filter(data, chr == chromosome & pos >= start_pos & pos <= end_pos)
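To make the goal concrete, here is a small base R sketch that reads the sample rows above and applies the same condition. The bounds chromosome, start_pos and end_pos are hypothetical values picked only for illustration, and base R indexing stands in for the dplyr filter call.

```r
# The sample rows from the question, read from inline text.
data <- read.table(text = "
chr snp pos cM
chr8 rs2003497 9432942 0.0
chr8 rs1241241 9437099 0.0
chr8 rs5262363 9440613 0.0
chr8 rs5152525 94355216 0.0
chr8 rs5135151 94371918 0.1
chr8 rs5253252 94374354 0.1
chr8 rs5135151 94392948 0.0
", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical query bounds, chosen only for illustration.
chromosome <- "chr8"
start_pos  <- 9435000
end_pos    <- 9441000

# Same condition as the filter() call, written as a logical vector.
inside <- data$chr == chromosome & data$pos >= start_pos & data$pos <= end_pos

# What a plain filter returns: only the rows inside the range.
plain <- data[inside, ]
print(plain)

# Row numbers of the matches; the rows wanted in addition are the ones
# directly before the first match and directly after the last match.
print(which(inside))
```

With these bounds the plain filter keeps rows 2 and 3, while the desired output would also include rows 1 and 4.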

Answer 1

Score: 1

Given that your data seems insufficient to reflect the issue adequately (as per the comments), here's some toy data and a generic solution to your problem (which, BTW, is a frequent task for which there is, surprisingly, not yet a neat function available):

Data:

df <- data.frame(
  x = c("B","A","A","A","B","C","B", "A", "C", "A"),
  y = c(1,2,1,3,1,2,1,5,1,2)
)

Task:

Suppose I need to filter all rows where x == "A" & y > 2 PLUS the immediately surrounding rows (above and below).

Solution:

The solution I propose involves writing a function that gets the indices of the filtered rows PLUS those of the surrounding rows:

Function:

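row_sequence <- function(value1, value2) {
  # positions of the rows that satisfy the filter condition
  inds <- which(value1 == "A" & value2 > 2)
  # add the row before and the row after each match, without duplicates
  sort(unique(c(inds - 1, inds, inds + 1)))
}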

Now just pass the result of row_sequence to a call to slice:

Implementation:

library(dplyr)
df %>% 
  slice(row_sequence(x, y))

Result:

  x y
1 A 1
2 A 3   # <- filtered
3 B 1
4 B 1
5 A 5   # <- filtered
6 C 1
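One thing to watch with this approach: if the first or last row matches, inds - 1 becomes 0 or inds + 1 points past the end of the data. Rather than rely on how the subsetting function treats such positions, the indices can be clipped explicitly. Below is a minimal sketch of that idea. The helper rows_with_context and its n argument are hypothetical names introduced here, not part of dplyr or of the answer above, and it takes the condition as a logical vector so it is not tied to x == "A" & y > 2.

```r
# Hypothetical helper: rows where `cond` is TRUE plus `n` rows on each side.
rows_with_context <- function(cond, n = 1) {
  hits <- which(cond)                         # NA in cond counts as no match
  offsets <- seq(-n, n)                       # e.g. -1, 0, 1 for n = 1
  idx <- unique(as.vector(outer(hits, offsets, `+`)))
  sort(idx[idx >= 1 & idx <= length(cond)])   # drop positions outside the data
}

# Same toy data as above.
df <- data.frame(
  x = c("B","A","A","A","B","C","B", "A", "C", "A"),
  y = c(1,2,1,3,1,2,1,5,1,2)
)

# Same condition as in row_sequence, now passed in from outside.
idx <- rows_with_context(df$x == "A" & df$y > 2)
print(df[idx, ])

# A match in the last row (row 10) no longer asks for a row 11.
edge_idx <- rows_with_context(df$x == "A" & df$y == 2)
print(edge_idx)
```

The base R indexing df[idx, ] is used here only to keep the sketch dependency-free; the same helper should also work inside slice, e.g. df %>% slice(rows_with_context(x == "A" & y > 2)).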

Answer 2

Score: 0

You could first add a couple of marker columns to your dataset: here filter_mark identifies the rows that pass the filter and window_mark identifies the rows directly before or after such a row. During the actual subsetting you can then either include or exclude those extra rows:

library(dplyr)
chromosome <- "chr8"
start_pos  <- 166818 + 1
end_pos    <- 181076 - 1
df_ <- df_ %>% 
  mutate(filter_mark = chr == chromosome & pos >= start_pos & pos <= end_pos,
         window_mark = lag(filter_mark) | lead(filter_mark))
df_ %>% filter(filter_mark | window_mark)
#> # A tibble: 3 × 6
#>   chr   snp           pos cM    filter_mark window_mark
#>   <chr> <chr>       <dbl> <lgl> <lgl>       <lgl>      
#> 1 chr8  rs2003497  166818 NA    FALSE       TRUE       
#> 2 chr8  rs10488368 180568 NA    TRUE        FALSE      
#> 3 chr8  rs10488369 181076 NA    FALSE       TRUE

Input data:

df_ <- readr::read_table("
chr snp pos cM
chr8    rs2003497   166818  NA
chr8    rs10488368  180568  NA
chr8    rs10488369  181076  NA")

Created on 2023-05-28 with reprex v2.0.2
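Two details may matter on a larger file. By default lag and lead pad the ends with NA, so window_mark can be NA in the first and last rows, and with several chromosomes in one table the neighbouring row can belong to a different chromosome. The sketch below is one possible variation that addresses both, using default = FALSE and group_by(chr). The data frame snps and its values are made up for illustration, and it assumes rows are already sorted by pos within each chromosome.

```r
library(dplyr)

# Made-up data: two chromosomes, positions sorted within each.
snps <- data.frame(
  chr = c("chr7", "chr7", "chr8", "chr8", "chr8", "chr8"),
  pos = c(100, 200, 100, 200, 300, 400)
)

marked <- snps %>%
  group_by(chr) %>%                     # neighbours are looked up per chromosome
  mutate(
    filter_mark = chr == "chr8" & pos >= 150 & pos <= 250,
    # default = FALSE pads the ends with FALSE instead of NA
    window_mark = lag(filter_mark, default = FALSE) |
      lead(filter_mark, default = FALSE)
  ) %>%
  ungroup()

# Keep the matching rows plus one row of context on each side.
result <- marked %>% filter(filter_mark | window_mark)
print(result)
```

Here only the chr8 row at pos 200 passes the filter, and the chr8 rows at pos 100 and 300 are kept as context; no chr7 row is pulled in as a neighbour.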
