Write a loop in R to collapse rows based on identical consecutive column values
Question
I am trying to write a loop in R to collapse rows. Let's say I'm using a data frame like this one:
r1 <- c(1, 1, 1000, 2)
r2 <- c(1, 1001, 2000, 2)
r3 <- c(1, 2001, 3000, 2)
r4 <- c(1, 3001, 4000, 1)
r5 <- c(1, 4001, 5000, 3)
r6 <- c(1, 5001, 6000, 3)
r7 <- c(2, 1, 1000, 2)
r8 <- c(2, 1001, 2000, 1)
r9 <- c(2, 2001, 3000, 2)
r10 <- c(2, 3001, 4000, 1)
r11 <- c(2, 4001, 5000, 1)
test <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11)
test <- as.data.frame(test)
colnames(test) <- c("chr", "start", "end", "abs.sum")
rownames(test) <- NULL
This gives me a data frame that looks like this:
   chr start  end abs.sum
1    1     1 1000       2
2    1  1001 2000       2
3    1  2001 3000       2
4    1  3001 4000       1
5    1  4001 5000       3
6    1  5001 6000       3
7    2     1 1000       2
8    2  1001 2000       1
9    2  2001 3000       2
10   2  3001 4000       1
11   2  4001 5000       1
For each chr value, I want to collapse rows based on identical consecutive abs.sum values, keeping the lowest value in start and the highest value in end. So, for example, I would like my final data frame to look like this:
  chr start  end abs.sum
1   1     1 3000       2
2   1  3001 4000       1
3   1  4001 6000       3
4   2     1 1000       2
5   2  1001 2000       1
6   2  2001 3000       2
7   2  3001 5000       1
I tried writing a for loop:
for (i in 1:nrow(test)) {
  if (test$abs.sum[i] == test$abs.sum[i + 1]) {
    test$end[i] <- test$end[i+1]
    test <- test[-i + 1]
    test <- test[-(i + 1),]
  }
}
Which returns the error:
>Error in if (test$abs.sum[i] == test$abs.sum[i + 1]) { :
argument is of length zero
I know this isn't correct, but it's what I have been able to piece together so far. I think this may require some combination of "while" and "for" loops, but I am stuck. Maybe a package already exists with a function that can do this?
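A note on the error above, offered as a sketch rather than a definitive diagnosis: with i = 1, `test[-i + 1]` evaluates to `test[0]`, which selects zero columns. On the next iteration `test$abs.sum` is therefore NULL, and comparing NULL with NULL gives a zero-length result, which `if` rejects. Separately, a `for` loop fixes its sequence before it starts, so deleting rows inside it eventually indexes past the end of the shrinking data frame. One way around both problems is a `while` loop that only advances when no merge happened. The helper name `collapse_runs` is hypothetical, and the sample data is rebuilt here so the block runs on its own.

```r
# Sample data from the question
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

# Hypothetical helper: merge row i + 1 into row i while both rows
# share the same chr and abs.sum
collapse_runs <- function(df) {
  i <- 1
  while (i < nrow(df)) {
    same_run <- df$chr[i] == df$chr[i + 1] &&
      df$abs.sum[i] == df$abs.sum[i + 1]
    if (same_run) {
      df$start[i] <- min(df$start[i], df$start[i + 1])
      df$end[i]   <- max(df$end[i], df$end[i + 1])
      df <- df[-(i + 1), ]  # drop the merged row and stay on row i
    } else {
      i <- i + 1            # no merge, move to the next row
    }
  }
  rownames(df) <- NULL
  df
}

collapsed <- collapse_runs(test)
print(collapsed)
```

Deleting rows one at a time copies the data frame on every merge, so this is likely to be slow on large inputs. A vectorised approach that labels runs first and then summarises each run should scale better.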
Answer 1
Score: 2
Here's a dplyr and tidyr method. The workflow is:
- Create a "tmp" column with a unique value for each run of identical consecutive "abs.sum" values using consecutive_id()
- Pivot the data to long format to get all "start" and "end" values into a single column
- Group the data by "chr" and "tmp" and use filter() to keep the min and max values for each subgroup ("tmp")
- Pivot the data back to wide format
library(dplyr)
library(tidyr)
test %>%
  mutate(tmp = consecutive_id(abs.sum)) %>%
  pivot_longer(!c(chr, abs.sum, tmp)) %>%
  group_by(chr, tmp) %>%
  filter(value == min(value) | value == max(value)) %>%
  pivot_wider(names_from = name,
              values_from = value) %>%
  ungroup() %>%
  select(-tmp)
# A tibble: 7 × 4
    chr abs.sum start   end
  <dbl>   <dbl> <dbl> <dbl>
1     1       2     1  3000
2     1       1  3001  4000
3     1       3  4001  6000
4     2       2     1  1000
5     2       1  1001  2000
6     2       2  2001  3000
7     2       1  3001  5000
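To make the first step of the workflow concrete, here is a small sketch of what consecutive_id() returns for the abs.sum column of the sample data. It assumes dplyr 1.1.0 or later, where consecutive_id() was introduced. The vector names `abs_sum`, `chr`, `ids` and `ids_both` are made up for illustration.

```r
library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0

# The chr and abs.sum columns from the sample data, as plain vectors
chr     <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
abs_sum <- c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)

# One integer id per run of identical consecutive values
ids <- consecutive_id(abs_sum)
print(ids)
# 1 1 1 2 3 3 4 5 6 7 7

# Passing both vectors starts a new id whenever either one changes,
# so a run can never span two chr values
ids_both <- consecutive_id(chr, abs_sum)
print(ids_both)
```

In the sample data the two results are the same, because abs.sum happens to change exactly where chr changes. Grouping by both "chr" and "tmp", as the pipeline above does, also keeps runs from crossing a chr boundary.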
Update:
As per @langtang's helpful suggestion, the above can be simplified to:
test %>%
  mutate(tmp = consecutive_id(abs.sum)) %>%
  group_by(chr, tmp) %>%
  summarise(min(start), max(end)) %>%
  ungroup() %>%
  select(-tmp)
Interestingly (for me anyway), on the sample data, the summarise() approach is ~3x slower than the pivot approach. Not sure if this scales to larger datasets. Either way, both of these methods are slower than the data.table method as outlined in @langtang's answer. Just added it here in case someone is curious about how it could be done solely in the tidyverse.
Answer 2
Score: 1
You can do this leveraging run-length id:
library(data.table)
setDT(test)[, .(start=min(start), end=max(end), abs.sum=min(abs.sum)), .(chr,rleid(abs.sum))][,-2]
Output:
     chr start   end abs.sum
   <num> <num> <num>   <num>
1:     1     1  3000       2
2:     1  3001  4000       1
3:     1  4001  6000       3
4:     2     1  1000       2
5:     2  1001  2000       1
6:     2  2001  3000       2
7:     2  3001  5000       1
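For readers without data.table, the run-length id idea can be approximated in base R. The sketch below is not the rleid() implementation, just one way to get equivalent run labels: mark a new run wherever chr or abs.sum differs from the previous row, take the cumulative sum of those marks, and summarise each run with tapply(). The names `new_run`, `run_id` and `collapsed` are made up for illustration, and the sample data is rebuilt so the block runs on its own.

```r
# Sample data from the question
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

n <- nrow(test)

# A run starts at row 1 and wherever chr or abs.sum differs from the row before
new_run <- c(TRUE,
             test$chr[-1] != test$chr[-n] |
               test$abs.sum[-1] != test$abs.sum[-n])

# Cumulative sum turns the start markers into run labels 1, 2, 3, ...
run_id <- cumsum(new_run)

# Summarise each run: chr and abs.sum are constant within a run,
# so min() simply picks that constant value
collapsed <- data.frame(
  chr     = as.vector(tapply(test$chr, run_id, min)),
  start   = as.vector(tapply(test$start, run_id, min)),
  end     = as.vector(tapply(test$end, run_id, max)),
  abs.sum = as.vector(tapply(test$abs.sum, run_id, min))
)
print(collapsed)
```

Because every step is vectorised, this avoids the row-by-row deletion an explicit loop would need.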
Here is another option using dplyr (thanks to @LeroyTyrone for pointing out the consecutive_id() function):
library(dplyr)
test %>%
  group_by(chr, id=consecutive_id(abs.sum)) %>%
  summarize(start=min(start), end=max(end), abs.sum=min(abs.sum)) %>%
  select(-id)
Output:
    chr start   end abs.sum
  <dbl> <dbl> <dbl>   <dbl>
1     1     1  3000       2
2     1  3001  4000       1
3     1  4001  6000       3
4     2     1  1000       2
5     2  1001  2000       1
6     2  2001  3000       2
7     2  3001  5000       1