Write a loop in R to collapse rows based on identical consecutive column values
Question
I am trying to write a loop in R to collapse rows. Let's say I'm using a data frame like this one:
r1 <- c(1, 1,1000,2)
r2 <- c(1, 1001,2000, 2)
r3 <- c(1, 2001,3000, 2)
r4 <- c(1, 3001,4000, 1)
r5 <- c(1, 4001,5000, 3)
r6 <- c(1, 5001,6000, 3)
r7 <- c(2, 1,1000,2 )
r8 <- c(2, 1001,2000, 1)
r9 <- c(2, 2001,3000, 2)
r10 <- c(2, 3001,4000, 1)
r11 <- c(2, 4001,5000, 1)
test <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11)
test <- as.data.frame(test)
colnames(test) <- c("chr", "start","end", "abs.sum")
rownames(test) <- NULL
This gives me a data frame that looks like this:
   chr start  end abs.sum
1    1     1 1000       2
2    1  1001 2000       2
3    1  2001 3000       2
4    1  3001 4000       1
5    1  4001 5000       3
6    1  5001 6000       3
7    2     1 1000       2
8    2  1001 2000       1
9    2  2001 3000       2
10   2  3001 4000       1
11   2  4001 5000       1
For each chr value, I want to collapse based on identical consecutive abs.sum, keeping the lowest value in start and the highest value in end. So, for example, I would like my final data frame to look like this:
  chr start  end abs.sum
1   1     1 3000       2
2   1  3001 4000       1
3   1  4001 6000       3
4   2     1 1000       2
5   2  1001 2000       1
6   2  2001 3000       2
7   2  3001 5000       1
I tried writing a for loop:
for (i in 1:nrow(test)) {
        
        if (test$abs.sum[i] == test$abs.sum[i + 1]) {
                test$end[i] <- test$end[i+1]
                test <- test[-i + 1]
                test <- test[-(i + 1),]
        }
        
}
Which returns the error:
>Error in if (test$abs.sum[i] == test$abs.sum[i + 1]) { :
argument is of length zero
I know this isn't correct, but it's what I have been able to piece together so far. I think this may require some combination of "while" and "for" loops, but I am stuck. Maybe a package already exists with a function that can do this?
Answer 1
Score: 2
Here's a dplyr and tidyr method. The workflow is:
- Create a "tmp" column with a unique value for each run of consecutive "abs.sum" values using consecutive_id()
- Pivot the data to long format to get all "start" and "end" values into a single column
- Group the data by "chr" and "tmp" and use filter() to get the min and max values for each subgroup ("tmp")
- Pivot the data back to wide format
 
library(dplyr)
library(tidyr)
test %>% 
  mutate(tmp = consecutive_id(abs.sum)) %>%
  pivot_longer(!c(chr, abs.sum, tmp)) %>%
  group_by(chr, tmp) %>%
  filter(value == min(value) | value == max(value)) %>%
  pivot_wider(names_from = name,
              values_from = value) %>%
  ungroup() %>%
  select(-tmp)
# A tibble: 7 × 4
    chr abs.sum start   end
  <dbl>   <dbl> <dbl> <dbl>
1     1       2     1  3000
2     1       1  3001  4000
3     1       3  4001  6000
4     2       2     1  1000
5     2       1  1001  2000
6     2       2  2001  3000
7     2       1  3001  5000
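To make the first step of the workflow concrete, here is a small sketch of what consecutive_id() returns for the abs.sum column of the sample data. It assumes dplyr 1.1.0 or later, which is where consecutive_id() was introduced; the vector is typed in by hand rather than taken from test.

```r
library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0

# abs.sum column from the question's sample data, typed in by hand
abs.sum <- c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)

# The id goes up by one every time the value differs from the previous element
tmp <- consecutive_id(abs.sum)
tmp
# 1 1 1 2 3 3 4 5 6 7 7
```

Note that the id is computed over the whole column, ignoring chr. If the last run of one chr and the first run of the next happened to share the same abs.sum, they would get the same id, which is why the pipeline groups by chr as well as tmp. Passing both columns, consecutive_id(chr, abs.sum), would probably work as an alternative.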
Update:
As per @langtang's helpful suggestion, the above can be simplified to:
test %>% 
  mutate(tmp = consecutive_id(abs.sum)) %>%
  group_by(chr, tmp) %>%
  summarise(start = min(start), end = max(end), abs.sum = first(abs.sum)) %>%
  ungroup() %>%
  select(-tmp)
Interestingly (for me anyway), on the sample data, the summarise() approach is ~3x slower than the pivot approach. Not sure if this scales to larger datasets. Either way, both of these methods are slower than the data.table method as outlined in @langtang's answer. Just added it here in case someone is curious about how it could be done solely in the tidyverse.
Answer 2
Score: 1
You can do this leveraging run-length id:
library(data.table)
setDT(test)[, .(start=min(start), end=max(end), abs.sum=min(abs.sum)), .(chr,rleid(abs.sum))][,-2]
Output:
     chr start   end abs.sum
   <num> <num> <num>   <num>
1:     1     1  3000       2
2:     1  3001  4000       1
3:     1  4001  6000       3
4:     2     1  1000       2
5:     2  1001  2000       1
6:     2  2001  3000       2
7:     2  3001  5000       1
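If you would rather avoid extra packages, the same run-length idea can be sketched in base R. This is only an illustration, not how rleid() is implemented: the names changed, run_id and collapsed are made up for the example, and the run id is built with cumsum() over a "value changed" flag.

```r
# Sample data from the question
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

# A new run starts whenever chr or abs.sum differs from the previous row
n <- nrow(test)
changed <- c(TRUE, test$chr[-1] != test$chr[-n] |
                   test$abs.sum[-1] != test$abs.sum[-n])
run_id <- cumsum(changed)

# Collapse each run to one row: lowest start, highest end
collapsed <- data.frame(
  chr     = as.vector(tapply(test$chr, run_id, function(x) x[1])),
  start   = as.vector(tapply(test$start, run_id, min)),
  end     = as.vector(tapply(test$end, run_id, max)),
  abs.sum = as.vector(tapply(test$abs.sum, run_id, function(x) x[1]))
)
collapsed
```

Because chr is part of the change flag, a run can never span two chr values, so there is no need to group by chr separately here.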
Here is another option using dplyr (thanks to @LeroyTyrone for pointing out the consecutive_id() function):
library(dplyr)
test %>% 
  group_by(chr, id=consecutive_id(abs.sum)) %>% 
  summarize(start=min(start), end=max(end), abs.sum=min(abs.sum)) %>% 
  select(-id)
Output:
    chr start   end abs.sum
  <dbl> <dbl> <dbl>   <dbl>
1     1     1  3000       2
2     1  3001  4000       1
3     1  4001  6000       3
4     2     1  1000       2
5     2  1001  2000       1
6     2  2001  3000       2
7     2  3001  5000       1
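For completeness, here is a sketch of a loop-based version, since that is what the question was attempting. As far as I can tell, the "argument is of length zero" error comes from test[-i + 1]: it parses as test[(-i) + 1], which at i = 1 is test[0] and drops every column, so test$abs.sum is NULL on the next pass. A for loop over 1:nrow(test) is also awkward here because that range is fixed up front while rows are being removed. The while loop below is one way around both issues, assuming the rows are already ordered by chr and start; it is not meant as a replacement for the vectorised answers above.

```r
# Sample data from the question
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

i <- 1
# nrow(test) is re-evaluated on every pass, so it tracks the shrinking data frame
while (i < nrow(test)) {
  same_run <- test$chr[i] == test$chr[i + 1] &&
    test$abs.sum[i] == test$abs.sum[i + 1]
  if (same_run) {
    # Absorb row i + 1 into row i, then compare row i with its new neighbour
    test$start[i] <- min(test$start[i], test$start[i + 1])
    test$end[i]   <- max(test$end[i], test$end[i + 1])
    test <- test[-(i + 1), ]
  } else {
    # Only advance once the next row belongs to a different run
    i <- i + 1
  }
}
rownames(test) <- NULL
test
```

Removing rows one at a time copies the data frame on every merge, so this is likely to be much slower than the grouped approaches on large inputs.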