Write a loop in R to collapse rows based on identical consecutive column values

Question

I am trying to write a loop in R to collapse rows. Let's say I'm using a data frame like this one:

r1 <- c(1, 1,1000,2)
r2 <- c(1, 1001,2000, 2)
r3 <- c(1, 2001,3000, 2)
r4 <- c(1, 3001,4000, 1)
r5 <- c(1, 4001,5000, 3)
r6 <- c(1, 5001,6000, 3)
r7 <- c(2, 1,1000,2 )
r8 <- c(2, 1001,2000, 1)
r9 <- c(2, 2001,3000, 2)
r10 <- c(2, 3001,4000, 1)
r11 <- c(2, 4001,5000, 1)
test <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11)
test <- as.data.frame(test)
colnames(test) <- c("chr", "start","end", "abs.sum")
rownames(test) <- NULL

This gives me a data frame that looks like this:

   chr start  end abs.sum
1    1     1 1000       2
2    1  1001 2000       2
3    1  2001 3000       2
4    1  3001 4000       1
5    1  4001 5000       3
6    1  5001 6000       3
7    2     1 1000       2
8    2  1001 2000       1
9    2  2001 3000       2
10   2  3001 4000       1
11   2  4001 5000       1

For each chr value, I want to collapse based on identical consecutive abs.sum, keeping the lowest value in start and the highest value in end. So, for example, I would like my final data frame to look like this:

  chr start  end abs.sum
1   1     1 3000       2
2   1  3001 4000       1
3   1  4001 6000       3
4   2     1 1000       2
5   2  1001 2000       1
6   2  2001 3000       2
7   2  3001 5000       1

I tried writing a for loop:

for (i in 1:nrow(test)) {

        if (test$abs.sum[i] == test$abs.sum[i + 1]) {
                test$end[i] <- test$end[i+1]
                test <- test[-i + 1]
                test <- test[-(i + 1),]
        }

}

Which returns the error:

Error in if (test$abs.sum[i] == test$abs.sum[i + 1]) { :
  argument is of length zero

I know this isn't correct, but it's what I have been able to piece together so far. I think this may require some combination of "while" and "for" loops, but I am stuck. Maybe a package already exists with a function that can do this?

Answer 1

Score: 2

Here's a dplyr and tidyr method. The workflow is:

  1. Create a "tmp" column with a unique value for each run of identical consecutive "abs.sum" values using consecutive_id()
  2. Pivot the data to long format to get all "start" and "end" values into a single column
  3. Group the data by "chr" and "tmp" and use filter() to keep the min and max values for each subgroup ("tmp")
  4. Pivot the data back to wide format

library(dplyr)   # consecutive_id() requires dplyr >= 1.1.0
library(tidyr)

test %>%
  mutate(tmp = consecutive_id(abs.sum)) %>%
  pivot_longer(!c(chr, abs.sum, tmp)) %>%
  group_by(chr, tmp) %>%
  filter(value == min(value) | value == max(value)) %>%
  pivot_wider(names_from = name,
              values_from = value) %>%
  ungroup() %>%
  select(-tmp)

# A tibble: 7 × 4
    chr abs.sum start   end
  <dbl>   <dbl> <dbl> <dbl>
1     1       2     1  3000
2     1       1  3001  4000
3     1       3  4001  6000
4     2       2     1  1000
5     2       1  1001  2000
6     2       2  2001  3000
7     2       1  3001  5000

Update:
As per @langtang's helpful suggestion, the above can be simplified to:

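test %>%
  mutate(tmp = consecutive_id(abs.sum)) %>%
  group_by(chr, tmp) %>%
  # name the outputs and carry abs.sum through so the result matches the desired table
  summarise(start = min(start), end = max(end), abs.sum = first(abs.sum)) %>%
  ungroup() %>%
  select(-tmp)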

Interestingly (for me anyway), on the sample data, the summarise() approach is ~3x slower than the pivot approach. Not sure if this scales to larger datasets. Either way, both of these methods are slower than the data.table method as outlined in @langtang's answer. Just added it here in case someone is curious about how it could be done solely in the tidyverse.
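
To check whether that timing difference holds beyond the 11-row sample, something like the following sketch could be used. The helper names (make_data, pivot_method, summarise_method) and the simulated data sizes are assumptions made for illustration, not part of the original answer, and system.time() only gives a rough single-run comparison.

```r
library(dplyr)   # consecutive_id() requires dplyr >= 1.1.0
library(tidyr)

# Hypothetical larger input: n_chr chromosomes with n_bin bins each
make_data <- function(n_chr = 10, n_bin = 2000) {
  set.seed(1)
  data.frame(
    chr     = rep(seq_len(n_chr), each = n_bin),
    start   = rep(seq(1, by = 1000, length.out = n_bin), times = n_chr),
    end     = rep(seq(1000, by = 1000, length.out = n_bin), times = n_chr),
    abs.sum = sample(1:3, n_chr * n_bin, replace = TRUE)
  )
}

# The pivot approach from this answer, wrapped in a function
pivot_method <- function(df) {
  df %>%
    mutate(tmp = consecutive_id(abs.sum)) %>%
    pivot_longer(!c(chr, abs.sum, tmp)) %>%
    group_by(chr, tmp) %>%
    filter(value == min(value) | value == max(value)) %>%
    pivot_wider(names_from = name, values_from = value) %>%
    ungroup() %>%
    select(-tmp)
}

# The summarise approach, with named outputs
summarise_method <- function(df) {
  df %>%
    mutate(tmp = consecutive_id(abs.sum)) %>%
    group_by(chr, tmp) %>%
    summarise(start = min(start), end = max(end),
              abs.sum = first(abs.sum), .groups = "drop") %>%
    select(-tmp)
}

big <- make_data()
system.time(p <- pivot_method(big))
system.time(s <- summarise_method(big))
```

For a more careful comparison, a dedicated benchmarking package that repeats each expression many times would be a better fit than a single system.time() call.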

Answer 2

Score: 1

You can do this with a run-length id, which gives each run of identical consecutive values its own group id:

library(data.table)

setDT(test)[, .(start=min(start), end=max(end), abs.sum=min(abs.sum)), .(chr,rleid(abs.sum))][,-2]

Output:

     chr start   end abs.sum
   <num> <num> <num>   <num>
1:     1     1  3000       2
2:     1  3001  4000       1
3:     1  4001  6000       3
4:     2     1  1000       2
5:     2  1001  2000       1
6:     2  2001  3000       2
7:     2  3001  5000       1
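
For readers who have not met rleid() before, the idea can be sketched in base R with rle(). The helper run_id below is a hypothetical stand-in written for illustration, not a data.table function, and the data frame is rebuilt here only so the block runs on its own.

```r
# Sample data from the question
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

# Hypothetical helper: one integer id per run of identical consecutive values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

run_id(test$abs.sum)
# 1 1 1 2 3 3 4 5 6 7 7

# Run over chr and abs.sum together so a run always ends at a chr boundary
id <- run_id(paste(test$chr, test$abs.sum))

# Collapse each run to a single row: lowest start, highest end
collapsed <- do.call(rbind, lapply(split(test, id), function(g) {
  data.frame(chr = g$chr[1], start = min(g$start),
             end = max(g$end), abs.sum = g$abs.sum[1])
}))
rownames(collapsed) <- NULL
collapsed
```

Grouping by chr in addition to the run id matters when one chr ends with the same abs.sum value that the next chr starts with; the sample data does not contain that case.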

Here is another option using dplyr, grouping on consecutive_id() (thanks to @LeroyTyrone for pointing out that function):

library(dplyr)

test %>%
  group_by(chr, id=consecutive_id(abs.sum)) %>%
  summarize(start=min(start), end=max(end), abs.sum=min(abs.sum)) %>%
  select(-id)

Output:

    chr start   end abs.sum
  <dbl> <dbl> <dbl>   <dbl>
1     1     1  3000       2
2     1  3001  4000       1
3     1  4001  6000       3
4     2     1  1000       2
5     2  1001  2000       1
6     2  2001  3000       2
7     2  3001  5000       1
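
As for the loop in the question: test[-i + 1] is evaluated as test[(-i) + 1], which for i = 1 is test[0], a data frame with no columns. On the next pass test$abs.sum is NULL, so the comparison has length zero and if() stops with the error shown. A for loop also fixes its sequence up front, so it cannot adapt as rows are removed. If an explicit loop is still wanted, a while loop along these lines should work; collapse_runs is a hypothetical name, and the sketch assumes chr and abs.sum contain no NA values.

```r
# Sample data from the question
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

# Hypothetical helper: merge row i + 1 into row i while they belong to the same run
collapse_runs <- function(df) {
  i <- 1
  while (i < nrow(df)) {          # nrow(df) is re-evaluated on every pass
    same_run <- df$chr[i] == df$chr[i + 1] &&
      df$abs.sum[i] == df$abs.sum[i + 1]
    if (same_run) {
      df$start[i] <- min(df$start[i], df$start[i + 1])
      df$end[i]   <- max(df$end[i], df$end[i + 1])
      df <- df[-(i + 1), ]        # drop the absorbed row and stay on row i
    } else {
      i <- i + 1                  # move on only when the next row starts a new run
    }
  }
  rownames(df) <- NULL
  df
}

result <- collapse_runs(test)
result
```

Removing rows one at a time copies the data frame on each merge, so on large inputs the grouped approaches above are likely to be much faster.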

Posted by huangapple on May 11, 2023 at 06:18:55
When reposting, please keep the link to this article: https://go.coder-hub.com/76222922.html