May 11, 2023, 06:18:55

Write a loop in R to collapse rows based on identical consecutive column values

Question

I am trying to write a loop in R to collapse rows. Let's say I'm using a data frame like this one:

r1 <- c(1, 1, 1000, 2)
r2 <- c(1, 1001, 2000, 2)
r3 <- c(1, 2001, 3000, 2)
r4 <- c(1, 3001, 4000, 1)
r5 <- c(1, 4001, 5000, 3)
r6 <- c(1, 5001, 6000, 3)
r7 <- c(2, 1, 1000, 2)
r8 <- c(2, 1001, 2000, 1)
r9 <- c(2, 2001, 3000, 2)
r10 <- c(2, 3001, 4000, 1)
r11 <- c(2, 4001, 5000, 1)
test <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11)
test <- as.data.frame(test)
colnames(test) <- c("chr", "start", "end", "abs.sum")
rownames(test) <- NULL

This gives me a data frame that looks like this:

   chr start  end abs.sum
1    1     1 1000       2
2    1  1001 2000       2
3    1  2001 3000       2
4    1  3001 4000       1
5    1  4001 5000       3
6    1  5001 6000       3
7    2     1 1000       2
8    2  1001 2000       1
9    2  2001 3000       2
10   2  3001 4000       1
11   2  4001 5000       1

For each chr value, I want to collapse based on identical consecutive abs.sum, keeping the lowest value in start and the highest value in end. So, for example, I would like my final data frame to look like this:

  chr start  end abs.sum
1   1     1 3000       2
2   1  3001 4000       1
3   1  4001 6000       3
4   2     1 1000       2
5   2  1001 2000       1
6   2  2001 3000       2
7   2  3001 5000       1

I tried writing a for loop:

for (i in 1:nrow(test)) {

        if (test$abs.sum[i] == test$abs.sum[i + 1]) {
                test$end[i] <- test$end[i+1]
                test <- test[-i + 1]
                test <- test[-(i + 1),]
        }

}

Which returns the error:

>Error in if (test$abs.sum[i] == test$abs.sum[i + 1]) { :
argument is of length zero

I know this isn't correct, but it's what I have been able to piece together so far. I think this may require some combination of "while" and "for" loops, but I am stuck. Maybe a package already exists with a function that can do this?

Answer 1

Score: 2

Here's a dplyr and tidyr method. The workflow is:

1. Create a "tmp" column with a unique value for each run of consecutive identical "abs.sum" values using consecutive_id()
2. Pivot the data to long format to get all "start" and "end" values into a single column
3. Group the data by "chr" and "tmp" and use filter() to keep the min and max values for each subgroup ("tmp")
4. Pivot the data back to wide format

library(dplyr)
library(tidyr)
test %>%
  mutate(tmp = consecutive_id(abs.sum)) %>%
  pivot_longer(!c(chr, abs.sum, tmp)) %>%
  group_by(chr, tmp) %>%
  filter(value == min(value) | value == max(value)) %>%
  pivot_wider(names_from = name,
              values_from = value) %>%
  ungroup() %>%
  select(-tmp)
# A tibble: 7 × 4
    chr abs.sum start   end
  <dbl>   <dbl> <dbl> <dbl>
1     1       2     1  3000
2     1       1  3001  4000
3     1       3  4001  6000
4     2       2     1  1000
5     2       1  1001  2000
6     2       2  2001  3000
7     2       1  3001  5000

Update:
As per @langtang's helpful suggestion, the above can be simplified to:

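test %>%
  mutate(tmp = consecutive_id(abs.sum)) %>%
  group_by(chr, tmp) %>%
  # name the outputs and carry abs.sum through so the columns match the desired result
  summarise(start = min(start), end = max(end), abs.sum = first(abs.sum)) %>%
  ungroup() %>%
  select(-tmp)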

Interestingly (for me anyway), on the sample data, the summarise() approach is ~3x slower than the pivot approach. Not sure if this scales to larger datasets. Either way, both of these methods are slower than the data.table method as outlined in @langtang's answer. Just added it here in case someone is curious about how it could be done solely in the tidyverse.

Answer 2

Score: 1

You can do this leveraging run-length id:

library(data.table)
setDT(test)[, .(start=min(start), end=max(end), abs.sum=min(abs.sum)), .(chr,rleid(abs.sum))][,-2]

Output:

     chr start   end abs.sum
   <num> <num> <num>   <num>
1:     1     1  3000       2
2:     1  3001  4000       1
3:     1  4001  6000       3
4:     2     1  1000       2
5:     2  1001  2000       1
6:     2  2001  3000       2
7:     2  3001  5000       1
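
To make the run-length ID idea concrete, here is a minimal base R sketch that builds a comparable ID with rle() and then aggregates per run. It is not the data.table implementation; the names key, runs, run_id and collapsed are only illustrative, and pasting chr and abs.sum into one key is an assumption I make so that a run never spans two chr values.

```r
# Sample data from the question, rebuilt so the block is self-contained.
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

# Illustrative key: a run continues only while both chr and abs.sum repeat.
key <- paste(test$chr, test$abs.sum)

# rle() returns the length of each run of identical consecutive keys;
# repeating each run's index by its length labels every row with its run.
runs   <- rle(key)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
# run_id is 1 1 1 2 3 3 4 5 6 7 7

# Collapse each run to one row: lowest start, highest end.
collapsed <- do.call(rbind, lapply(split(test, run_id), function(g) {
  data.frame(chr = g$chr[1], start = min(g$start),
             end = max(g$end), abs.sum = g$abs.sum[1])
}))
rownames(collapsed) <- NULL
collapsed
```

On the sample data this should give the same seven rows as the output above. It is meant for understanding the grouping rather than for speed.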

Here is another option using dplyr (thanks to @LeroyTyrone for pointing out the consecutive_id() function):

library(dplyr)
test %>%
  group_by(chr, id=consecutive_id(abs.sum)) %>%
  summarize(start=min(start), end=max(end), abs.sum=min(abs.sum)) %>%
  select(-id)

Output:

    chr start   end abs.sum
  <dbl> <dbl> <dbl>   <dbl>
1     1     1  3000       2
2     1  3001  4000       1
3     1  4001  6000       3
4     2     1  1000       2
5     2  1001  2000       1
6     2  2001  3000       2
7     2  3001  5000       1
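
Since the question asked for a loop, here is a rough base R sketch of how one could be written. It uses while instead of for, because a for loop in R fixes its index sequence up front and cannot react to rows being removed. The variable name out is just for illustration, and the sketch assumes the rows are already ordered by chr and start.

```r
# Sample data from the question, rebuilt so the block is self-contained.
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

out <- test  # work on a copy so the original stays intact
i <- 1
# nrow(out) is re-evaluated on every pass, so the loop shrinks with the data
# and never reads past the last row.
while (i < nrow(out)) {
  same <- out$chr[i] == out$chr[i + 1] &&
    out$abs.sum[i] == out$abs.sum[i + 1]
  if (same) {
    # Merge row i + 1 into row i, then drop it and compare row i again.
    out$start[i] <- min(out$start[i], out$start[i + 1])
    out$end[i]   <- max(out$end[i], out$end[i + 1])
    out <- out[-(i + 1), ]
  } else {
    # Only advance when the next row starts a different run.
    i <- i + 1
  }
}
rownames(out) <- NULL
out
```

Dropping rows one at a time copies the data frame on each removal, so I would expect this to be much slower than the grouped approaches on large inputs. It is mainly useful for seeing why the original for loop ran off the end of the data.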
