聚合列 – 如何处理不均匀的数据框

huangapple go评论67阅读模式
英文:

Aggregate column - how to handle uneven dataframe

问题

我想要按Chr列对每隔三行进行聚合(三行的总和)。然而,由于我的df不是3的倍数,我不确定如何处理最后剩下的行,可能只有1或2行。如果剩下两行,我希望只对这两行进行求和。

  • 输入
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
           value = c(1,3,1,3,5,6,3,1,3,5,0),
           seq = c(1,2,3,4,5,1,2,3,4,5,6))
  • 输出(使用dplyr mutate,保持所有列)
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
           value = c(1,3,1,3,5,6,3,1,3,5,0),
           seq = c(1,2,3,4,5,1,2,3,4,5,6),
           agg = c(5,5,5,8,8,10,10,10,8,8,8))
英文:

I want to aggregate per Chr every third row (sum of three rows). However, since my df is not divisible by 3, I am not sure how to handle the last remaining rows, which could be just 1 or 2 rows. If two rows remain, I would like to just sum those 2 remaining rows.

  • Input
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
           value = c(1,3,1,3,5,6,3,1,3,5,0),
           seq = c(1,2,3,4,5,1,2,3,4,5,6))
  • Output (using dplyr mutate, keeping all columns)
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
           value = c(1,3,1,3,5,6,3,1,3,5,0),
           seq = c(1,2,3,4,5,1,2,3,4,5,6),
           agg = c(5,5,5,8,8,10,10,10,8,8,8))

答案1

得分: 2

你可以使用 (row_number()-1) %/% 3 来每 3 个观察值分组。

library(dplyr)

df %>%
  mutate(Grp = (row_number()-1) %/% 3 + 1, .by = Chr) %>%
  mutate(agg = sum(value), .by = c(Chr, Grp))

#     Chr value seq Grp agg
# 1  chr1     1   1   1   5
# 2  chr1     3   2   1   5
# 3  chr1     1   3   1   5
# 4  chr1     3   4   2   8
# 5  chr1     5   5   2   8
# 6  chr2     6   1   1  10
# 7  chr2     3   2   1  10
# 8  chr2     1   3   1  10
# 9  chr2     3   4   2   8
# 10 chr2     5   5   2   8
# 11 chr3     0   6   1   0

如果不需要 Grp 列,你可以使用 select(-Grp) 来删除它。

英文:

You can use (row_number()-1) %/% 3 to group per 3 observations.

library(dplyr)

df %>%
  mutate(Grp = (row_number()-1) %/% 3 + 1, .by = Chr) %>%
  mutate(agg = sum(value), .by = c(Chr, Grp))

#     Chr value seq Grp agg
# 1  chr1     1   1   1   5
# 2  chr1     3   2   1   5
# 3  chr1     1   3   1   5
# 4  chr1     3   4   2   8
# 5  chr1     5   5   2   8
# 6  chr2     6   1   1  10
# 7  chr2     3   2   1  10
# 8  chr2     1   3   1  10
# 9  chr2     3   4   2   8
# 10 chr2     5   5   2   8
# 11 chr3     0   6   1   0

If the Grp column is not needed, you can drop it with select(-Grp).

答案2

得分: 2

这是一种相似但不同的方法:尝试使用新的.by参数。我尝试将其用于最后的mutate,但无法将cumsum(..与.by结合起来:

library(dplyr)

df %>%
  mutate(group = as.integer(gl(n(),3,n())), .by=Chr) %>%
  mutate(id = row_number(), .by = c(Chr, group)) %>%
  group_by(Chr, sumgroup = cumsum(id == 1)) %>%
  mutate(agg = sum(value)) %>%
  ungroup() %>%
  select(Chr, value, seq, agg)
  Chr   value   seq   agg
   <chr> <dbl> <dbl> <dbl>
 1 chr1      1     1     5
 2 chr1      3     2     5
 3 chr1      1     3     5
 4 chr1      3     4     8
 5 chr1      5     5     8
 6 chr2      6     1    10
 7 chr2      3     2    10
 8 chr2      1     3    10
 9 chr2      3     4     8
10 chr2      5     5     8
11 chr3      0     6     0
英文:

Here is a similar yet other approach: Trying out the new .by argument.
I tried to use it also for the last mutate but it was not possible to combine cumsum(.. with .by:

library(dplyr)

df %&gt;% 
  mutate(group = as.integer(gl(n(),3,n())), .by=Chr) %&gt;% 
  mutate(id = row_number(), .by = c(Chr, group)) %&gt;% 
  group_by(Chr, sumgroup = cumsum(id == 1)) %&gt;% 
  mutate(agg = sum(value)) %&gt;% 
  ungroup() %&gt;% 
  select(Chr, value, seq, agg)
  Chr   value   seq   agg
   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
 1 chr1      1     1     5
 2 chr1      3     2     5
 3 chr1      1     3     5
 4 chr1      3     4     8
 5 chr1      5     5     8
 6 chr2      6     1    10
 7 chr2      3     2    10
 8 chr2      1     3    10
 9 chr2      3     4     8
10 chr2      5     5     8
11 chr3      0     6     0

答案3

得分: 2

代码中的部分不需要翻译,只提供代码中的注释和输出内容的翻译:

代码部分:

  1. 在使用 dplyrave 进行操作时,首先加载 dplyr 库。
  2. 对数据框 df1 进行处理,使用 mutate 函数,为每行添加一个名为 "agg" 的新列。
  3. 利用 ave 函数对 "value" 列进行分组求和,分组依据为计算出的整数值 as.integer(gl(n(), 3, n())),其中 n() 返回行数。
  4. 最后,指定 .by 参数为 "Chr",表示按照 "Chr" 列进行分组。

输出结果:

  1. 输出结果包含四列,分别是 "Chr"、"value"、"seq"、"agg"。
  2. 按照 "Chr" 列进行分组,计算 "agg" 列的值,得到如上所示的结果。

代码部分:

  1. 在使用 data.table 时,首先加载 data.table 库。
  2. 使用 setDT 函数将数据框 df1 转换为数据表。
  3. 利用 [.SD[, rep(sum(value), .N), as.integer(gl(.N, 3, .N))]$V1, Chr] 对数据进行操作。
    • [.SD[, rep(sum(value), .N), as.integer(gl(.N, 3, .N))]$V1 表示对每个分组计算 "value" 列的总和,并将结果赋给 "agg" 列。
    • 最后,根据 "Chr" 列分组。

输出结果:

  1. 输出结果包含四列,分别是 "Chr"、"value"、"seq"、"agg"。
  2. 按照 "Chr" 列进行分组,计算 "agg" 列的值,得到如上所示的结果。
英文:

An option with dplyr & ave

library(dplyr) # &gt;= 1.1.0
df1 %&gt;% 
  mutate(agg = ave(value, as.integer(gl(n(), 3, n())), FUN = sum), .by = Chr)

-output

    Chr value seq agg
1  chr1     1   1   5
2  chr1     3   2   5
3  chr1     1   3   5
4  chr1     3   4   8
5  chr1     5   5   8
6  chr2     6   1  10
7  chr2     3   2  10
8  chr2     1   3  10
9  chr2     3   4   8
10 chr2     5   5   8
11 chr3     0   6   0

Or with data.table

library(data.table)
 setDT(df1)[,  agg := .SD[, rep(sum(value), .N),
    as.integer(gl(.N, 3, .N))]$V1, Chr]

-output

&gt; df1
     Chr value seq agg
 1: chr1     1   1   5
 2: chr1     3   2   5
 3: chr1     1   3   5
 4: chr1     3   4   8
 5: chr1     5   5   8
 6: chr2     6   1  10
 7: chr2     3   2  10
 8: chr2     1   3  10
 9: chr2     3   4   8
10: chr2     5   5   8
11: chr3     0   6   0

答案4

得分: 1

将行号(在Chr分组内)除以3并四舍五入得到的比例为前3个值为1,接下来的3个值为2,依此类推。然后,您可以按Chr和此变量进行分组以计算总和:

英文:

Diving the row number (within Chr groups) by 3 and rounding up this ratio gives 1 for the first 3 values, 2 for the next 3 values, etc. You can then group by Chr and this variable to compute your sums:

d %&gt;% 
  group_by(Chr) %&gt;%
  mutate(Chr_group = ceiling(row_number()/3)) %&gt;% 
  group_by(Chr, Chr_group) %&gt;%
  mutate(agg = sum(value)) %&gt;%
  ungroup()

huangapple
  • 本文由 发表于 2023年2月19日 15:36:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/75498640.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定