Aggregate column - how to handle uneven dataframe
Question
I want to aggregate every three rows within each Chr group (the sum of each block of three rows). However, since the number of rows in my df is not divisible by 3, I am not sure how to handle the last remaining rows, which could be just 1 or 2 rows. If two rows remain, I would like to sum just those 2 rows.
- Input
df <- data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
value = c(1,3,1,3,5,6,3,1,3,5,0),
seq = c(1,2,3,4,5,1,2,3,4,5,6))
- Output (using dplyr mutate, keeping all columns)
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
value = c(1,3,1,3,5,6,3,1,3,5,0),
seq = c(1,2,3,4,5,1,2,3,4,5,6),
agg = c(5,5,5,8,8,10,10,10,8,8,0))
Answer 1
Score: 2
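You can use (row_number()-1) %/% 3 within each Chr to put every 3 consecutive observations into the same group. When the number of rows is not a multiple of 3, the last group simply contains the 1 or 2 remaining rows.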
library(dplyr)  # the .by argument requires dplyr >= 1.1.0
df %>%
mutate(Grp = (row_number()-1) %/% 3 + 1, .by = Chr) %>%
mutate(agg = sum(value), .by = c(Chr, Grp))
# Chr value seq Grp agg
# 1 chr1 1 1 1 5
# 2 chr1 3 2 1 5
# 3 chr1 1 3 1 5
# 4 chr1 3 4 2 8
# 5 chr1 5 5 2 8
# 6 chr2 6 1 1 10
# 7 chr2 3 2 1 10
# 8 chr2 1 3 1 10
# 9 chr2 3 4 2 8
# 10 chr2 5 5 2 8
# 11 chr3 0 6 1 0
If the Grp column is not needed, you can drop it with select(-Grp).
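To see why this index works, here is a minimal base R sketch (not part of the original answer) that applies the same arithmetic to plain position vectors. The helper name block_index is hypothetical and only used for illustration.

```r
# Hypothetical helper: map positions 1..n to block numbers of a given size.
# Same arithmetic as (row_number() - 1) %/% 3 + 1 in the dplyr pipeline.
block_index <- function(n, size = 3) {
  (seq_len(n) - 1) %/% size + 1
}

block_index(5)  # 1 1 1 2 2      -> the last block has only two rows
block_index(7)  # 1 1 1 2 2 2 3  -> the last block has a single row
block_index(1)  # 1
```

Because incomplete blocks still get their own index, sum() over each block needs no special handling for the leftover rows.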
Answer 2
Score: 2
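Here is a similar but slightly different approach that tries out the new .by argument. I also tried to use .by for the last mutate, but it cannot be combined with cumsum(...): .by only accepts existing columns, whereas group_by() can create the grouping variable on the fly.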
library(dplyr)
df %>%
mutate(group = as.integer(gl(n(),3,n())), .by=Chr) %>%
mutate(id = row_number(), .by = c(Chr, group)) %>%
group_by(Chr, sumgroup = cumsum(id == 1)) %>%
mutate(agg = sum(value)) %>%
ungroup() %>%
select(Chr, value, seq, agg)
Chr value seq agg
<chr> <dbl> <dbl> <dbl>
1 chr1 1 1 5
2 chr1 3 2 5
3 chr1 1 3 5
4 chr1 3 4 8
5 chr1 5 5 8
6 chr2 6 1 10
7 chr2 3 2 10
8 chr2 1 3 10
9 chr2 3 4 8
10 chr2 5 5 8
11 chr3 0 6 0
Answer 3
Score: 2
An option with dplyr & ave
library(dplyr) # >= 1.1.0
df1 %>%
mutate(agg = ave(value, as.integer(gl(n(), 3, n())), FUN = sum), .by = Chr)
Output:
Chr value seq agg
1 chr1 1 1 5
2 chr1 3 2 5
3 chr1 1 3 5
4 chr1 3 4 8
5 chr1 5 5 8
6 chr2 6 1 10
7 chr2 3 2 10
8 chr2 1 3 10
9 chr2 3 4 8
10 chr2 5 5 8
11 chr3 0 6 0
Or with data.table
library(data.table)
setDT(df1)[, agg := .SD[, rep(sum(value), .N),
as.integer(gl(.N, 3, .N))]$V1, Chr]
Output:
> df1
Chr value seq agg
1: chr1 1 1 5
2: chr1 3 2 5
3: chr1 1 3 5
4: chr1 3 4 8
5: chr1 5 5 8
6: chr2 6 1 10
7: chr2 3 2 10
8: chr2 1 3 10
9: chr2 3 4 8
10: chr2 5 5 8
11: chr3 0 6 0
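For comparison, the same grouping idea can be sketched in base R alone, using ave() twice. This is an illustrative variant rather than the code from the answer; the intermediate name blk is an assumption introduced here.

```r
# Data from the question.
df1 <- data.frame(
  Chr   = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
  value = c(1,3,1,3,5,6,3,1,3,5,0),
  seq   = c(1,2,3,4,5,1,2,3,4,5,6)
)

# Block number within each Chr: rows 1-3 -> 0, rows 4-6 -> 1, and so on.
blk <- ave(seq_along(df1$Chr), df1$Chr,
           FUN = function(i) (seq_along(i) - 1) %/% 3)

# Sum of value within each (Chr, block) pair, repeated on every row.
df1$agg <- ave(df1$value, df1$Chr, blk, FUN = sum)

df1$agg  # 5 5 5 8 8 10 10 10 8 8 0
```

ave() returns a vector of the same length as its input, which is why the result can be assigned directly as a new column without dropping any rows.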
Answer 4
Score: 1
Dividing the row number (within Chr groups) by 3 and rounding this ratio up gives 1 for the first 3 values, 2 for the next 3 values, etc. You can then group by Chr and this variable to compute your sums:
library(dplyr)
# d is the input data frame from the question
d %>%
group_by(Chr) %>%
mutate(Chr_group = ceiling(row_number()/3)) %>%
group_by(Chr, Chr_group) %>%
mutate(agg = sum(value)) %>%
ungroup()
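The answers in this thread build the block index in three different ways: ceiling(row_number()/3), (row_number()-1) %/% 3 + 1, and as.integer(gl(n(), 3, n())). As a quick sanity check, the following base R sketch compares the three on plain position vectors. The function names are hypothetical, chosen only for this illustration.

```r
# Three ways to number consecutive blocks of three positions.
idx_ceiling <- function(n) ceiling(seq_len(n) / 3)
idx_intdiv  <- function(n) (seq_len(n) - 1) %/% 3 + 1
idx_gl      <- function(n) as.integer(gl(n, 3, n))

# Compare them for several lengths, including ones not divisible by 3.
same <- vapply(1:10, function(n) {
  all(idx_ceiling(n) == idx_intdiv(n)) && all(idx_intdiv(n) == idx_gl(n))
}, logical(1))

all(same)  # TRUE
```

Since all three produce the same indices, the choice between them is mostly a matter of readability.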