Numbering rows within groups in a data frame, but in relation to the blocks of rows with the same value
Question
I have data such as this.
set.seed(10)
df <- data.frame(row=c(1:15), cat = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("b", 3), rep("a", 3)), val = runif(15))
df
row cat val
1 1 a 0.53559704
2 2 a 0.09308813
3 3 a 0.16980304
4 4 b 0.89983245
5 5 b 0.42263761
6 6 b 0.74774647
7 7 c 0.82265258
8 8 c 0.95465365
9 9 c 0.68544451
10 10 b 0.50050323
11 11 b 0.27548386
12 12 b 0.22890394
13 13 a 0.01443391
14 14 a 0.72896456
15 15 a 0.24988047
Notice that there are "blocks" of the same category in consecutive rows: category "a" occupies rows 1-3 and again rows 13-15. I need to add a column that labels each row with its category plus the number of the block it belongs to, counted separately within each category. This cannot be done simply by using group_by(cat) in dplyr. The desired output is:
row cat val cat2
1 1 a 0.53559704 a1
2 2 a 0.09308813 a1
3 3 a 0.16980304 a1
4 4 b 0.89983245 b1
5 5 b 0.42263761 b1
6 6 b 0.74774647 b1
7 7 c 0.82265258 c1
8 8 c 0.95465365 c1
9 9 c 0.68544451 c1
10 10 b 0.50050323 b2
11 11 b 0.27548386 b2
12 12 b 0.22890394 b2
13 13 a 0.01443391 a2
14 14 a 0.72896456 a2
15 15 a 0.24988047 a2
Does anyone know how to do this in the tidyverse?
I tried various solutions, but the best I could find is this:
df %>%
mutate(run = with(rle(cat), rep(seq_along(lengths), lengths))) %>%
group_by(cat, run) %>%
mutate(cat2 = paste0(cat, run)) %>%
ungroup() %>%
select(-run)
But this gives me the output below, which is not what I want: the block number keeps increasing across all categories instead of restarting at 1 within each category.
row cat val cat2
<int> <chr> <dbl> <chr>
1 1 a 0.536 a1
2 2 a 0.0931 a1
3 3 a 0.170 a1
4 4 b 0.900 b2
5 5 b 0.423 b2
6 6 b 0.748 b2
7 7 c 0.823 c3
8 8 c 0.955 c3
9 9 c 0.685 c3
10 10 b 0.501 b4
11 11 b 0.275 b4
12 12 b 0.229 b4
13 13 a 0.0144 a5
14 14 a 0.729 a5
15 15 a 0.250 a5
Answer 1
Score: 2
Using dplyr::consecutive_id (introduced in dplyr 1.1.0 and inspired by data.table::rleid) twice, you could do:
library(dplyr)
df %>%
mutate(run = consecutive_id(cat)) %>%
mutate(cat2 = paste0(cat, consecutive_id(run)), .by = cat) %>%
select(-run)
The result is:
row cat val cat2
1 1 a 0.50747820 a1
2 2 a 0.30676851 a1
3 3 a 0.42690767 a1
4 4 b 0.69310208 b1
5 5 b 0.08513597 b1
6 6 b 0.22543662 b1
7 7 c 0.27453052 c1
8 8 c 0.27230507 c1
9 9 c 0.61582931 c1
10 10 b 0.42967153 b2
11 11 b 0.65165567 b2
12 12 b 0.56773775 b2
13 13 a 0.11350898 a2
14 14 a 0.59592531 a2
15 15 a 0.35804998 a2
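To see why applying a run id twice works, it can help to look at the intermediate values. The following is a minimal base R sketch that mimics the two steps with rle() and ave() instead of consecutive_id. It is not the answer's code, and the helper name run_id is made up for illustration.

```r
# Category column from the question's example data
cat_col <- rep(c("a", "b", "c", "b", "a"), each = 3)

# Hypothetical helper: a run id that increases by 1 whenever the value changes
# (a base R stand-in for a run-length id on an atomic vector without NAs)
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Step 1: global run id over the whole column
run <- run_id(cat_col)
run
# 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

# Step 2: renumber the runs separately within each category
block <- ave(run, cat_col, FUN = run_id)
cat2 <- paste0(cat_col, block)
cat2
# "a1" "a1" "a1" "b1" "b1" "b1" "c1" "c1" "c1" "b2" "b2" "b2" "a2" "a2" "a2"
```

Within category "a" the global run ids are 1 1 1 5 5 5, so taking a run id of those values again yields 1 1 1 2 2 2, which is the per-category block number.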
Answer 2
Score: 2
You can use the row column to create the output.
library(dplyr)
df %>%
mutate(cat2 = paste0(cat, cumsum(c(TRUE, diff(row) > 1))), .by = cat)
# row cat val cat2
#1 1 a 0.50747820 a1
#2 2 a 0.30676851 a1
#3 3 a 0.42690767 a1
#4 4 b 0.69310208 b1
#5 5 b 0.08513597 b1
#6 6 b 0.22543662 b1
#7 7 c 0.27453052 c1
#8 8 c 0.27230507 c1
#9 9 c 0.61582931 c1
#10 10 b 0.42967153 b2
#11 11 b 0.65165567 b2
#12 12 b 0.56773775 b2
#13 13 a 0.11350898 a2
#14 14 a 0.59592531 a2
#15 15 a 0.35804998 a2
cumsum(c(TRUE, diff(row) > 1)) creates a counter starting from 1 that increments every time the difference between consecutive row values within a category is greater than 1. For cat = a, this is how it works:
x <- c(1, 2, 3, 13, 14, 15)
cumsum(c(TRUE, diff(x) > 1))
#[1] 1 1 1 2 2 2
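This approach relies on row holding consecutive integers in the current order of the data. If the data has no such column, one option is to apply the same cumsum/diff idea to the row positions. The following is a minimal base R sketch under that assumption; the names cat_col, pos and block are made up for illustration.

```r
# Category column from the question's example data
cat_col <- rep(c("a", "b", "c", "b", "a"), each = 3)

# Row positions in the current order (these play the role of the row column)
pos <- seq_along(cat_col)

# Within each category, start a new block whenever the position jumps by more than 1
block <- ave(pos, cat_col, FUN = function(i) cumsum(c(TRUE, diff(i) > 1)))

cat2 <- paste0(cat_col, block)
cat2
# "a1" "a1" "a1" "b1" "b1" "b1" "c1" "c1" "c1" "b2" "b2" "b2" "a2" "a2" "a2"
```

Because the positions are computed from the data's current order, the result would change if the rows were sorted differently beforehand.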