Numbering rows within groups in a data frame, but in relation to the blocks of rows with the same value

Question


I have data such as this.

set.seed(10) 
df <- data.frame(row=c(1:15), cat = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("b", 3), rep("a", 3)), val = runif(15))
df
   row cat        val
1    1   a 0.53559704
2    2   a 0.09308813
3    3   a 0.16980304
4    4   b 0.89983245
5    5   b 0.42263761
6    6   b 0.74774647
7    7   c 0.82265258
8    8   c 0.95465365
9    9   c 0.68544451
10  10   b 0.50050323
11  11   b 0.27548386
12  12   b 0.22890394
13  13   a 0.01443391
14  14   a 0.72896456
15  15   a 0.24988047

Notice that there are "blocks" of the same category in consecutive rows. So the category "a" is in rows 1-3 and also in rows 13-15. What I need is to add a column that labels each row with its category and the number of the block of that category it belongs to. This cannot be done simply by using group_by(cat) in dplyr, because that pools all rows of a category regardless of which block they are in. The desired output looks like this:

   row cat        val  cat2
1    1   a 0.53559704    a1
2    2   a 0.09308813    a1
3    3   a 0.16980304    a1
4    4   b 0.89983245    b1
5    5   b 0.42263761    b1
6    6   b 0.74774647    b1
7    7   c 0.82265258    c1
8    8   c 0.95465365    c1
9    9   c 0.68544451    c1
10  10   b 0.50050323    b2
11  11   b 0.27548386    b2
12  12   b 0.22890394    b2
13  13   a 0.01443391    a2
14  14   a 0.72896456    a2
15  15   a 0.24988047    a2

Does anyone have an idea how to do this in tidyverse?

I tried various solutions, but the best thing I could find is using this:

df %>%
  mutate(run = with(rle(cat), rep(seq_along(lengths), lengths))) %>%
  group_by(cat, run) %>%
  mutate(cat2 = paste0(cat, run)) %>%
  ungroup() %>%
  select(-run)

But this gives me the output below, which is not what I want: the blocks are numbered across the whole data frame, so each block gets a higher number than the previous one instead of restarting at 1 for each category.

     row cat      val cat2 
   <int> <chr>  <dbl> <chr>
 1     1 a     0.536  a1   
 2     2 a     0.0931 a1   
 3     3 a     0.170  a1   
 4     4 b     0.900  b2   
 5     5 b     0.423  b2   
 6     6 b     0.748  b2   
 7     7 c     0.823  c3   
 8     8 c     0.955  c3   
 9     9 c     0.685  c3   
10    10 b     0.501  b4   
11    11 b     0.275  b4   
12    12 b     0.229  b4   
13    13 a     0.0144 a5   
14    14 a     0.729  a5   
15    15 a     0.250  a5  
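
To see where the ever-increasing numbers come from, it may help to look at what rle() returns for the cat column on its own. The following is a minimal base-R sketch; cat_vec, r and run are illustrative names that are not part of the original code.

```r
# The category column from the example data
cat_vec <- rep(c("a", "b", "c", "b", "a"), each = 3)

# rle() describes the vector as runs of identical neighbouring values
r <- rle(cat_vec)
r$values   # one entry per block, in order of appearance
r$lengths  # the number of rows in each block

# Numbering the runs with seq_along() counts blocks over the whole vector,
# so the second "b" block is run 4 and the second "a" block is run 5
run <- rep(seq_along(r$lengths), r$lengths)
run
```

Because run is a global counter, pasting it onto cat yields a1, b2, c3, b4, a5. What is needed instead is a counter that restarts for each category.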

Answer 1

Score: 2

Using dplyr::consecutive_id (introduced in dplyr 1.1.0 and inspired by data.table::rleid) twice, you could do:

library(dplyr)

df %>%
  mutate(run = consecutive_id(cat)) %>%
  mutate(cat2 = paste0(cat, consecutive_id(run)), .by = cat) %>%
  select(-run)

   row cat        val cat2
1    1   a 0.50747820   a1
2    2   a 0.30676851   a1
3    3   a 0.42690767   a1
4    4   b 0.69310208   b1
5    5   b 0.08513597   b1
6    6   b 0.22543662   b1
7    7   c 0.27453052   c1
8    8   c 0.27230507   c1
9    9   c 0.61582931   c1
10  10   b 0.42967153   b2
11  11   b 0.65165567   b2
12  12   b 0.56773775   b2
13  13   a 0.11350898   a2
14  14   a 0.59592531   a2
15  15   a 0.35804998   a2
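
The same two-step idea (a global run id first, then renumbering the runs within each category) can also be sketched in base R, which might be useful where dplyr 1.1.0 or later is not available. This is an illustrative alternative rather than part of the answer above; the names cat_chr, run and block are assumptions, and the val column is left out because it does not affect the labels.

```r
# Same layout as the example data (val omitted, it is not needed here)
df <- data.frame(
  row = 1:15,
  cat = rep(c("a", "b", "c", "b", "a"), each = 3)
)

# rle() does not accept factors, so make sure we work on a character vector
cat_chr <- as.character(df$cat)

# Step 1: global run id, one number per block of identical neighbours
r <- rle(cat_chr)
run <- rep(seq_along(r$lengths), r$lengths)

# Step 2: within each category, renumber the run ids in order of appearance
block <- ave(run, cat_chr, FUN = function(x) match(x, unique(x)))

df$cat2 <- paste0(cat_chr, block)
df$cat2
```

For category "a" the global run ids are 1, 1, 1, 5, 5, 5, and match(x, unique(x)) maps them to 1, 1, 1, 2, 2, 2, which is the role the second consecutive_id() call plays in the dplyr version.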

Answer 2

Score: 2

You can use the row column to create the output.

library(dplyr)

df %>%
  mutate(cat2 = paste0(cat, cumsum(c(TRUE, diff(row) > 1))), .by = cat)

#   row cat        val cat2
#1    1   a 0.50747820   a1
#2    2   a 0.30676851   a1
#3    3   a 0.42690767   a1
#4    4   b 0.69310208   b1
#5    5   b 0.08513597   b1
#6    6   b 0.22543662   b1
#7    7   c 0.27453052   c1
#8    8   c 0.27230507   c1
#9    9   c 0.61582931   c1
#10  10   b 0.42967153   b2
#11  11   b 0.65165567   b2
#12  12   b 0.56773775   b2
#13  13   a 0.11350898   a2
#14  14   a 0.59592531   a2
#15  15   a 0.35804998   a2

cumsum(c(TRUE, diff(row) > 1)) creates a counter that starts at 1 and is incremented every time the difference between consecutive row values is greater than 1. For cat = a, this is how it works.

x <- c(1, 2, 3, 13, 14, 15)
cumsum(c(TRUE, diff(x) > 1))
#[1] 1 1 1 2 2 2
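
Note that this approach relies on row holding consecutive integers in the order of the data. If no such column exists, one way to get the same effect is to create a position index before the grouped step, since row_number() evaluated inside a mutate() with .by = cat restarts within each category and would hide the gaps. The sketch below assumes dplyr 1.1.0 or later for .by; df_no_row, idx and res are hypothetical names used only for illustration.

```r
library(dplyr)

# Example data without a row column
df_no_row <- data.frame(cat = rep(c("a", "b", "c", "b", "a"), each = 3))

res <- df_no_row %>%
  # Position in the whole data frame, computed before any grouping
  mutate(idx = row_number()) %>%
  # Within each category, a gap in idx marks the start of a new block
  mutate(cat2 = paste0(cat, cumsum(c(TRUE, diff(idx) > 1))), .by = cat) %>%
  select(-idx)

res$cat2
```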

huangapple
  • Posted on May 6, 2023, 17:00:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76188023.html