2023-05-06 17:00:40

Numbering rows within groups in a data frame, but in relation to the blocks of rows with the same value

Question

I have data such as this.

set.seed(10)
df <- data.frame(row=c(1:15), cat = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("b", 3), rep("a", 3)), val = runif(15))
df

   row cat        val
1    1   a 0.53559704
2    2   a 0.09308813
3    3   a 0.16980304
4    4   b 0.89983245
5    5   b 0.42263761
6    6   b 0.74774647
7    7   c 0.82265258
8    8   c 0.95465365
9    9   c 0.68544451
10  10   b 0.50050323
11  11   b 0.27548386
12  12   b 0.22890394
13  13   a 0.01443391
14  14   a 0.72896456
15  15   a 0.24988047

Notice that the same category appears in "blocks" of consecutive rows: category "a" is in rows 1-3 and again in rows 13-15. I need to add a column that labels each row with its category plus the number of the block of that category it belongs to, so the first block of "a" becomes "a1" and the second becomes "a2". This cannot be done simply by using group_by(cat) in dplyr. The desired output is:

   row cat        val  cat2
1    1   a 0.53559704    a1
2    2   a 0.09308813    a1
3    3   a 0.16980304    a1
4    4   b 0.89983245    b1
5    5   b 0.42263761    b1
6    6   b 0.74774647    b1
7    7   c 0.82265258    c1
8    8   c 0.95465365    c1
9    9   c 0.68544451    c1
10  10   b 0.50050323    b2
11  11   b 0.27548386    b2
12  12   b 0.22890394    b2
13  13   a 0.01443391    a2
14  14   a 0.72896456    a2
15  15   a 0.24988047    a2

Does anyone have an idea how to do this in tidyverse?

I tried various solutions, but the best thing I could find is using this:

library(dplyr)

df %>%
  mutate(run = with(rle(cat), rep(seq_along(lengths), lengths))) %>%
  group_by(cat, run) %>%
  mutate(cat2 = paste0(cat, run)) %>%
  ungroup() %>%
  select(-run)

But this gives me the output below, which is not what I want: the run counter is global, so each block is labeled with a higher number than the previous one, regardless of its category.

     row cat      val cat2 
   <int> <chr>  <dbl> <chr>
 1     1 a     0.536  a1   
 2     2 a     0.0931 a1   
 3     3 a     0.170  a1   
 4     4 b     0.900  b2   
 5     5 b     0.423  b2   
 6     6 b     0.748  b2   
 7     7 c     0.823  c3   
 8     8 c     0.955  c3   
 9     9 c     0.685  c3   
10    10 b     0.501  b4   
11    11 b     0.275  b4   
12    12 b     0.229  b4   
13    13 a     0.0144 a5   
14    14 a     0.729  a5   
15    15 a     0.250  a5

Answer 1

Score: 2

Using dplyr::consecutive_id() (introduced in dplyr 1.1.0 and inspired by data.table::rleid()) twice, you could do:

library(dplyr)

df %>%
  mutate(run = consecutive_id(cat)) %>%
  mutate(cat2 = paste0(cat, consecutive_id(run)), .by = cat) %>%
  select(-run)

   row cat        val cat2
1    1   a 0.50747820   a1
2    2   a 0.30676851   a1
3    3   a 0.42690767   a1
4    4   b 0.69310208   b1
5    5   b 0.08513597   b1
6    6   b 0.22543662   b1
7    7   c 0.27453052   c1
8    8   c 0.27230507   c1
9    9   c 0.61582931   c1
10  10   b 0.42967153   b2
11  11   b 0.65165567   b2
12  12   b 0.56773775   b2
13  13   a 0.11350898   a2
14  14   a 0.59592531   a2
15  15   a 0.35804998   a2
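
For dplyr versions older than 1.1.0, where consecutive_id() and .by are not available, the same two-step idea can be sketched in base R. This is an illustration rather than part of the answer above: it assumes the rows are already in the desired order, and the names run and block are hypothetical helpers.

```r
# Example data with the same block layout as in the question (val omitted)
df <- data.frame(row = 1:15, cat = rep(c("a", "b", "c", "b", "a"), each = 3))

# Global run id: increases by 1 each time cat changes (1,1,1,2,2,2,...,5,5,5)
r <- rle(as.character(df$cat))
run <- rep(seq_along(r$lengths), r$lengths)

# Within each category, renumber the run ids as 1, 2, ... in order of appearance
block <- ave(run, df$cat, FUN = function(x) match(x, unique(x)))

df$cat2 <- paste0(df$cat, block)
# cat2 is now: a1 a1 a1 b1 b1 b1 c1 c1 c1 b2 b2 b2 a2 a2 a2
```

The first step is the same rle() construction used in the question; the second step replaces the global run number with its rank among the runs of the same category.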

Answer 2

Score: 2

You can use the row column to create the output.

library(dplyr)

df %>%
  mutate(cat2 = paste0(cat, cumsum(c(TRUE, diff(row) > 1))), .by = cat)

#   row cat        val cat2
#1    1   a 0.50747820   a1
#2    2   a 0.30676851   a1
#3    3   a 0.42690767   a1
#4    4   b 0.69310208   b1
#5    5   b 0.08513597   b1
#6    6   b 0.22543662   b1
#7    7   c 0.27453052   c1
#8    8   c 0.27230507   c1
#9    9   c 0.61582931   c1
#10  10   b 0.42967153   b2
#11  11   b 0.65165567   b2
#12  12   b 0.56773775   b2
#13  13   a 0.11350898   a2
#14  14   a 0.59592531   a2
#15  15   a 0.35804998   a2

cumsum(c(TRUE, diff(row) > 1)) creates a counter that starts at 1 and increments every time the difference between consecutive row values within a category is greater than 1, that is, whenever a new block of that category starts. For cat = a, this is how it works:

x <- c(1, 2, 3, 13, 14, 15)
cumsum(c(TRUE, diff(x) > 1))
#[1] 1 1 1 2 2 2
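
The diff(row) approach relies on row holding each row's position in the data frame. If the data has no such column, or the column has gaps, one possible adaptation is to build a position index with row_number() before grouping. This is a sketch under that assumption; df2 and idx are hypothetical names that do not appear in the original answer.

```r
library(dplyr)

# Same category layout as in the question, but without a row column
df2 <- data.frame(cat = rep(c("a", "b", "c", "b", "a"), each = 3))

res <- df2 %>%
  # idx must be created before grouping, so it counts positions in the whole data frame
  mutate(idx = row_number()) %>%
  # a gap larger than 1 in idx within a category marks the start of a new block
  mutate(cat2 = paste0(cat, cumsum(c(TRUE, diff(idx) > 1))), .by = cat) %>%
  select(-idx)

res$cat2
```

Computing idx inside the grouped mutate() would not work, because row_number() would then count within each category and never show a gap.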
