在数据框中基于重复数字序列创建标识列。

huangapple go评论70阅读模式
英文:

Create identifier column based on repeating sequence of numbers in data frame

问题

我有一个监测数据集,记录了不同主体在21天(大多数情况下)的周期内的体重 - 然而,有时周期会缩短(例如,20天)或延长(例如,28天)。请参见示例:

set.seed(20)

df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100), 
                 subject2 = runif(n=1:90, min = 1, max = 100), 
                 subject3 = runif(n=1:90, min = 1, max = 100), 
                 day = c(rep(1:21, 2), 1:28, 1:20))
df

我想创建一个带有“批次ID”的列,每次周期重新开始时增加一个数字,得到类似这样的结果:

set.seed(20)

df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100), 
                 subject2 = runif(n=1:90, min = 1, max = 100), 
                 subject3 = runif(n=1:90, min = 1, max = 100), 
                 day = c(rep(1:21, 2), 1:28, 1:20), 
                 ID = c(rep("批次 1", 21), rep("批次 2", 21), rep("批次 3", 28), rep("批次 4", 20)))
df

我不知道从哪里开始。数据已经收集了很多年,df非常长,这就是为什么我需要一个自动化方法的原因。我通常使用dplyr,但是其他语言的解决方案也受欢迎。

英文:

I have a monitoring data set which records weights for different subjects in cycles of 21 days (mostly) - however, sometimes the cycle i shortened (e.g. 20 days) or expanded (e.g. 28 days). See example:

set.seed(20)

df &lt;- data.frame(subject1 = runif(n=1:90, min = 1, max = 100), 
                 subject2 = runif(n=1:90, min = 1, max = 100), 
                 subject3 = runif(n=1:90, min = 1, max = 100), 
                 day = c(rep(1:21, 2), 1:28, 1:20))
df

I would like to create a column with a "batch ID" which increase in number each time the cycle starts over, getting something like this:

set.seed(20)

df &lt;- data.frame(subject1 = runif(n=1:90, min = 1, max = 100), 
                 subject2 = runif(n=1:90, min = 1, max = 100), 
                 subject3 = runif(n=1:90, min = 1, max = 100), 
                 day = c(rep(1:21, 2), 1:28, 1:20), 
                 ID = c(rep(&quot;batch 1&quot;, 21), rep(&quot;batch 2&quot;, 21), rep(&quot;batch 3&quot;, 28), rep(&quot;batch 4&quot;, 20)))
df

I don't know where to start. The data has been collected for many years and the df is extremely long, which is why I need an automated method for this.

I usually use dplyr, however solutions in all languages are welcome.

答案1

得分: 1

这是代码的一部分,其中用到了R语言中的cumsum(day==1)来创建新的分组。每当day等于1时,就会开始一个新的分组,并使用"batch"作为前缀。

英文:

It is a kind of starting a new group. We can do it with cumsum(day==1). Eachtime day is 1, a new group is started:

library(dplyr)

df %&gt;% 
  mutate(ID = cumsum(day==1),
         ID = paste(&quot;batch&quot;, ID))
   subject1  subject2  subject3 day      ID
1  28.112304 23.747643  2.506307   1 batch 1
2  14.707646 64.856926 64.934039   2 batch 1
3  63.048533 72.325234 68.608072   3 batch 1
4  94.326426 75.051280  6.784585   4 batch 1
5  27.176812 31.487133 56.319612   5 batch 1
6  94.341248 59.129401  7.002702   6 batch 1
7  97.219212 89.005288 51.363620   7 batch 1
8  14.826730 11.091299 72.477603   8 batch 1
9  14.250449 84.572366 51.806585   9 batch 1
10  8.731178 50.483787 63.905433  10 batch 1
11 24.595016 54.690629 29.220187  11 batch 1
12 66.109872 38.976594 57.952008  12 batch 1
13 69.709213 28.246633 96.082079  13 batch 1
14 98.662153 39.431172 38.304419  14 batch 1
15 35.180087 19.315271 94.495912  15 batch 1
16 45.465575 31.429093 86.362035  16 batch 1
17 88.569241 65.789032 15.264510  17 batch 1
18 41.317364 19.413599 36.570227  18 batch 1
19 93.470092 61.837675 28.270692  19 batch 1
20 99.405225  1.825664 46.233582  20 batch 1
21 98.778535 10.493757  3.473848  21 batch 1
22 94.813979 15.090318 10.536749   1 batch 2
23  1.580498 43.355038 47.193968   2 batch 2
24  8.171572 82.660440 99.404702   3 batch 2
25 10.791114 40.136418  7.615652   4 batch 2
26 28.782935 61.110801 15.203424   5 batch 2
27 26.170606 60.001865 21.975446   6 batch 2
28 98.073649  4.205692 97.422975   7 batch 2
29 23.640042  5.384184 18.883684   8 batch 2
30 18.924822 78.691799 15.707679   9 batch 2
31 63.542991 83.722546 81.379621  10 batch 2
32 65.861969 51.918785 14.969885  11 batch 2
33 59.136869 45.680453 77.921039  12 batch 2
34 43.916728 96.927358 71.636865  13 batch 2
35 84.404238 25.997312 27.519646  14 batch 2
36  1.343036 38.276619 74.979714  15 batch 2
37 54.025242 68.931825 67.697015  16 batch 2
38  9.654027 45.233349 87.036974  17 batch 2
39 90.297132 78.244482 43.231548  18 batch 2
40 53.228791 49.865308 86.774027  19 batch 2
41 72.950380 74.636688 45.980257  20 batch 2
42 15.766594 32.927134 96.377100  21 batch 2
43 65.280840 33.047331 53.369068   1 batch 3
44 41.419024 84.291542  1.055891   2 batch 3
45 64.189356 97.864732  8.435106   3 batch 3
46 64.782105 82.842300 80.991830   4 batch 3
47 61.184364 20.249522 13.622524   5 batch 3
48 20.608730 74.588932 91.931483   6 batch 3
49 80.901141 28.992681 72.589333   7 batch 3
50  4.084888 30.209502 63.598710   8 batch 3
51 95.588275 74.761589 62.743750   9 batch 3
52 62.271646  9.675652 76.652212  10 batch 3
53 43.174593 94.037260 42.026702  11 batch 3
54 53.207426 93.448598 61.360762  12 batch 3
55 19.355807 35.463982 53.029776  13 batch 3
56 13.580139 97.384648 80.816749  14 batch 3
57 32.219313 26.256102 45.672542  15 batch 3
58 33.894195 73.801500  8.051782  16 batch 3
59 45.671515 51.298810  7.600933  17 batch 3
60 24.499980 62.102165 86.877326  18 batch 3
61 60.893774 30.891038 16.634176  19 batch 3
62 89.360993 78.245477 29.801789  20 batch 3
63 57.501857 19.254282 20.347036  21 batch 3
64 80.004186 17.583524 30.764541  22 batch 3
65 71.140736 76.311887 30.298813  23 batch 3
66 72.739952 13.229765 56.278070  24 batch 3
67 62.314198 79.444292  8.105422  25 batch 3
68 32.037481 10.490401 92.195441  26 batch 3
69 66.342709 48.151444 26.442403  27 batch 3
70 65.165316  9.942519  6.828309  28 batch 3
71 45.501737 15.926725 93.197676   1 batch 4
72 48.523332  1.570688 55.853054   2 batch 4
73 26.508309 52.256063 93.128411   3 batch 4
74 97.476880 35.398011 89.341347   4 batch 4
75 17.123732 84.779030 43.547328   5 batch 4
76 18.453889 59.836859 38.898992   6 batch 4
77  1.419466  1.403754 75.434308   7 batch 4
78 21.803886 91.401409 84.694884   8 batch 4
79 61.964926 93.287858 36.304794   9 batch 4
80 81.242887 82.993459  3.940457  10 batch 4
81 57.470248  2.156846 16.255565  11 batch 4
82 79.895440 46.352555  7.666849  12 batch 4
83 38.304018 46.883902 28.208153  13 batch 4
84 92.238729 35.422918 99.706443  14 batch 4
85 60.493556 71.277974 52.243233  15 batch 4
86 99.128086 36.706295 60.799051  16 batch 4
87 29.826266 61.463732 86.887740  17 batch 4
88 46.486269 86.044449 73.024360  18 batch 4
89 96.568415  6.379009 58.893206  19 batch 4
90 92.866674 53.126197 59.812420  20 batch 4

答案2

得分: 0

以下是您要翻译的内容:

新的分组开始,如果其"day"值小于其前导值。这适用于不等大小的分组。然后,您可以应用一种运行长度编码:

library(tidyverse)

set.seed(20)

df <- data.frame(
  subject1 = runif(n = 1:10, min = 1, max = 100),
  subject2 = runif(n = 1:10, min = 1, max = 100),
  subject3 = runif(n = 1:10, min = 1, max = 100),
  day = c(1:10, 1:12, 1:8)
)

df <-
  df |
  as_tibble() |
  mutate(batch_flag = (day > lag(day)) |> replace_na(TRUE))

day_rle <- df$batch_flag |> rle()
day_rle$values <-
  day_rle$values |> 
  length() |> 
  seq()
day_batches <- day_rle |> inverse.rle()

df <-
  df |
  mutate(
    batch = day_batches |>
      map_dbl(~ ifelse(.x %% 2 == 0, .x + 1, .x)) |>
      map_dbl(~ ceiling(.x / 2))
  ) |
  select(-batch_flag)

df |> print(n = 30)
#> # A tibble: 30 × 5
#>    subject1 subject2 subject3   day batch
#>       <dbl>    <dbl>    <dbl> <int> <dbl>
#>  1    87.9     71.8     49.7      1     1
#>  2    77.1     76.0      4.00     2     1
#>  3    28.6      1.19    44.6      3     1
#>  4    53.4     74.5      8.65     4     1
#>  5    96.3     20.0     27.2      5     1
#>  6    98.1     45.8      7.89     6     1
#>  7    10.0     32.9     90.8      7     1
#>  8     8.00    11.8     99.2      8     1
#>  9    33.4     29.6      7.34     9     1
#> 10    37.6     82.1     67.8     10     1
#> 11    87.9     71.8     49.7      1     2
#> 12    77.1     76.0      4.00     2     2
#> 13    28.6      1.19    44.6      3     2
#> 14    53.4     74.5      8.65     4     2
#> 15    96.3     20.0     27.2      5     2
#> 16    98.1     45.8      7.89     6     2
#> 17    10.0     32.9     90.8      7     2
#> 18     8.00    11.8     99.2      8     2
#> 19    33.4     29.6      7.34     9     2
#> 20    37.6     82.1     67.8     10     2
#> 21    87.9     71.8     49.7     11     2
#> 22    77.1     76.0      4.00    12     2
#> 23    28.6      1.19    44.6      1     3
#> 24    53.4     74.5      8.65     2     3
#> 25    96.3     20.0     27.2      3     3
#> 26    98.1     45.8      7.89     4     3
#> 27    10.0     32.9     90.8      5     3
#> 28     8.00    11.8     99.2      6     3
#> 29    33.4     29.6      7.34     7     3
#> 30    37.6     82.1     67.8      8     3

创建于2023-05-13,使用reprex包(v2.0.1)

英文:

A new group starts, if its day value is smaller than its precursor. This works even for groups of unequal sizes. Then you can apply a run length encoding:

library(tidyverse)

set.seed(20)

df &lt;- data.frame(
  subject1 = runif(n = 1:10, min = 1, max = 100),
  subject2 = runif(n = 1:10, min = 1, max = 100),
  subject3 = runif(n = 1:10, min = 1, max = 100),
  day = c(1:10, 1:12, 1:8)
)

df &lt;-
  df |&gt;
  as_tibble() |&gt;
  mutate(batch_flag = (day &gt; lag(day)) |&gt; replace_na(TRUE))

day_rle &lt;- df$batch_flag |&gt; rle()
day_rle$values &lt;-
  day_rle$values |&gt;
  length() |&gt;
  seq()
day_batches &lt;- day_rle |&gt; inverse.rle()

df &lt;-
  df |&gt;
  mutate(
    batch = day_batches |&gt;
      map_dbl(~ ifelse(.x %% 2 == 0, .x + 1, .x)) |&gt;
      map_dbl(~ ceiling(.x / 2))
  ) |&gt;
  select(-batch_flag)

df |&gt; print(n = 30)
#&gt; # A tibble: 30 &#215; 5
#&gt;    subject1 subject2 subject3   day batch
#&gt;       &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
#&gt;  1    87.9     71.8     49.7      1     1
#&gt;  2    77.1     76.0      4.00     2     1
#&gt;  3    28.6      1.19    44.6      3     1
#&gt;  4    53.4     74.5      8.65     4     1
#&gt;  5    96.3     20.0     27.2      5     1
#&gt;  6    98.1     45.8      7.89     6     1
#&gt;  7    10.0     32.9     90.8      7     1
#&gt;  8     8.00    11.8     99.2      8     1
#&gt;  9    33.4     29.6      7.34     9     1
#&gt; 10    37.6     82.1     67.8     10     1
#&gt; 11    87.9     71.8     49.7      1     2
#&gt; 12    77.1     76.0      4.00     2     2
#&gt; 13    28.6      1.19    44.6      3     2
#&gt; 14    53.4     74.5      8.65     4     2
#&gt; 15    96.3     20.0     27.2      5     2
#&gt; 16    98.1     45.8      7.89     6     2
#&gt; 17    10.0     32.9     90.8      7     2
#&gt; 18     8.00    11.8     99.2      8     2
#&gt; 19    33.4     29.6      7.34     9     2
#&gt; 20    37.6     82.1     67.8     10     2
#&gt; 21    87.9     71.8     49.7     11     2
#&gt; 22    77.1     76.0      4.00    12     2
#&gt; 23    28.6      1.19    44.6      1     3
#&gt; 24    53.4     74.5      8.65     2     3
#&gt; 25    96.3     20.0     27.2      3     3
#&gt; 26    98.1     45.8      7.89     4     3
#&gt; 27    10.0     32.9     90.8      5     3
#&gt; 28     8.00    11.8     99.2      6     3
#&gt; 29    33.4     29.6      7.34     7     3
#&gt; 30    37.6     82.1     67.8      8     3

<sup>Created on 2023-05-13 by the reprex package (v2.0.1)</sup>

huangapple
  • 本文由 发表于 2023年5月14日 01:58:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76244195.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定