英文:
Create identifier column based on repeating sequence of numbers in data frame
问题
我有一个监测数据集,记录了不同主体在21天(大多数情况下)的周期内的体重 - 然而,有时周期会缩短(例如,20天)或延长(例如,28天)。请参见示例:
set.seed(20)
df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
subject2 = runif(n=1:90, min = 1, max = 100),
subject3 = runif(n=1:90, min = 1, max = 100),
day = c(rep(1:21, 2), 1:28, 1:20))
df
我想创建一个带有“批次ID”的列,每次周期重新开始时增加一个数字,得到类似这样的结果:
set.seed(20)
df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
subject2 = runif(n=1:90, min = 1, max = 100),
subject3 = runif(n=1:90, min = 1, max = 100),
day = c(rep(1:21, 2), 1:28, 1:20),
ID = c(rep("批次 1", 21), rep("批次 2", 21), rep("批次 3", 28), rep("批次 4", 20)))
df
我不知道从哪里开始。数据已经收集了很多年,df非常长,这就是为什么我需要一个自动化方法的原因。我通常使用dplyr,但是其他语言的解决方案也受欢迎。
英文:
I have a monitoring data set which records weights for different subjects in cycles of 21 days (mostly) - however, sometimes the cycle i shortened (e.g. 20 days) or expanded (e.g. 28 days). See example:
set.seed(20)
df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
subject2 = runif(n=1:90, min = 1, max = 100),
subject3 = runif(n=1:90, min = 1, max = 100),
day = c(rep(1:21, 2), 1:28, 1:20))
df
I would like to create a column with a "batch ID" which increase in number each time the cycle starts over, getting something like this:
set.seed(20)
df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
subject2 = runif(n=1:90, min = 1, max = 100),
subject3 = runif(n=1:90, min = 1, max = 100),
day = c(rep(1:21, 2), 1:28, 1:20),
ID = c(rep("batch 1", 21), rep("batch 2", 21), rep("batch 3", 28), rep("batch 4", 20)))
df
I don't know where to start. The data has been collected for many years and the df is extremely long, which is why I need an automated method for this.
I usually use dplyr, however solutions in all languages are welcome.
答案1
得分: 1
这是代码的一部分,其中用到了R语言中的cumsum(day==1)
来创建新的分组。每当day
等于1时,就会开始一个新的分组,并使用"batch"作为前缀。
英文:
It is a kind of starting a new group. We can do it with cumsum(day==1)
. Eachtime day is 1, a new group is started:
library(dplyr)
df %>%
mutate(ID = cumsum(day==1),
ID = paste("batch", ID))
subject1 subject2 subject3 day ID
1 28.112304 23.747643 2.506307 1 batch 1
2 14.707646 64.856926 64.934039 2 batch 1
3 63.048533 72.325234 68.608072 3 batch 1
4 94.326426 75.051280 6.784585 4 batch 1
5 27.176812 31.487133 56.319612 5 batch 1
6 94.341248 59.129401 7.002702 6 batch 1
7 97.219212 89.005288 51.363620 7 batch 1
8 14.826730 11.091299 72.477603 8 batch 1
9 14.250449 84.572366 51.806585 9 batch 1
10 8.731178 50.483787 63.905433 10 batch 1
11 24.595016 54.690629 29.220187 11 batch 1
12 66.109872 38.976594 57.952008 12 batch 1
13 69.709213 28.246633 96.082079 13 batch 1
14 98.662153 39.431172 38.304419 14 batch 1
15 35.180087 19.315271 94.495912 15 batch 1
16 45.465575 31.429093 86.362035 16 batch 1
17 88.569241 65.789032 15.264510 17 batch 1
18 41.317364 19.413599 36.570227 18 batch 1
19 93.470092 61.837675 28.270692 19 batch 1
20 99.405225 1.825664 46.233582 20 batch 1
21 98.778535 10.493757 3.473848 21 batch 1
22 94.813979 15.090318 10.536749 1 batch 2
23 1.580498 43.355038 47.193968 2 batch 2
24 8.171572 82.660440 99.404702 3 batch 2
25 10.791114 40.136418 7.615652 4 batch 2
26 28.782935 61.110801 15.203424 5 batch 2
27 26.170606 60.001865 21.975446 6 batch 2
28 98.073649 4.205692 97.422975 7 batch 2
29 23.640042 5.384184 18.883684 8 batch 2
30 18.924822 78.691799 15.707679 9 batch 2
31 63.542991 83.722546 81.379621 10 batch 2
32 65.861969 51.918785 14.969885 11 batch 2
33 59.136869 45.680453 77.921039 12 batch 2
34 43.916728 96.927358 71.636865 13 batch 2
35 84.404238 25.997312 27.519646 14 batch 2
36 1.343036 38.276619 74.979714 15 batch 2
37 54.025242 68.931825 67.697015 16 batch 2
38 9.654027 45.233349 87.036974 17 batch 2
39 90.297132 78.244482 43.231548 18 batch 2
40 53.228791 49.865308 86.774027 19 batch 2
41 72.950380 74.636688 45.980257 20 batch 2
42 15.766594 32.927134 96.377100 21 batch 2
43 65.280840 33.047331 53.369068 1 batch 3
44 41.419024 84.291542 1.055891 2 batch 3
45 64.189356 97.864732 8.435106 3 batch 3
46 64.782105 82.842300 80.991830 4 batch 3
47 61.184364 20.249522 13.622524 5 batch 3
48 20.608730 74.588932 91.931483 6 batch 3
49 80.901141 28.992681 72.589333 7 batch 3
50 4.084888 30.209502 63.598710 8 batch 3
51 95.588275 74.761589 62.743750 9 batch 3
52 62.271646 9.675652 76.652212 10 batch 3
53 43.174593 94.037260 42.026702 11 batch 3
54 53.207426 93.448598 61.360762 12 batch 3
55 19.355807 35.463982 53.029776 13 batch 3
56 13.580139 97.384648 80.816749 14 batch 3
57 32.219313 26.256102 45.672542 15 batch 3
58 33.894195 73.801500 8.051782 16 batch 3
59 45.671515 51.298810 7.600933 17 batch 3
60 24.499980 62.102165 86.877326 18 batch 3
61 60.893774 30.891038 16.634176 19 batch 3
62 89.360993 78.245477 29.801789 20 batch 3
63 57.501857 19.254282 20.347036 21 batch 3
64 80.004186 17.583524 30.764541 22 batch 3
65 71.140736 76.311887 30.298813 23 batch 3
66 72.739952 13.229765 56.278070 24 batch 3
67 62.314198 79.444292 8.105422 25 batch 3
68 32.037481 10.490401 92.195441 26 batch 3
69 66.342709 48.151444 26.442403 27 batch 3
70 65.165316 9.942519 6.828309 28 batch 3
71 45.501737 15.926725 93.197676 1 batch 4
72 48.523332 1.570688 55.853054 2 batch 4
73 26.508309 52.256063 93.128411 3 batch 4
74 97.476880 35.398011 89.341347 4 batch 4
75 17.123732 84.779030 43.547328 5 batch 4
76 18.453889 59.836859 38.898992 6 batch 4
77 1.419466 1.403754 75.434308 7 batch 4
78 21.803886 91.401409 84.694884 8 batch 4
79 61.964926 93.287858 36.304794 9 batch 4
80 81.242887 82.993459 3.940457 10 batch 4
81 57.470248 2.156846 16.255565 11 batch 4
82 79.895440 46.352555 7.666849 12 batch 4
83 38.304018 46.883902 28.208153 13 batch 4
84 92.238729 35.422918 99.706443 14 batch 4
85 60.493556 71.277974 52.243233 15 batch 4
86 99.128086 36.706295 60.799051 16 batch 4
87 29.826266 61.463732 86.887740 17 batch 4
88 46.486269 86.044449 73.024360 18 batch 4
89 96.568415 6.379009 58.893206 19 batch 4
90 92.866674 53.126197 59.812420 20 batch 4
答案2
得分: 0
以下是您要翻译的内容:
新的分组开始,如果其"day"值小于其前导值。这适用于不等大小的分组。然后,您可以应用一种运行长度编码:
library(tidyverse)
set.seed(20)
df <- data.frame(
subject1 = runif(n = 1:10, min = 1, max = 100),
subject2 = runif(n = 1:10, min = 1, max = 100),
subject3 = runif(n = 1:10, min = 1, max = 100),
day = c(1:10, 1:12, 1:8)
)
df <-
df |
as_tibble() |
mutate(batch_flag = (day > lag(day)) |> replace_na(TRUE))
day_rle <- df$batch_flag |> rle()
day_rle$values <-
day_rle$values |>
length() |>
seq()
day_batches <- day_rle |> inverse.rle()
df <-
df |
mutate(
batch = day_batches |>
map_dbl(~ ifelse(.x %% 2 == 0, .x + 1, .x)) |>
map_dbl(~ ceiling(.x / 2))
) |
select(-batch_flag)
df |> print(n = 30)
#> # A tibble: 30 × 5
#> subject1 subject2 subject3 day batch
#> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 87.9 71.8 49.7 1 1
#> 2 77.1 76.0 4.00 2 1
#> 3 28.6 1.19 44.6 3 1
#> 4 53.4 74.5 8.65 4 1
#> 5 96.3 20.0 27.2 5 1
#> 6 98.1 45.8 7.89 6 1
#> 7 10.0 32.9 90.8 7 1
#> 8 8.00 11.8 99.2 8 1
#> 9 33.4 29.6 7.34 9 1
#> 10 37.6 82.1 67.8 10 1
#> 11 87.9 71.8 49.7 1 2
#> 12 77.1 76.0 4.00 2 2
#> 13 28.6 1.19 44.6 3 2
#> 14 53.4 74.5 8.65 4 2
#> 15 96.3 20.0 27.2 5 2
#> 16 98.1 45.8 7.89 6 2
#> 17 10.0 32.9 90.8 7 2
#> 18 8.00 11.8 99.2 8 2
#> 19 33.4 29.6 7.34 9 2
#> 20 37.6 82.1 67.8 10 2
#> 21 87.9 71.8 49.7 11 2
#> 22 77.1 76.0 4.00 12 2
#> 23 28.6 1.19 44.6 1 3
#> 24 53.4 74.5 8.65 2 3
#> 25 96.3 20.0 27.2 3 3
#> 26 98.1 45.8 7.89 4 3
#> 27 10.0 32.9 90.8 5 3
#> 28 8.00 11.8 99.2 6 3
#> 29 33.4 29.6 7.34 7 3
#> 30 37.6 82.1 67.8 8 3
创建于2023-05-13,使用reprex包(v2.0.1)
英文:
A new group starts, if its day value is smaller than its precursor. This works even for groups of unequal sizes. Then you can apply a run length encoding:
library(tidyverse)
set.seed(20)
df <- data.frame(
subject1 = runif(n = 1:10, min = 1, max = 100),
subject2 = runif(n = 1:10, min = 1, max = 100),
subject3 = runif(n = 1:10, min = 1, max = 100),
day = c(1:10, 1:12, 1:8)
)
df <-
df |>
as_tibble() |>
mutate(batch_flag = (day > lag(day)) |> replace_na(TRUE))
day_rle <- df$batch_flag |> rle()
day_rle$values <-
day_rle$values |>
length() |>
seq()
day_batches <- day_rle |> inverse.rle()
df <-
df |>
mutate(
batch = day_batches |>
map_dbl(~ ifelse(.x %% 2 == 0, .x + 1, .x)) |>
map_dbl(~ ceiling(.x / 2))
) |>
select(-batch_flag)
df |> print(n = 30)
#> # A tibble: 30 × 5
#> subject1 subject2 subject3 day batch
#> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 87.9 71.8 49.7 1 1
#> 2 77.1 76.0 4.00 2 1
#> 3 28.6 1.19 44.6 3 1
#> 4 53.4 74.5 8.65 4 1
#> 5 96.3 20.0 27.2 5 1
#> 6 98.1 45.8 7.89 6 1
#> 7 10.0 32.9 90.8 7 1
#> 8 8.00 11.8 99.2 8 1
#> 9 33.4 29.6 7.34 9 1
#> 10 37.6 82.1 67.8 10 1
#> 11 87.9 71.8 49.7 1 2
#> 12 77.1 76.0 4.00 2 2
#> 13 28.6 1.19 44.6 3 2
#> 14 53.4 74.5 8.65 4 2
#> 15 96.3 20.0 27.2 5 2
#> 16 98.1 45.8 7.89 6 2
#> 17 10.0 32.9 90.8 7 2
#> 18 8.00 11.8 99.2 8 2
#> 19 33.4 29.6 7.34 9 2
#> 20 37.6 82.1 67.8 10 2
#> 21 87.9 71.8 49.7 11 2
#> 22 77.1 76.0 4.00 12 2
#> 23 28.6 1.19 44.6 1 3
#> 24 53.4 74.5 8.65 2 3
#> 25 96.3 20.0 27.2 3 3
#> 26 98.1 45.8 7.89 4 3
#> 27 10.0 32.9 90.8 5 3
#> 28 8.00 11.8 99.2 6 3
#> 29 33.4 29.6 7.34 7 3
#> 30 37.6 82.1 67.8 8 3
<sup>Created on 2023-05-13 by the reprex package (v2.0.1)</sup>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论