在数据框中基于重复数字序列创建标识列。

huangapple go评论91阅读模式
英文:

Create identifier column based on repeating sequence of numbers in data frame

问题

我有一个监测数据集,记录了不同主体在21天(大多数情况下)的周期内的体重 - 然而,有时周期会缩短(例如,20天)或延长(例如,28天)。请参见示例:

  1. set.seed(20)
  2. df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
  3. subject2 = runif(n=1:90, min = 1, max = 100),
  4. subject3 = runif(n=1:90, min = 1, max = 100),
  5. day = c(rep(1:21, 2), 1:28, 1:20))
  6. df

我想创建一个带有“批次ID”的列,每次周期重新开始时增加一个数字,得到类似这样的结果:

  1. set.seed(20)
  2. df <- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
  3. subject2 = runif(n=1:90, min = 1, max = 100),
  4. subject3 = runif(n=1:90, min = 1, max = 100),
  5. day = c(rep(1:21, 2), 1:28, 1:20),
  6. ID = c(rep("批次 1", 21), rep("批次 2", 21), rep("批次 3", 28), rep("批次 4", 20)))
  7. df

我不知道从哪里开始。数据已经收集了很多年,df非常长,这就是为什么我需要一个自动化方法的原因。我通常使用dplyr,但是其他语言的解决方案也受欢迎。

英文:

I have a monitoring data set which records weights for different subjects in cycles of 21 days (mostly) - however, sometimes the cycle i shortened (e.g. 20 days) or expanded (e.g. 28 days). See example:

  1. set.seed(20)
  2. df &lt;- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
  3. subject2 = runif(n=1:90, min = 1, max = 100),
  4. subject3 = runif(n=1:90, min = 1, max = 100),
  5. day = c(rep(1:21, 2), 1:28, 1:20))
  6. df

I would like to create a column with a "batch ID" which increase in number each time the cycle starts over, getting something like this:

  1. set.seed(20)
  2. df &lt;- data.frame(subject1 = runif(n=1:90, min = 1, max = 100),
  3. subject2 = runif(n=1:90, min = 1, max = 100),
  4. subject3 = runif(n=1:90, min = 1, max = 100),
  5. day = c(rep(1:21, 2), 1:28, 1:20),
  6. ID = c(rep(&quot;batch 1&quot;, 21), rep(&quot;batch 2&quot;, 21), rep(&quot;batch 3&quot;, 28), rep(&quot;batch 4&quot;, 20)))
  7. df

I don't know where to start. The data has been collected for many years and the df is extremely long, which is why I need an automated method for this.

I usually use dplyr, however solutions in all languages are welcome.

答案1

得分: 1

这是代码的一部分,其中用到了R语言中的cumsum(day==1)来创建新的分组。每当day等于1时,就会开始一个新的分组,并使用"batch"作为前缀。

英文:

It is a kind of starting a new group. We can do it with cumsum(day==1). Eachtime day is 1, a new group is started:

  1. library(dplyr)
  2. df %&gt;%
  3. mutate(ID = cumsum(day==1),
  4. ID = paste(&quot;batch&quot;, ID))
  1. subject1 subject2 subject3 day ID
  2. 1 28.112304 23.747643 2.506307 1 batch 1
  3. 2 14.707646 64.856926 64.934039 2 batch 1
  4. 3 63.048533 72.325234 68.608072 3 batch 1
  5. 4 94.326426 75.051280 6.784585 4 batch 1
  6. 5 27.176812 31.487133 56.319612 5 batch 1
  7. 6 94.341248 59.129401 7.002702 6 batch 1
  8. 7 97.219212 89.005288 51.363620 7 batch 1
  9. 8 14.826730 11.091299 72.477603 8 batch 1
  10. 9 14.250449 84.572366 51.806585 9 batch 1
  11. 10 8.731178 50.483787 63.905433 10 batch 1
  12. 11 24.595016 54.690629 29.220187 11 batch 1
  13. 12 66.109872 38.976594 57.952008 12 batch 1
  14. 13 69.709213 28.246633 96.082079 13 batch 1
  15. 14 98.662153 39.431172 38.304419 14 batch 1
  16. 15 35.180087 19.315271 94.495912 15 batch 1
  17. 16 45.465575 31.429093 86.362035 16 batch 1
  18. 17 88.569241 65.789032 15.264510 17 batch 1
  19. 18 41.317364 19.413599 36.570227 18 batch 1
  20. 19 93.470092 61.837675 28.270692 19 batch 1
  21. 20 99.405225 1.825664 46.233582 20 batch 1
  22. 21 98.778535 10.493757 3.473848 21 batch 1
  23. 22 94.813979 15.090318 10.536749 1 batch 2
  24. 23 1.580498 43.355038 47.193968 2 batch 2
  25. 24 8.171572 82.660440 99.404702 3 batch 2
  26. 25 10.791114 40.136418 7.615652 4 batch 2
  27. 26 28.782935 61.110801 15.203424 5 batch 2
  28. 27 26.170606 60.001865 21.975446 6 batch 2
  29. 28 98.073649 4.205692 97.422975 7 batch 2
  30. 29 23.640042 5.384184 18.883684 8 batch 2
  31. 30 18.924822 78.691799 15.707679 9 batch 2
  32. 31 63.542991 83.722546 81.379621 10 batch 2
  33. 32 65.861969 51.918785 14.969885 11 batch 2
  34. 33 59.136869 45.680453 77.921039 12 batch 2
  35. 34 43.916728 96.927358 71.636865 13 batch 2
  36. 35 84.404238 25.997312 27.519646 14 batch 2
  37. 36 1.343036 38.276619 74.979714 15 batch 2
  38. 37 54.025242 68.931825 67.697015 16 batch 2
  39. 38 9.654027 45.233349 87.036974 17 batch 2
  40. 39 90.297132 78.244482 43.231548 18 batch 2
  41. 40 53.228791 49.865308 86.774027 19 batch 2
  42. 41 72.950380 74.636688 45.980257 20 batch 2
  43. 42 15.766594 32.927134 96.377100 21 batch 2
  44. 43 65.280840 33.047331 53.369068 1 batch 3
  45. 44 41.419024 84.291542 1.055891 2 batch 3
  46. 45 64.189356 97.864732 8.435106 3 batch 3
  47. 46 64.782105 82.842300 80.991830 4 batch 3
  48. 47 61.184364 20.249522 13.622524 5 batch 3
  49. 48 20.608730 74.588932 91.931483 6 batch 3
  50. 49 80.901141 28.992681 72.589333 7 batch 3
  51. 50 4.084888 30.209502 63.598710 8 batch 3
  52. 51 95.588275 74.761589 62.743750 9 batch 3
  53. 52 62.271646 9.675652 76.652212 10 batch 3
  54. 53 43.174593 94.037260 42.026702 11 batch 3
  55. 54 53.207426 93.448598 61.360762 12 batch 3
  56. 55 19.355807 35.463982 53.029776 13 batch 3
  57. 56 13.580139 97.384648 80.816749 14 batch 3
  58. 57 32.219313 26.256102 45.672542 15 batch 3
  59. 58 33.894195 73.801500 8.051782 16 batch 3
  60. 59 45.671515 51.298810 7.600933 17 batch 3
  61. 60 24.499980 62.102165 86.877326 18 batch 3
  62. 61 60.893774 30.891038 16.634176 19 batch 3
  63. 62 89.360993 78.245477 29.801789 20 batch 3
  64. 63 57.501857 19.254282 20.347036 21 batch 3
  65. 64 80.004186 17.583524 30.764541 22 batch 3
  66. 65 71.140736 76.311887 30.298813 23 batch 3
  67. 66 72.739952 13.229765 56.278070 24 batch 3
  68. 67 62.314198 79.444292 8.105422 25 batch 3
  69. 68 32.037481 10.490401 92.195441 26 batch 3
  70. 69 66.342709 48.151444 26.442403 27 batch 3
  71. 70 65.165316 9.942519 6.828309 28 batch 3
  72. 71 45.501737 15.926725 93.197676 1 batch 4
  73. 72 48.523332 1.570688 55.853054 2 batch 4
  74. 73 26.508309 52.256063 93.128411 3 batch 4
  75. 74 97.476880 35.398011 89.341347 4 batch 4
  76. 75 17.123732 84.779030 43.547328 5 batch 4
  77. 76 18.453889 59.836859 38.898992 6 batch 4
  78. 77 1.419466 1.403754 75.434308 7 batch 4
  79. 78 21.803886 91.401409 84.694884 8 batch 4
  80. 79 61.964926 93.287858 36.304794 9 batch 4
  81. 80 81.242887 82.993459 3.940457 10 batch 4
  82. 81 57.470248 2.156846 16.255565 11 batch 4
  83. 82 79.895440 46.352555 7.666849 12 batch 4
  84. 83 38.304018 46.883902 28.208153 13 batch 4
  85. 84 92.238729 35.422918 99.706443 14 batch 4
  86. 85 60.493556 71.277974 52.243233 15 batch 4
  87. 86 99.128086 36.706295 60.799051 16 batch 4
  88. 87 29.826266 61.463732 86.887740 17 batch 4
  89. 88 46.486269 86.044449 73.024360 18 batch 4
  90. 89 96.568415 6.379009 58.893206 19 batch 4
  91. 90 92.866674 53.126197 59.812420 20 batch 4

答案2

得分: 0

以下是您要翻译的内容:

新的分组开始,如果其"day"值小于其前导值。这适用于不等大小的分组。然后,您可以应用一种运行长度编码:

  1. library(tidyverse)
  2. set.seed(20)
  3. df <- data.frame(
  4. subject1 = runif(n = 1:10, min = 1, max = 100),
  5. subject2 = runif(n = 1:10, min = 1, max = 100),
  6. subject3 = runif(n = 1:10, min = 1, max = 100),
  7. day = c(1:10, 1:12, 1:8)
  8. )
  9. df <-
  10. df |
  11. as_tibble() |
  12. mutate(batch_flag = (day > lag(day)) |> replace_na(TRUE))
  13. day_rle <- df$batch_flag |> rle()
  14. day_rle$values <-
  15. day_rle$values |>
  16. length() |>
  17. seq()
  18. day_batches <- day_rle |> inverse.rle()
  19. df <-
  20. df |
  21. mutate(
  22. batch = day_batches |>
  23. map_dbl(~ ifelse(.x %% 2 == 0, .x + 1, .x)) |>
  24. map_dbl(~ ceiling(.x / 2))
  25. ) |
  26. select(-batch_flag)
  27. df |> print(n = 30)
  28. #> # A tibble: 30 × 5
  29. #> subject1 subject2 subject3 day batch
  30. #> <dbl> <dbl> <dbl> <int> <dbl>
  31. #> 1 87.9 71.8 49.7 1 1
  32. #> 2 77.1 76.0 4.00 2 1
  33. #> 3 28.6 1.19 44.6 3 1
  34. #> 4 53.4 74.5 8.65 4 1
  35. #> 5 96.3 20.0 27.2 5 1
  36. #> 6 98.1 45.8 7.89 6 1
  37. #> 7 10.0 32.9 90.8 7 1
  38. #> 8 8.00 11.8 99.2 8 1
  39. #> 9 33.4 29.6 7.34 9 1
  40. #> 10 37.6 82.1 67.8 10 1
  41. #> 11 87.9 71.8 49.7 1 2
  42. #> 12 77.1 76.0 4.00 2 2
  43. #> 13 28.6 1.19 44.6 3 2
  44. #> 14 53.4 74.5 8.65 4 2
  45. #> 15 96.3 20.0 27.2 5 2
  46. #> 16 98.1 45.8 7.89 6 2
  47. #> 17 10.0 32.9 90.8 7 2
  48. #> 18 8.00 11.8 99.2 8 2
  49. #> 19 33.4 29.6 7.34 9 2
  50. #> 20 37.6 82.1 67.8 10 2
  51. #> 21 87.9 71.8 49.7 11 2
  52. #> 22 77.1 76.0 4.00 12 2
  53. #> 23 28.6 1.19 44.6 1 3
  54. #> 24 53.4 74.5 8.65 2 3
  55. #> 25 96.3 20.0 27.2 3 3
  56. #> 26 98.1 45.8 7.89 4 3
  57. #> 27 10.0 32.9 90.8 5 3
  58. #> 28 8.00 11.8 99.2 6 3
  59. #> 29 33.4 29.6 7.34 7 3
  60. #> 30 37.6 82.1 67.8 8 3

创建于2023-05-13,使用reprex包(v2.0.1)

英文:

A new group starts, if its day value is smaller than its precursor. This works even for groups of unequal sizes. Then you can apply a run length encoding:

  1. library(tidyverse)
  2. set.seed(20)
  3. df &lt;- data.frame(
  4. subject1 = runif(n = 1:10, min = 1, max = 100),
  5. subject2 = runif(n = 1:10, min = 1, max = 100),
  6. subject3 = runif(n = 1:10, min = 1, max = 100),
  7. day = c(1:10, 1:12, 1:8)
  8. )
  9. df &lt;-
  10. df |&gt;
  11. as_tibble() |&gt;
  12. mutate(batch_flag = (day &gt; lag(day)) |&gt; replace_na(TRUE))
  13. day_rle &lt;- df$batch_flag |&gt; rle()
  14. day_rle$values &lt;-
  15. day_rle$values |&gt;
  16. length() |&gt;
  17. seq()
  18. day_batches &lt;- day_rle |&gt; inverse.rle()
  19. df &lt;-
  20. df |&gt;
  21. mutate(
  22. batch = day_batches |&gt;
  23. map_dbl(~ ifelse(.x %% 2 == 0, .x + 1, .x)) |&gt;
  24. map_dbl(~ ceiling(.x / 2))
  25. ) |&gt;
  26. select(-batch_flag)
  27. df |&gt; print(n = 30)
  28. #&gt; # A tibble: 30 &#215; 5
  29. #&gt; subject1 subject2 subject3 day batch
  30. #&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
  31. #&gt; 1 87.9 71.8 49.7 1 1
  32. #&gt; 2 77.1 76.0 4.00 2 1
  33. #&gt; 3 28.6 1.19 44.6 3 1
  34. #&gt; 4 53.4 74.5 8.65 4 1
  35. #&gt; 5 96.3 20.0 27.2 5 1
  36. #&gt; 6 98.1 45.8 7.89 6 1
  37. #&gt; 7 10.0 32.9 90.8 7 1
  38. #&gt; 8 8.00 11.8 99.2 8 1
  39. #&gt; 9 33.4 29.6 7.34 9 1
  40. #&gt; 10 37.6 82.1 67.8 10 1
  41. #&gt; 11 87.9 71.8 49.7 1 2
  42. #&gt; 12 77.1 76.0 4.00 2 2
  43. #&gt; 13 28.6 1.19 44.6 3 2
  44. #&gt; 14 53.4 74.5 8.65 4 2
  45. #&gt; 15 96.3 20.0 27.2 5 2
  46. #&gt; 16 98.1 45.8 7.89 6 2
  47. #&gt; 17 10.0 32.9 90.8 7 2
  48. #&gt; 18 8.00 11.8 99.2 8 2
  49. #&gt; 19 33.4 29.6 7.34 9 2
  50. #&gt; 20 37.6 82.1 67.8 10 2
  51. #&gt; 21 87.9 71.8 49.7 11 2
  52. #&gt; 22 77.1 76.0 4.00 12 2
  53. #&gt; 23 28.6 1.19 44.6 1 3
  54. #&gt; 24 53.4 74.5 8.65 2 3
  55. #&gt; 25 96.3 20.0 27.2 3 3
  56. #&gt; 26 98.1 45.8 7.89 4 3
  57. #&gt; 27 10.0 32.9 90.8 5 3
  58. #&gt; 28 8.00 11.8 99.2 6 3
  59. #&gt; 29 33.4 29.6 7.34 7 3
  60. #&gt; 30 37.6 82.1 67.8 8 3

<sup>Created on 2023-05-13 by the reprex package (v2.0.1)</sup>

huangapple
  • 本文由 发表于 2023年5月14日 01:58:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76244195.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定