Identify sequences of approximately equivalent values in a series using R

Question

I have a series of values that contains runs of values close to one another, as in the data below. The values are in V1, and V2 holds the group labels I assigned by hand, roughly at the points where the range of the values shifts. All values labeled 1 in V2 are within 20 points of each other, as are all values labeled 2, all values labeled 3, and so on. The values within a group are not identical; instead, they cluster around a common value.

I identified these clusters manually. How can I automate this?

            V1 V2
    1  399.710  1
    2  403.075  1
    3  405.766  1
    4  407.112  1
    5  408.458  1
    6  409.131  1
    7  410.477  1
    8  411.150  1
    9  412.495  1
    10 332.419  2
    11 330.400  2
    12 329.054  2
    13 327.708  2
    14 326.363  2
    15 325.017  2
    16 322.998  2
    17 319.633  2
    18 314.923  2
    19 288.680  3
    20 285.315  3
    21 283.969  3
    22 281.950  3
    23 279.932  3
    24 276.567  3
    25 273.875  3
    26 272.530  3
    27 271.857  3
    28 272.530  3
    29 273.875  3
    30 274.548  3
    31 275.894  3
    32 275.894  3
    33 276.567  3
    34 277.240  3
    35 278.586  3
    36 279.932  3
    37 281.950  3
    38 284.642  3
    39 288.007  3
    40 291.371  3
    41 294.063  4
    42 295.409  4
    43 296.754  4
    44 297.427  4
    45 298.100  4
    46 299.446  4
    47 300.792  4
    48 303.484  4
    49 306.848  4
    50 327.708  5
    51 309.540  6
    52 310.213  6
    53 309.540  6
    54 306.848  6
    55 304.156  6
    56 302.811  6
    57 302.811  6
    58 304.156  6
    59 305.502  6
    60 306.175  6
    61 306.175  6
    62 304.829  6

I haven't tried anything yet because I don't know how to approach this.

Answer 1

Score: 1

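Use dist and hclust with cutree to detect clusters by value, then renumber the result so that a new level starts at every break in the series, that is, wherever the cluster label changes from one row to the next.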

    # x is the numeric vector defined under "Data" below
    hc <- hclust(dist(x))    # hierarchical clustering on pairwise distances
    cl <- cutree(hc, k=6)    # cut the tree into 6 value-based clusters
    # start a new sequence number wherever the cluster label changes
    data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)
    #          x seq
    # 1  399.710   1
    # 2  403.075   1
    # 3  405.766   1
    # 4  407.112   1
    # 5  408.458   1
    # 6  409.131   1
    # 7  410.477   1
    # 8  411.150   1
    # 9  412.495   1
    # 10 332.419   2
    # 11 330.400   2
    # 12 329.054   2
    # 13 327.708   2
    # 14 326.363   2
    # 15 325.017   2
    # 16 322.998   2
    # 17 319.633   3
    # 18 314.923   3
    # 19 288.680   4
    # 20 285.315   4
    # 21 283.969   4
    # 22 281.950   4
    # 23 279.932   4
    # 24 276.567   5
    # 25 273.875   5
    # 26 272.530   5
    # 27 271.857   5
    # 28 272.530   5
    # 29 273.875   5
    # 30 274.548   5
    # 31 275.894   5
    # 32 275.894   5
    # 33 276.567   5
    # 34 277.240   5
    # 35 278.586   6
    # 36 279.932   6
    # 37 281.950   6
    # 38 284.642   6
    # 39 288.007   6
    # 40 291.371   6
    # 41 294.063   7
    # 42 295.409   7
    # 43 296.754   7
    # 44 297.427   7
    # 45 298.100   7
    # 46 299.446   7
    # 47 300.792   7
    # 48 303.484   7
    # 49 306.848   7
    # 50 327.708   8
    # 51 309.540   9
    # 52 310.213   9
    # 53 309.540   9
    # 54 306.848   9
    # 55 304.156   9
    # 56 302.811   9
    # 57 302.811   9
    # 58 304.156   9
    # 59 305.502   9
    # 60 306.175   9
    # 61 306.175   9
    # 62 304.829   9
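
The renumbering step can be looked at in isolation. The sketch below uses a small made-up label vector (not the data from the question) to show how the cumsum expression turns repeated cluster labels into consecutive run numbers, and that base R's rle gives the same result:

```r
# Hypothetical cluster labels; label 1 reappears after label 2.
cl <- c(1, 1, 2, 2, 1, 1, 3)

# Approach used in the answer: count the positions where the label changes.
runs_cumsum <- cumsum(c(0, diff(cl)) != 0) + 1

# Equivalent approach with run-length encoding: number the runs,
# then repeat each number by the length of its run.
r <- rle(cl)
runs_rle <- rep(seq_along(r$lengths), times = r$lengths)

print(runs_cumsum)
print(runs_rle)
```

Both vectors number the second block of 1s as its own run, which is why the output above has more sequence numbers than the k=6 clusters requested from cutree.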

However, the dendrogram suggests k=4 clusters rather than 6, although where to cut the tree is a subjective choice.

    plot(hc)
    abline(h=30, lty=2, col=2)
    abline(h=18.5, lty=2, col=3)
    abline(h=14, lty=2, col=4)
    legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)

Data:

    x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
           411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
           322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
           279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
           274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
           281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
           297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
           309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
           304.156, 305.502, 306.175, 306.175, 304.829)
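
Since the number of clusters is a judgment call, one possible variation is to cut the tree by height instead of by k. This is a sketch rather than part of the original answer: it assumes complete linkage (the hclust default, set explicitly here) and takes the cut height of 20 from the "within 20 points" tolerance in the question. With complete linkage, the height of a merge is the largest pairwise distance in the merged cluster, so no cluster produced by the cut spans more than 20 points.

```r
# Data vector copied from the answer above.
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

# Complete linkage, stated explicitly because the spread guarantee depends on it.
hc <- hclust(dist(x), method = "complete")

# Cut by height rather than by a fixed number of clusters.
# 20 is an assumed tolerance taken from the question's wording.
cl <- cutree(hc, h = 20)

# Spread (max - min) of the values in each value-based cluster.
spread <- tapply(x, cl, function(v) diff(range(v)))
print(spread)

# Renumber so that each contiguous run in the series gets its own label.
runs <- cumsum(c(0, diff(cl)) != 0) + 1
print(head(data.frame(x, cl, runs), 12))
```

The clustering itself ignores the order of the series, so a value such as row 50 (327.708) can share a value-based cluster with rows far away from it; the final renumbering step is what splits such cases into separate runs.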

Answer 2

Score: 0

This solution iterates over the values in order, checks the range (maximum minus minimum) of all values in the current group up to that point, and starts a new group whenever that range exceeds a threshold.

    # dat is the data frame from the question, with the values in V1
    maxrange <- 18    # largest allowed spread within a group
    grp_start <- 1    # index where the current group begins
    grp_num <- 1
    V3 <- numeric(length(dat$V1))
    for (i in seq_along(dat$V1)) {
      grp <- dat$V1[grp_start:i]
      # if adding value i makes the spread too large, start a new group at i
      if (max(grp) - min(grp) > maxrange) {
        grp_num <- grp_num + 1
        grp_start <- i
      }
      V3[i] <- grp_num
    }
    cbind(dat, V3)

            V1 V2 V3
    1  399.710  1  1
    2  403.075  1  1
    3  405.766  1  1
    4  407.112  1  1
    5  408.458  1  1
    6  409.131  1  1
    7  410.477  1  1
    8  411.150  1  1
    9  412.495  1  1
    10 332.419  2  2
    11 330.400  2  2
    12 329.054  2  2
    13 327.708  2  2
    14 326.363  2  2
    15 325.017  2  2
    16 322.998  2  2
    17 319.633  2  2
    18 314.923  2  2
    19 288.680  3  3
    20 285.315  3  3
    21 283.969  3  3
    22 281.950  3  3
    23 279.932  3  3
    24 276.567  3  3
    25 273.875  3  3
    26 272.530  3  3
    27 271.857  3  3
    28 272.530  3  3
    29 273.875  3  3
    30 274.548  3  3
    31 275.894  3  3
    32 275.894  3  3
    33 276.567  3  3
    34 277.240  3  3
    35 278.586  3  3
    36 279.932  3  3
    37 281.950  3  3
    38 284.642  3  3
    39 288.007  3  3
    40 291.371  3  4
    41 294.063  4  4
    42 295.409  4  4
    43 296.754  4  4
    44 297.427  4  4
    45 298.100  4  4
    46 299.446  4  4
    47 300.792  4  4
    48 303.484  4  4
    49 306.848  4  4
    50 327.708  5  5
    51 309.540  6  6
    52 310.213  6  6
    53 309.540  6  6
    54 306.848  6  6
    55 304.156  6  6
    56 302.811  6  6
    57 302.811  6  6
    58 304.156  6  6
    59 305.502  6  6
    60 306.175  6  6
    61 306.175  6  6
    62 304.829  6  6

A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
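
To make it easier to try different thresholds, the loop can be wrapped in a function. The following is a minimal sketch, not the answerer's code: the name group_by_range is hypothetical, and it keeps a running minimum and maximum for the current group instead of re-slicing the vector on every iteration, which should give the same grouping.

```r
# Hypothetical helper: greedy grouping by maximum allowed spread.
# A new group starts at the first value that would push the spread
# (max - min) of the current group above maxrange.
group_by_range <- function(values, maxrange) {
  groups <- integer(length(values))
  grp_num <- 1L
  lo <- hi <- values[1]
  for (i in seq_along(values)) {
    new_lo <- min(lo, values[i])
    new_hi <- max(hi, values[i])
    if (new_hi - new_lo > maxrange) {
      # Spread exceeded: value i opens a new group.
      grp_num <- grp_num + 1L
      lo <- hi <- values[i]
    } else {
      lo <- new_lo
      hi <- new_hi
    }
    groups[i] <- grp_num
  }
  groups
}

# Values from the question (column V1).
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

groups <- group_by_range(x, maxrange = 18)
print(table(groups))  # number of rows in each group
```

Calling group_by_range(x, maxrange = 20) instead is a quick way to see the effect described above, where a higher threshold delays the start of the last group.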

huangapple
  • Posted on February 19, 2023 at 03:04:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75495724.html