Identify sequences of approximately equivalent values in a series using R

Question

I have a series of values containing runs of consecutive values that are close to each other, as in the data below. I have labeled the runs in V1 with distinct values in V2; the level of the values shifts roughly where the label changes. All the values labeled 1 in V2 are within 20 points of each other, as are all the values labeled 2, all the values labeled 3, and so on. The values within a run are not identical (they are all different); instead, they cluster around a common value.

I identified these clusters manually. How could I automate it?

        V1 V2
1  399.710  1
2  403.075  1
3  405.766  1
4  407.112  1
5  408.458  1
6  409.131  1
7  410.477  1
8  411.150  1
9  412.495  1
10 332.419  2
11 330.400  2
12 329.054  2
13 327.708  2
14 326.363  2
15 325.017  2
16 322.998  2
17 319.633  2
18 314.923  2
19 288.680  3
20 285.315  3
21 283.969  3
22 281.950  3
23 279.932  3
24 276.567  3
25 273.875  3
26 272.530  3
27 271.857  3
28 272.530  3
29 273.875  3
30 274.548  3
31 275.894  3
32 275.894  3
33 276.567  3
34 277.240  3
35 278.586  3
36 279.932  3
37 281.950  3
38 284.642  3
39 288.007  3
40 291.371  3
41 294.063  4
42 295.409  4
43 296.754  4
44 297.427  4
45 298.100  4
46 299.446  4
47 300.792  4
48 303.484  4
49 306.848  4
50 327.708  5
51 309.540  6
52 310.213  6
53 309.540  6
54 306.848  6
55 304.156  6
56 302.811  6
57 302.811  6
58 304.156  6
59 305.502  6
60 306.175  6
61 306.175  6
62 304.829  6

I haven't tried anything yet because I don't know how to approach this.

Answer 1

Score: 1

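Use dist and hclust with cutree to detect clusters, then relabel the result so that a new level starts at every break, i.e., wherever the cluster membership changes between consecutive values.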

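# Hierarchical clustering (complete linkage by default) on pairwise distances
hc <- hclust(dist(x))
# Cut the tree into k = 6 clusters
cl <- cutree(hc, k=6)
# Start a new sequence id wherever the cluster label changes
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)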
#          x seq
# 1  399.710   1
# 2  403.075   1
# 3  405.766   1
# 4  407.112   1
# 5  408.458   1
# 6  409.131   1
# 7  410.477   1
# 8  411.150   1
# 9  412.495   1
# 10 332.419   2
# 11 330.400   2
# 12 329.054   2
# 13 327.708   2
# 14 326.363   2
# 15 325.017   2
# 16 322.998   2
# 17 319.633   3
# 18 314.923   3
# 19 288.680   4
# 20 285.315   4
# 21 283.969   4
# 22 281.950   4
# 23 279.932   4
# 24 276.567   5
# 25 273.875   5
# 26 272.530   5
# 27 271.857   5
# 28 272.530   5
# 29 273.875   5
# 30 274.548   5
# 31 275.894   5
# 32 275.894   5
# 33 276.567   5
# 34 277.240   5
# 35 278.586   6
# 36 279.932   6
# 37 281.950   6
# 38 284.642   6
# 39 288.007   6
# 40 291.371   6
# 41 294.063   7
# 42 295.409   7
# 43 296.754   7
# 44 297.427   7
# 45 298.100   7
# 46 299.446   7
# 47 300.792   7
# 48 303.484   7
# 49 306.848   7
# 50 327.708   8
# 51 309.540   9
# 52 310.213   9
# 53 309.540   9
# 54 306.848   9
# 55 304.156   9
# 56 302.811   9
# 57 302.811   9
# 58 304.156   9
# 59 305.502   9
# 60 306.175   9
# 61 306.175   9
# 62 304.829   9
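
Note that the output has 9 sequences even though k=6, because the same cluster can appear in several separate runs. The relabelling step is easier to see on a toy vector. The following is only an illustrative sketch: the vector cl_toy is made up for the example, and the rle version is an alternative formulation, not part of the original answer.

```r
# Hypothetical cluster labels: cluster 1 occurs in two separate runs
cl_toy <- c(1, 1, 2, 2, 1, 3, 3)

# TRUE wherever the label differs from the previous one; the first
# element is never a break because a 0 is prepended to the differences
is_break <- c(0, diff(cl_toy)) != 0

# Cumulative count of breaks, shifted so that numbering starts at 1
seq_id <- cumsum(is_break) + 1
seq_id
# 1 1 2 2 3 4 4

# The same ids obtained with run-length encoding
seq_rle <- with(rle(cl_toy), rep(seq_along(lengths), lengths))
```

Three distinct clusters therefore give four sequence ids here, since the second run of cluster 1 receives its own id.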

The dendrogram, however, suggests k=4 clusters rather than 6, although the choice is somewhat subjective.

plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)

Data:

x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477, 
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017, 
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95, 
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875, 
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932, 
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754, 
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708, 
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811, 
304.156, 305.502, 306.175, 306.175, 304.829)
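
Since the question states its criterion as a spread ("within 20 points of each other") rather than as a number of clusters, one option is to cut the tree by height instead of by k. This is a sketch and not part of the original answer. It relies on hclust using complete linkage by default, where the merge height equals the largest pairwise distance within the merged cluster. The cutoff h = 20 is an assumption taken from the wording of the question.

```r
# Data from the question
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

hc <- hclust(dist(x))       # complete linkage is the default method
cl <- cutree(hc, h = 20)    # cut by height instead of by number of clusters

# Largest spread (max - min) inside any cluster; at most 20 by construction
max_spread <- max(tapply(x, cl, function(v) diff(range(v))))

# Give each consecutive run its own id, as in the answer above
run_id <- cumsum(c(0, diff(cl)) != 0) + 1
data.frame(x, cluster = cl, seq = run_id)
```

As with k, the clusters are formed from the values alone and ignore position in the series, so a cluster can still be split into several runs.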

Answer 2

Score: 0

This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.

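maxrange <- 18   # maximum allowed spread (max - min) within a group

grp_start <- 1   # index where the current group begins
grp_num <- 1     # label of the current group
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
  grp <- dat$V1[grp_start:i]
  # If value i pushes the group's range past the threshold,
  # start a new group at i
  if (max(grp) - min(grp) > maxrange) {
    grp_num <- grp_num + 1
    grp_start <- i
  }
  V3[i] <- grp_num
}

cbind(dat, V3)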
        V1 V2 V3
1  399.710  1  1
2  403.075  1  1
3  405.766  1  1
4  407.112  1  1
5  408.458  1  1
6  409.131  1  1
7  410.477  1  1
8  411.150  1  1
9  412.495  1  1
10 332.419  2  2
11 330.400  2  2
12 329.054  2  2
13 327.708  2  2
14 326.363  2  2
15 325.017  2  2
16 322.998  2  2
17 319.633  2  2
18 314.923  2  2
19 288.680  3  3
20 285.315  3  3
21 283.969  3  3
22 281.950  3  3
23 279.932  3  3
24 276.567  3  3
25 273.875  3  3
26 272.530  3  3
27 271.857  3  3
28 272.530  3  3
29 273.875  3  3
30 274.548  3  3
31 275.894  3  3
32 275.894  3  3
33 276.567  3  3
34 277.240  3  3
35 278.586  3  3
36 279.932  3  3
37 281.950  3  3
38 284.642  3  3
39 288.007  3  3
40 291.371  3  4
41 294.063  4  4
42 295.409  4  4
43 296.754  4  4
44 297.427  4  4
45 298.100  4  4
46 299.446  4  4
47 300.792  4  4
48 303.484  4  4
49 306.848  4  4
50 327.708  5  5
51 309.540  6  6
52 310.213  6  6
53 309.540  6  6
54 306.848  6  6
55 304.156  6  6
56 302.811  6  6
57 302.811  6  6
58 304.156  6  6
59 305.502  6  6
60 306.175  6  6
61 306.175  6  6
62 304.829  6  6

A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
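
To compare thresholds more easily, the loop can be wrapped in a function. This is a sketch only: the name group_by_range is hypothetical, and tracking a running minimum and maximum instead of re-slicing the vector is a small variation on the loop above that should assign the same groups.

```r
# Data from the question (the V1 column)
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

# Hypothetical helper: label consecutive runs whose range stays <= maxrange
group_by_range <- function(v, maxrange) {
  grp <- integer(length(v))
  if (length(v) == 0) return(grp)
  grp_num <- 1L
  lo <- hi <- v[1]                 # running min and max of the current group
  for (i in seq_along(v)) {
    lo_new <- min(lo, v[i])
    hi_new <- max(hi, v[i])
    if (hi_new - lo_new > maxrange) {
      grp_num <- grp_num + 1L      # value i starts a new group
      lo <- hi <- v[i]
    } else {
      lo <- lo_new
      hi <- hi_new
    }
    grp[i] <- grp_num
  }
  grp
}

# Group sizes for two thresholds
len18 <- rle(group_by_range(x, 18))$lengths
len20 <- rle(group_by_range(x, 20))$lengths
```

Comparing len18 and len20 shows where the group boundaries move as the threshold changes.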

huangapple
  • Posted on February 19, 2023 at 03:04:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75495724.html