identify sequences of approximately equivalent values in a series using R
Question
I have a series of values that includes runs of values that are close to each other, for example the sequences below. I have labeled the values in V1 with group numbers in V2; the range of the values shifts roughly where the label changes. That is, all the values labeled 1 in V2 are within 20 points of each other, as are all the values labeled 2, all the values labeled 3, and so on. The values are not identical (they are all different); instead, they cluster around a common value.
I identified these clusters manually. How could I automate it?
V1 V2
1 399.710 1
2 403.075 1
3 405.766 1
4 407.112 1
5 408.458 1
6 409.131 1
7 410.477 1
8 411.150 1
9 412.495 1
10 332.419 2
11 330.400 2
12 329.054 2
13 327.708 2
14 326.363 2
15 325.017 2
16 322.998 2
17 319.633 2
18 314.923 2
19 288.680 3
20 285.315 3
21 283.969 3
22 281.950 3
23 279.932 3
24 276.567 3
25 273.875 3
26 272.530 3
27 271.857 3
28 272.530 3
29 273.875 3
30 274.548 3
31 275.894 3
32 275.894 3
33 276.567 3
34 277.240 3
35 278.586 3
36 279.932 3
37 281.950 3
38 284.642 3
39 288.007 3
40 291.371 3
41 294.063 4
42 295.409 4
43 296.754 4
44 297.427 4
45 298.100 4
46 299.446 4
47 300.792 4
48 303.484 4
49 306.848 4
50 327.708 5
51 309.540 6
52 310.213 6
53 309.540 6
54 306.848 6
55 304.156 6
56 302.811 6
57 302.811 6
58 304.156 6
59 305.502 6
60 306.175 6
61 306.175 6
62 304.829 6
I haven't tried anything yet; I don't know how to approach this.
Answer 1
Score: 1
Use dist and hclust with cutree to detect clusters, then renumber them so that a new level starts at every break in the cluster labels.
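# hierarchical clustering on the pairwise distances (complete linkage by default)
hc <- hclust(dist(x))
# cut the tree into k = 6 clusters
cl <- cutree(hc, k=6)
# start a new sequence number wherever the cluster label changes
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)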
# x seq
# 1 399.710 1
# 2 403.075 1
# 3 405.766 1
# 4 407.112 1
# 5 408.458 1
# 6 409.131 1
# 7 410.477 1
# 8 411.150 1
# 9 412.495 1
# 10 332.419 2
# 11 330.400 2
# 12 329.054 2
# 13 327.708 2
# 14 326.363 2
# 15 325.017 2
# 16 322.998 2
# 17 319.633 3
# 18 314.923 3
# 19 288.680 4
# 20 285.315 4
# 21 283.969 4
# 22 281.950 4
# 23 279.932 4
# 24 276.567 5
# 25 273.875 5
# 26 272.530 5
# 27 271.857 5
# 28 272.530 5
# 29 273.875 5
# 30 274.548 5
# 31 275.894 5
# 32 275.894 5
# 33 276.567 5
# 34 277.240 5
# 35 278.586 6
# 36 279.932 6
# 37 281.950 6
# 38 284.642 6
# 39 288.007 6
# 40 291.371 6
# 41 294.063 7
# 42 295.409 7
# 43 296.754 7
# 44 297.427 7
# 45 298.100 7
# 46 299.446 7
# 47 300.792 7
# 48 303.484 7
# 49 306.848 7
# 50 327.708 8
# 51 309.540 9
# 52 310.213 9
# 53 309.540 9
# 54 306.848 9
# 55 304.156 9
# 56 302.811 9
# 57 302.811 9
# 58 304.156 9
# 59 305.502 9
# 60 306.175 9
# 61 306.175 9
# 62 304.829 9
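The output has 9 levels even though k=6, because the same cluster label can occur in more than one separate run, and every run gets its own number. A minimal sketch of that relabeling step, using made-up labels rather than the real cutree output, with an rle-based formulation shown as an equivalent alternative (not part of the original answer):

```r
# Toy cluster labels (made up for illustration): label 1 occurs in two separate runs.
cl_toy <- c(1, 1, 2, 2, 1, 1, 3)

# TRUE wherever the label differs from the previous element.
is_break <- c(0, diff(cl_toy)) != 0

# Counting the breaks seen so far numbers the runs consecutively.
run_id <- cumsum(is_break) + 1
print(run_id)

# Equivalent formulation with rle(), which encodes the runs directly.
r <- rle(cl_toy)
run_id_rle <- rep(seq_along(r$lengths), times = r$lengths)
print(run_id_rle)
```

Here three distinct labels produce four runs, which is the same effect that turns 6 clusters into 9 sequences above.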
However, the dendrogram suggests k=4 clusters rather than 6, although the choice is somewhat arbitrary.
plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
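The dashed lines mark heights at which the tree could be cut. Instead of fixing the number of clusters with k, cutree also accepts a cut height through its h argument. A small sketch on made-up data (the vector y and the heights 5 and 45 are assumptions chosen for illustration, not values from the question):

```r
# Toy data (made up): three well-separated groups.
y <- c(10, 11, 12, 50, 51, 52, 100, 101)
hc_y <- hclust(dist(y))  # complete linkage by default

# Cut at a height instead of asking for a fixed number of clusters.
cl_low  <- cutree(hc_y, h = 5)   # below the first between-group merge
cl_high <- cutree(hc_y, h = 45)  # above the merge of the two lower groups

print(table(cl_low))
print(table(cl_high))
```

A lower cut height yields more, tighter clusters; a higher one merges neighboring groups.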
Data:
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
304.156, 305.502, 306.175, 306.175, 304.829)
Answer 2
Score: 0
This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.
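# dat is the question's data frame, with the values in V1 and the manual labels in V2
maxrange <- 18
grp_start <- 1
grp_num <- 1
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
  # all values in the current group, including the candidate at position i
  grp <- dat$V1[grp_start:i]
  if (max(grp) - min(grp) > maxrange) {
    # range exceeded: value i starts a new group
    grp_num <- grp_num + 1
    grp_start <- i
  }
  V3[[i]] <- grp_num
}
cbind(dat, V3)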
V1 V2 V3
1 399.710 1 1
2 403.075 1 1
3 405.766 1 1
4 407.112 1 1
5 408.458 1 1
6 409.131 1 1
7 410.477 1 1
8 411.150 1 1
9 412.495 1 1
10 332.419 2 2
11 330.400 2 2
12 329.054 2 2
13 327.708 2 2
14 326.363 2 2
15 325.017 2 2
16 322.998 2 2
17 319.633 2 2
18 314.923 2 2
19 288.680 3 3
20 285.315 3 3
21 283.969 3 3
22 281.950 3 3
23 279.932 3 3
24 276.567 3 3
25 273.875 3 3
26 272.530 3 3
27 271.857 3 3
28 272.530 3 3
29 273.875 3 3
30 274.548 3 3
31 275.894 3 3
32 275.894 3 3
33 276.567 3 3
34 277.240 3 3
35 278.586 3 3
36 279.932 3 3
37 281.950 3 3
38 284.642 3 3
39 288.007 3 3
40 291.371 3 4
41 294.063 4 4
42 295.409 4 4
43 296.754 4 4
44 297.427 4 4
45 298.100 4 4
46 299.446 4 4
47 300.792 4 4
48 303.484 4 4
49 306.848 4 4
50 327.708 5 5
51 309.540 6 6
52 310.213 6 6
53 309.540 6 6
54 306.848 6 6
55 304.156 6 6
56 302.811 6 6
57 302.811 6 6
58 304.156 6 6
59 305.502 6 6
60 306.175 6 6
61 306.175 6 6
62 304.829 6 6
A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
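To make it easier to try different thresholds, the loop can be wrapped in a function. The sketch below is one possible way to do that: group_by_range is a hypothetical helper name, and it tracks a running minimum and maximum instead of re-slicing the group at every step, which should give the same grouping as the loop above.

```r
# Hypothetical helper: start a new group whenever adding the next value
# would make the group's range (max - min) exceed maxrange.
group_by_range <- function(x, maxrange) {
  grp <- integer(length(x))
  if (length(x) == 0) return(grp)
  grp_num <- 1L
  lo <- hi <- x[1]
  for (i in seq_along(x)) {
    lo_new <- min(lo, x[i])
    hi_new <- max(hi, x[i])
    if (hi_new - lo_new > maxrange) {
      # range exceeded: value i starts a new group
      grp_num <- grp_num + 1L
      lo <- hi <- x[i]
    } else {
      lo <- lo_new
      hi <- hi_new
    }
    grp[i] <- grp_num
  }
  grp
}

# Toy input (made up for illustration)
vals <- c(100, 105, 110, 150, 152, 149, 200)
print(group_by_range(vals, maxrange = 18))
```

With the question's data this could be called as group_by_range(dat$V1, 18) and compared against V2 for several thresholds.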