Changing cluster labels for comparison purposes

Question


I need help relabeling the output of two clustering procedures so that they can be compared more directly.

Suppose clustering procedure A returns the following vector (one cluster label per individual):

clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)

and clustering algorithm B returns the following vector:

clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)

As you can see, the two algorithms returned the same clustering, but this is not easy to spot when you have hundreds of observations.

Can you help me develop an automatic function (or a piece of code written in a general way) that changes the cluster labels of one or both vectors so that they use the same labels?

My main purpose is not to compare the two clusterings. I need code that does what I described, so please don't answer by just saying that I can compare them with a plot or a contingency table.

Thanks in advance!

Answer 1

Score: 1


You could compare them with a plot or a contingency table.

Alternatively, relabel each vector by order of first appearance, like so:

relabel <- \(xs) {
  xs <- as.character(xs)
  # Distinct labels, in order of first appearance
  xs_uniq <- unique(xs)
  # Lookup table from old label to letter (LETTERS covers at most 26 clusters)
  hash <- setNames(LETTERS[seq_along(xs_uniq)], xs_uniq)
  as.character(hash[xs])
}
## > relabel(clust1)
## [1] "A" "A" "A" "A" "B" "C" "C" "A" "A" "C" "B" "C" "C"
## > identical(relabel(clust1), relabel(clust2))
## [1] TRUE
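
If more than 26 clusters are possible, the same first-appearance idea can be sketched with integer labels instead of letters. This is only a variation on relabel() above, not part of the original answer; the names canonical_labels() and same_partition() are made up for illustration.

```r
# Hypothetical helper: relabel clusters as 1, 2, 3, ... in order of first appearance.
canonical_labels <- function(xs) {
  match(xs, unique(xs))
}

# Hypothetical helper: TRUE when two label vectors describe the same grouping.
same_partition <- function(a, b) {
  identical(canonical_labels(a), canonical_labels(b))
}

clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)

canonical_labels(clust1)
#> [1] 1 1 1 1 2 3 3 1 1 3 2 3 3
same_partition(clust1, clust2)
#> [1] TRUE
```

Because both vectors are mapped to the same canonical form, neither has to be treated as the reference.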

Answer 2

Score: 1

clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)

Here is a solution that works as long as both solutions have the same number of clusters.
It uses factor() to apply the labels of clust1 to clust2.

# Map the i-th distinct label of clust2 to the i-th distinct label of clust1
# (labels are paired by order of first appearance)
clust2_re <- 
  factor(clust2,
         levels = unique(clust2),
         labels = unique(clust1)) |> 
  as.character() |> 
  as.numeric()

clust2_re
#>  [1] 1 1 1 1 3 2 2 1 1 2 3 2 2

all(clust1 == clust2_re)
#> [1] TRUE
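
The factor() approach pairs labels by order of first appearance, which fits when both vectors describe the same grouping. When the two results differ slightly, one possible alternative is to give each label of the second vector the reference label it overlaps with most. The sketch below is an assumption-laden illustration, not part of the original answer: match_labels() and clust2_noisy are made-up names, numeric labels are assumed, and two labels may end up mapped to the same reference label.

```r
# Hypothetical helper: relabel x using the labels of ref, choosing for each
# label of x the ref label it co-occurs with most often.
# Assumes numeric labels; the mapping is not guaranteed to be one-to-one.
match_labels <- function(x, ref) {
  tab <- table(x, ref)                            # overlap counts
  best <- colnames(tab)[apply(tab, 1, which.max)] # best ref label per x label
  mapping <- setNames(best, rownames(tab))
  as.numeric(mapping[as.character(x)])
}

clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)

all(match_labels(clust2, clust1) == clust1)
#> [1] TRUE

# Same as clust2 except for the last observation
clust2_noisy <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 3)
which(match_labels(clust2_noisy, clust1) != clust1)
#> [1] 13
```

After relabeling, the positions where the two results disagree can be read off directly.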

Furthermore, igraph has a compare() function that returns the distance between two clustering results; it also works when the cluster labels differ.
Let's add a third clustering that differs from clust2 only in the last value:

clust3 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 3)

When two clustering solutions are the same, compare() returns 0:

library(igraph)
compare(clust1, clust2)
#> [1] 0

Whenever they differ, the result is greater than 0:

compare(clust1, clust3)
#> [1] 0.4132943
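
For context, compare() uses the variation of information ("vi") as its default method. The base-R sketch below is not igraph's implementation, only a hand-written version of that distance using natural logarithms, with a made-up function name; it gives the same values as the output above for these vectors.

```r
# Hypothetical helper: variation of information between two label vectors,
# computed as 2 * H(a, b) - H(a) - H(b) with natural logarithms.
variation_of_information <- function(a, b) {
  entropy <- function(counts) {
    p <- counts[counts > 0] / sum(counts)  # drop empty cells, normalize
    -sum(p * log(p))
  }
  h_a  <- entropy(table(a))     # entropy of the first labeling
  h_b  <- entropy(table(b))     # entropy of the second labeling
  h_ab <- entropy(table(a, b))  # joint entropy
  2 * h_ab - h_a - h_b
}

clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)
clust3 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 3)

round(variation_of_information(clust1, clust2), 7)
#> [1] 0
round(variation_of_information(clust1, clust3), 7)
#> [1] 0.4132943
```

The distance depends only on how observations are grouped, not on the label values, which is why it is 0 for clust1 and clust2.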

huangapple
  • Published on June 5, 2023 at 23:25:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76407898.html