Interpretation of cosine similarity and Jaccard similarity (similarity of histograms)

Introduction

I would like to assess the similarity between two "bin counts" arrays (corresponding to two histograms) using the MATLAB pdist2 function:

% Input
bin_counts_a = [689   430   311   135    66    67    99    23    37    19     8     4     3     4     1     3     1     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1];
bin_counts_b = [569   402   200   166   262    90    50    16    33    12     6    35    49     4    12     8     8     2     1     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     1];

% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])

[Figure: bar chart of bin_counts_a and bin_counts_b]

% Calculation of similarities
cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')

% Output
cosine_similarity =

          0.95473215802008


jaccard_similarity =

        0.0769230769230769

Question

If the cosine similarity is close to 1, meaning the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?

Answer 1

Score: 3

The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.
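
To make that definition concrete, here is a minimal sketch that recomputes the value by hand for the two vectors in the question. It assumes the 'jaccard' distance in pdist2 is the number of coordinates that differ divided by the number of coordinates where at least one vector is nonzero, which is how the documentation sentence quoted above reads. The names a, b, nz and n_differ are illustrative only.

```matlab
% Bin counts copied from the question
a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];

% Coordinates where at least one of the two vectors is nonzero
nz = (a ~= 0) | (b ~= 0);

% Among those coordinates, count the ones whose values are not exactly equal
n_differ = sum(a(nz) ~= b(nz));

% Assumed definition: fraction of the nonzero coordinates that differ
jaccard_distance   = n_differ / sum(nz);
jaccard_similarity = 1 - jaccard_distance;

fprintf('%d of %d nonzero coordinates differ, similarity = %.4f\n', ...
        n_differ, sum(nz), jaccard_similarity);
```

For these vectors, 26 coordinates have at least one nonzero entry and only two of them (positions 14 and 50) hold exactly equal counts, so the similarity is 2/26 ≈ 0.0769, which matches the pdist2 output in the question.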

For instance, assume bin_counts_a as in your example and

bin_counts_b = bin_counts_a + 1;

Then

>> cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
cosine_similarity =
   0.999971577948095

is almost 1 as expected, because the bin counts are very similar. However,

>> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
jaccard_similarity =
     0

gives 0 because each entry in bin_counts_b is (slightly) different from that in bin_counts_a.
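
The same contrast can be checked without pdist2. The sketch below assumes the 'cosine' distance is one minus the normalized dot product of the two vectors (the usual definition) and that 'jaccard' counts exact mismatches among nonzero coordinates, as described above. The variable names are illustrative.

```matlab
% Bin counts from the question, and the shifted copy used in this answer
a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
b = a + 1;

% Cosine similarity: dot product divided by the product of the Euclidean norms
cosine_similarity = dot(a, b) / (norm(a) * norm(b));

% Jaccard-style similarity: share of nonzero coordinates that match exactly
nz = (a ~= 0) | (b ~= 0);
jaccard_similarity = 1 - sum(a(nz) ~= b(nz)) / sum(nz);

fprintf('cosine = %.6f, jaccard = %.6f\n', cosine_similarity, jaccard_similarity);
```

Adding 1 to every bin barely changes the direction of the vector, so the cosine value stays near 1, while no coordinate remains exactly equal, so the match-based measure drops to 0.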

For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions, and is not computed by pdist2.
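
Since pdist2 does not provide the Kullback-Leibler divergence, it has to be computed by hand. The following is only a rough sketch under stated assumptions: the bin counts are hypothetical, and the small constant epsilon added to every bin is one possible way (not the only one) to avoid taking the logarithm of zero or dividing by zero when a bin is empty.

```matlab
% Hypothetical bin counts, chosen only for illustration
counts_p = [5 3 2 0];
counts_q = [4 4 1 1];

% Assumption: smooth with a small constant so empty bins stay strictly positive
epsilon = 1e-10;
p = (counts_p + epsilon) / sum(counts_p + epsilon);   % normalized to sum to 1
q = (counts_q + epsilon) / sum(counts_q + epsilon);

% Kullback-Leibler divergence in both directions (natural logarithm)
kl_pq = sum(p .* log(p ./ q));
kl_qp = sum(q .* log(q ./ p));

fprintf('KL(p||q) = %.4f, KL(q||p) = %.4f\n', kl_pq, kl_qp);
```

The two directions give different values, which illustrates the asymmetry mentioned above. When a bin is empty in only one histogram, the result also depends strongly on the chosen epsilon, so the smoothing constant should be treated as a modelling choice.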

huangapple
  • Posted on June 26, 2023, 21:56:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76557362.html