Interpretation of cosine similarity and Jaccard similarity (similarity of histograms)


Introduction

I would like to assess the similarity between two "bin counts" arrays (related to two histograms) using the MATLAB pdist2 function:

  % Input
  bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
  bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];
  % Visualize the two "bin counts" vectors as bars:
  bar(1:length(bin_counts_a), [bin_counts_a; bin_counts_b])

[Bar chart comparing bin_counts_a and bin_counts_b]

  % Calculation of similarities
  cosine_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'cosine')
  jaccard_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'jaccard')

  % Output
  cosine_similarity =
      0.95473215802008
  jaccard_similarity =
      0.0769230769230769

Question

If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?

Answer 1

Score: 3

The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.
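To make that rule concrete, here is a minimal sketch (in Python/NumPy rather than MATLAB; `jaccard_distance` is a hypothetical helper that follows the documented definition, not pdist2's actual implementation):

```python
import numpy as np

def jaccard_distance(x, y):
    """Fraction of the coordinates where at least one vector is
    nonzero in which the two vectors disagree (the documented
    meaning of pdist2's 'jaccard' distance)."""
    x, y = np.asarray(x), np.asarray(y)
    either_nonzero = (x != 0) | (y != 0)
    differ = (x != y) & either_nonzero
    return differ.sum() / either_nonzero.sum()

# Four coordinates are nonzero in at least one vector;
# two of them disagree, so the distance is 2/4 = 0.5.
x = [3, 0, 5, 1]
y = [3, 2, 5, 4]
print(jaccard_distance(x, y))  # 0.5
```

Note that the mismatches 1 vs 4 and 0 vs 2 count exactly as much as any other mismatch would, no matter how large or small the difference is.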

For instance, assume bin_counts_a as in your example and

  bin_counts_b = bin_counts_a + 1;

Then

  >> cosine_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'cosine')
  cosine_similarity =
      0.999971577948095

is almost 1 as expected, because the bin counts are very similar. However,

  >> jaccard_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'jaccard')
  jaccard_similarity =
      0

gives 0, because every entry in bin_counts_b is (slightly) different from the corresponding entry in bin_counts_a.
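The same experiment can be sketched outside MATLAB; the following Python/NumPy snippet applies the standard cosine formula and the documented 'jaccard' rule directly (an illustration of the definitions, not of pdist2 itself):

```python
import numpy as np

bin_counts_a = np.array([689, 430, 311, 135, 66, 67, 99, 23, 37, 19,
                         8, 4, 3, 4, 1, 3, 1, 0, 0, 0, 0, 0, 1, 0, 0,
                         0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                         0, 0, 0, 0, 1, 0, 0, 0, 0, 1])
bin_counts_b = bin_counts_a + 1

# Cosine similarity: dot product over the product of the norms.
cos_sim = bin_counts_a @ bin_counts_b / (
    np.linalg.norm(bin_counts_a) * np.linalg.norm(bin_counts_b))

# Jaccard similarity: every one of the 50 coordinates is nonzero in
# bin_counts_b and differs from bin_counts_a, so the similarity is 0.
either_nonzero = (bin_counts_a != 0) | (bin_counts_b != 0)
differ = (bin_counts_a != bin_counts_b) & either_nonzero
jac_sim = 1 - differ.sum() / either_nonzero.sum()

print(cos_sim)  # very close to 1
print(jac_sim)  # 0.0
```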

For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions, and is not computed by pdist2.
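As a sketch of that last suggestion, the Kullback-Leibler divergence can be computed after normalizing the counts to probabilities; the `eps` smoothing below is an assumption added here to cope with empty bins, since KL is undefined when Q has zeros where P does not:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) between two histograms given as bin counts.
    eps is an assumed smoothing constant to avoid log(0) and
    division by zero in empty bins."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# KL is not symmetric: KL(P||Q) and KL(Q||P) generally differ.
p = [689, 430, 311, 135]
q = [569, 402, 200, 166]
print(kl_divergence(p, q), kl_divergence(q, p))
```

The asymmetry is the caveat the answer mentions: which histogram plays the role of P changes the result, so KL is a divergence rather than a distance.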

huangapple
  • Published on 2023-06-26 21:56:19
  • Original link: https://go.coder-hub.com/76557362.html