Interpretation of cosine similarity and jaccard similarity (similarity of histograms)
Introduction
I would like to assess the similarity between two "bin counts" arrays (related to two histograms), by using the Matlab "pdist2" function:
% Input
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];
% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])
% Calculation of similarities
cosine_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
% Output
cosine_similarity =
0.95473215802008
jaccard_similarity =
0.0769230769230769
Question
If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?
Answer 1
Score: 3
The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.
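To see where 0.0769 comes from, the count can be reproduced by hand. This is a minimal sketch that assumes 'jaccard' compares only the coordinates where at least one vector is nonzero and counts how many of them differ; the variable names are illustrative and not part of pdist2.

```matlab
% Bin counts from the question.
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];

% Coordinates where at least one of the two vectors is nonzero (assumed rule).
either_nonzero = (bin_counts_a ~= 0) | (bin_counts_b ~= 0);

% Among those, the coordinates whose values are not exactly equal.
differing = (bin_counts_a ~= bin_counts_b) & either_nonzero;

n_compared = sum(either_nonzero)   % 26 coordinates are compared
n_equal = n_compared - sum(differing)   % only 2 of them match exactly

% One minus the fraction of compared coordinates that differ.
jaccard_similarity_manual = 1 - sum(differing) / n_compared   % 2/26 = 0.0769...
```

Only bins 14 (4 vs 4) and 50 (1 vs 1) hold identical counts, so the similarity is 2/26 regardless of how close the remaining bins are.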
For instance, assume bin_counts_a as in your example and
bin_counts_b = bin_counts_a + 1;
Then
>> cosine_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
cosine_similarity =
0.999971577948095
is almost 1, as expected, because the bin counts are very similar. However,
>> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
jaccard_similarity =
0
gives 0 because each entry in bin_counts_b is (slightly) different from that in bin_counts_a.
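The same contrast can be checked without pdist2. The sketch below assumes the usual definition of cosine similarity (dot product divided by the product of the Euclidean norms) and uses illustrative variable names; it should land close to the value reported above.

```matlab
% bin_counts_a from the question; bin_counts_b is shifted by one count per bin.
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = bin_counts_a + 1;

% Cosine similarity: depends only on the angle between the two vectors,
% so a small uniform shift barely changes it.
cosine_similarity_manual = dot(bin_counts_a, bin_counts_b) / (norm(bin_counts_a) * norm(bin_counts_b))

% No bin holds exactly the same count in both vectors, which is why a
% measure based on exact matches reports no similarity at all.
any_exact_match = any(bin_counts_a == bin_counts_b)
```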
For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions and is not computed by pdist2.
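Since pdist2 does not provide it, here is a rough sketch of how the Kullback-Leibler divergence could be computed from the bin counts. The smoothing constant is an assumption introduced only to avoid taking the logarithm of zero in empty bins; its value affects the result, so treat the numbers as indicative rather than definitive.

```matlab
% Bin counts from the question.
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];

% Hypothetical smoothing constant (an assumption, not part of the original answer).
smoothing = 1e-10;

% Turn the counts into probability distributions that sum to 1.
p = (bin_counts_a + smoothing) / sum(bin_counts_a + smoothing);
q = (bin_counts_b + smoothing) / sum(bin_counts_b + smoothing);

% The divergence is computed in both directions because it is not symmetric.
kl_a_to_b = sum(p .* log(p ./ q))
kl_b_to_a = sum(q .* log(q ./ p))
```

Both values are 0 only when the two distributions are identical, and they generally differ from each other, which is the asymmetry mentioned above.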