Interpretation of cosine similarity and Jaccard similarity (similarity of histograms)


Introduction

I would like to assess the similarity between two "bin counts" arrays (related to two histograms) using the MATLAB pdist2 function:

  % Input
  bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
  bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];
  % Visualize the two "bin counts" vectors as bars:
  bar(1:length(bin_counts_a), [bin_counts_a; bin_counts_b])

[Bar chart comparing bin_counts_a and bin_counts_b]

  % Calculation of similarities
  cosine_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'cosine')
  jaccard_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'jaccard')

  % Output
  cosine_similarity =
      0.95473215802008
  jaccard_similarity =
      0.0769230769230769

Question

If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?

Answer 1

Score: 3

The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.
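To make that rule concrete, here is a minimal sketch (in Python/NumPy rather than MATLAB; `jaccard_distance` is a hypothetical helper that follows the documented definition, not pdist2's actual implementation):

```python
import numpy as np

def jaccard_distance(x, y):
    """Fraction of the coordinates where at least one vector is
    nonzero in which the two vectors disagree (the documented
    meaning of pdist2's 'jaccard' distance)."""
    x, y = np.asarray(x), np.asarray(y)
    either_nonzero = (x != 0) | (y != 0)
    differ = (x != y) & either_nonzero
    return differ.sum() / either_nonzero.sum()

# Four coordinates are nonzero in at least one vector;
# two of them disagree, so the distance is 2/4 = 0.5.
x = [3, 0, 5, 1]
y = [3, 2, 5, 4]
print(jaccard_distance(x, y))  # 0.5
```

Note that the mismatches 1 vs 4 and 0 vs 2 count exactly as much as any other mismatch would, no matter how large or small the difference is.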

For instance, assume bin_counts_a as in your example and

  bin_counts_b = bin_counts_a + 1;

Then

  >> cosine_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'cosine')
  cosine_similarity =
      0.999971577948095

is almost 1 as expected, because the bin counts are very similar. However,

  >> jaccard_similarity = 1 - pdist2(bin_counts_a, bin_counts_b, 'jaccard')
  jaccard_similarity =
      0

gives 0, because every entry in bin_counts_b is (slightly) different from the corresponding entry in bin_counts_a.
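The same experiment can be sketched outside MATLAB; the following Python/NumPy snippet applies the standard cosine formula and the documented 'jaccard' rule directly (an illustration of the definitions, not of pdist2 itself):

```python
import numpy as np

bin_counts_a = np.array([689, 430, 311, 135, 66, 67, 99, 23, 37, 19,
                         8, 4, 3, 4, 1, 3, 1, 0, 0, 0, 0, 0, 1, 0, 0,
                         0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                         0, 0, 0, 0, 1, 0, 0, 0, 0, 1])
bin_counts_b = bin_counts_a + 1

# Cosine similarity: dot product over the product of the norms.
cos_sim = bin_counts_a @ bin_counts_b / (
    np.linalg.norm(bin_counts_a) * np.linalg.norm(bin_counts_b))

# Jaccard similarity: every one of the 50 coordinates is nonzero in
# bin_counts_b and differs from bin_counts_a, so the similarity is 0.
either_nonzero = (bin_counts_a != 0) | (bin_counts_b != 0)
differ = (bin_counts_a != bin_counts_b) & either_nonzero
jac_sim = 1 - differ.sum() / either_nonzero.sum()

print(cos_sim)  # very close to 1
print(jac_sim)  # 0.0
```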

For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions, and is not computed by pdist2.
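As a sketch of that last suggestion, the Kullback-Leibler divergence can be computed after normalizing the counts to probabilities; the `eps` smoothing below is an assumption added here to cope with empty bins, since KL is undefined when Q has zeros where P does not:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) between two histograms given as bin counts.
    eps is an assumed smoothing constant to avoid log(0) and
    division by zero in empty bins."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# KL is not symmetric: KL(P||Q) and KL(Q||P) generally differ.
p = [689, 430, 311, 135]
q = [569, 402, 200, 166]
print(kl_divergence(p, q), kl_divergence(q, p))
```

The asymmetry is the caveat the answer mentions: which histogram plays the role of P changes the result, so KL is a divergence rather than a distance.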

huangapple
  • Published on 2023-06-26 21:56:19
  • Original link: https://go.coder-hub.com/76557362.html