July 13, 2023, 19:19:41

langchain's chroma `vectordb.similarity_search_with_score()` and `vectordb.similarity_search_with_relevance_scores()` return the same output

Question

I have been working with langchain's chroma vectordb. It has two methods for running similarity search with scores.

vectordb.similarity_search_with_score()
vectordb.similarity_search_with_relevance_scores()

According to the documentation, the first one should return the cosine distance as a float; the smaller the value, the better the match.

The second one should return a score from 0 to 1, where 0 means dissimilar and 1 means most similar.

But when I tried both, they gave me exactly the same results with the same scores, and those scores exceed the upper limit of 1. That should not happen with the second function.

What's going on here?

Answer 1

Score: 2

I have run into this issue as well:

vectordb.similarity_search() and vectordb.similarity_search_with_score() return exactly the same top n chunks in the same order. similarity_search_with_score() also returns a score for each chunk. I think this score is important for filtering out irrelevant chunks.

On the other hand, I have read that vectordb.similarity_search_with_relevance_scores() is more sophisticated and needs more processing to calculate the similarity score, but in dozens of comparisons it gave me exactly the same results as vectordb.similarity_search_with_score(), in nearly the same time.

Another thing that caught my attention is the meaning of the scores that both methods return. The official documentation states that the smaller the score, the higher the similarity. I also read that the score ranges from 0 to 1.

In my tests, I got scores outside that range. For example, some unrelated results had scores of 1.9, 2.03 and 0.03 😮...

In my experience, scores between 0.8 and 1.2 indicate higher similarity.
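As a rough illustration of the filtering idea above, here is a minimal sketch. It assumes the results are (document, score) pairs and that the score is a distance, so lower means more similar. The function name filter_by_distance, the threshold and the sample data are all made up for this example; whether lower or higher is better depends on the method and the distance metric in use.

```python
from typing import List, Tuple, TypeVar

Doc = TypeVar("Doc")


def filter_by_distance(
    results: List[Tuple[Doc, float]], max_distance: float
) -> List[Tuple[Doc, float]]:
    """Keep only the (document, score) pairs whose score is at most max_distance.

    Assumes the score is a distance (lower = more similar).
    """
    return [(doc, score) for doc, score in results if score <= max_distance]


# Stand-in data: plain strings instead of real documents, made-up scores.
results = [("chunk A", 0.21), ("chunk B", 0.95), ("chunk C", 1.9)]

kept = filter_by_distance(results, max_distance=1.0)
print(kept)  # [('chunk A', 0.21), ('chunk B', 0.95)]
```

A sensible threshold has to be found empirically for a given embedding model and metric.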

Answer 2

Score: 1

The official documentation refers to cosine distance, not cosine similarity.

Cosine Similarity: Measures the cosine of the angle between vectors, indicating their similarity. Higher values mean greater similarity.

Cosine Distance: Measures the dissimilarity between vectors as the complement of the cosine similarity. Higher values mean greater dissimilarity.

cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
cosine_distance(A, B) = 1 - cosine_similarity(A, B)
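The two formulas can be checked with a small pure-Python sketch (no vector database involved; the sample vectors are arbitrary). Note that cosine distance ranges from 0 to 2, so a distance above 1 is possible when the vectors point in roughly opposite directions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """(A . B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine_similarity(A, B)"""
    return 1.0 - cosine_similarity(a, b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction: 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: 2.0
```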

Answer 3

Score: 1

If you are using Chroma, you should set the distance metric when creating the collection: https://docs.trychroma.com/usage-guide#changing-the-distance-function

The default distance is l2. That is why similarity_search_with_relevance_scores() used to give me scores like 3626.016357421875. After changing it to cosine, the scores are now in the range (0, 1], and scores closer to 1 indicate higher similarity.

Chroma.from_documents(documents=documents, embedding=cohere, collection_metadata={"hnsw:space": "cosine"})
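To see why the metric matters so much, here is a small pure-Python sketch that compares the two distances on the same pair of vectors. It assumes that l2 means squared Euclidean distance, and the "1 - distance" relevance mapping at the end is only an illustrative assumption, not necessarily what the library does internally. The vectors are made up.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; grows with vector magnitude, no upper bound."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; depends only on direction, always within [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [30.0, 40.0]
doc = [60.0, 80.0]  # same direction as the query, twice the length

print(squared_l2(query, doc))       # 2500.0
print(cosine_distance(query, doc))  # 0.0

# Assumed mapping from cosine distance to a "higher is better" relevance score.
relevance = 1.0 - cosine_distance(query, doc)
print(relevance)                    # 1.0
```

With unnormalized embeddings, the l2 value can be in the thousands even for vectors that point in the same direction, which is consistent with the large scores described above.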
