LangChain's Chroma `vectordb.similarity_search_with_score()` and `vectordb.similarity_search_with_relevance_scores()` return the same output

Question

I have been working with langchain's chroma vectordb. It has two methods for running similarity search with scores.

  1. vectordb.similarity_search_with_score()
  2. vectordb.similarity_search_with_relevance_scores()

According to the documentation, the first one should return the cosine distance as a float, where smaller is better.

The second one should return a score between 0 and 1, where 0 means dissimilar and 1 means most similar.

But when I tried both, they gave me exactly the same results with the same scores. Some of those scores exceed the upper limit of 1, which should not happen for the second function.

What's going on here?

Answer 1

Score: 2

I have run into this issue as well:

vectordb.similarity_search() and vectordb.similarity_search_with_score() return exactly the same top n chunks in the same order; similarity_search_with_score() additionally returns the score. I think this score is important for filtering out irrelevant chunks.

On the other hand, I have read that vectordb.similarity_search_with_relevance_scores() is more sophisticated and needs more processing to compute the similarity score, yet across dozens of comparisons I got exactly the same results, in nearly the same time, as with vectordb.similarity_search_with_score().

Another thing that caught my attention is the meaning of the scores returned by both methods. The official documentation states that the smaller the score, the higher the similarity. I also read that the score ranges from 0 to 1.

In my tests, the scores did not match this. For example, some unrelated results scored 1.9, 2.03 and 0.03 😮

In my experience, scores between 0.8 and 1.2 indicate higher similarity.
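
To illustrate the filtering idea mentioned above, here is a minimal sketch in plain Python. It assumes the results come back as (document, score) pairs and that the score is a distance (lower is better); the sample data, the helper name filter_by_max_distance and the 0.5 cutoff are all made up for illustration, and a sensible cutoff depends on the distance metric and embedding model in use.

```python
from typing import List, Tuple


def filter_by_max_distance(
    results: List[Tuple[str, float]], max_distance: float
) -> List[Tuple[str, float]]:
    """Keep only the (document, distance) pairs at or below the cutoff.

    Hypothetical helper: assumes the score is a distance, so lower is better.
    """
    return [(doc, score) for doc, score in results if score <= max_distance]


# Made-up stand-in for what a scored similarity search might return.
sample_results = [
    ("chunk about Chroma distance metrics", 0.21),
    ("chunk about cooking pasta", 1.87),
    ("chunk about cosine similarity", 0.35),
]

# The 0.5 threshold is an arbitrary illustrative value, not a recommendation.
kept = filter_by_max_distance(sample_results, max_distance=0.5)
for doc, score in kept:
    print(f"{score:.2f}  {doc}")
```

If the score were a similarity instead (higher is better), the comparison would have to be flipped, which is why it matters which of the two kinds of score a method actually returns.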

Answer 2

Score: 1

The official documentation refers to cosine distance, not cosine similarity.

Cosine Similarity: Measures the cosine of the angle between vectors, indicating their similarity. Higher values mean greater similarity.

Cosine Distance: Measures the dissimilarity between vectors as the complement of the cosine similarity. Higher values mean greater dissimilarity.

cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
cosine_distance(A, B) = 1 - cosine_similarity(A, B)
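
The two formulas above can be checked with a small pure-Python sketch. The vectors are toy values chosen for illustration, not real embeddings. Note that cosine similarity ranges from -1 to 1, so cosine distance ranges from 0 to 2 rather than 0 to 1.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """(A . B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine_similarity(A, B)"""
    return 1.0 - cosine_similarity(a, b)


# Same direction, different length: distance is (approximately) 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite directions: distance is 2, the maximum.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```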

Answer 3

Score: 1

If you are using Chroma, you should set the distance metric when creating a collection: https://docs.trychroma.com/usage-guide#changing-the-distance-function

The default distance is l2. That is why similarity_search_with_relevance_scores() used to give me scores like 3626.016357421875. After changing it to cosine, the scores now fall in (0, 1], with scores closer to 1 indicating higher similarity.

Chroma.from_documents(documents=documents, embedding=cohere, collection_metadata={"hnsw:space": "cosine"})
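
To see why the choice of metric changes the scale of the scores so much, here is a rough pure-Python sketch. It assumes that the l2 space means squared Euclidean distance, which is how the Chroma documentation describes it; the vectors are toy values, not real embeddings.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (assumed to match the 'l2' space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two perpendicular toy vectors with large magnitudes.
raw_a = [30.0, 40.0]
raw_b = [-40.0, 30.0]

# Squared L2 grows with vector magnitude: 70**2 + 10**2 = 5000.
print(squared_l2(raw_a, raw_b))
# Cosine distance ignores magnitude: 1.0 for perpendicular vectors.
print(cosine_distance(raw_a, raw_b))

# For unit-length vectors, squared L2 equals 2 * cosine distance,
# so it can still reach values above 1 (up to 4).
unit_a, unit_b = normalize(raw_a), normalize(raw_b)
print(squared_l2(unit_a, unit_b), 2 * cosine_distance(unit_a, unit_b))
```

So with l2, scores in the thousands are possible when the embeddings are not normalized, and even with normalized embeddings a raw distance can exceed 1.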

huangapple
  • Posted on July 13, 2023, 19:19:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76678783.html