Find similarity in a very precise numerical environment

Question

I have a list of 100+ sentences, and I need to find which one is closest to a user prompt.
The thing is that we are dealing with very precise, nuanced prompts, because we analyze numeric data.
Example:

1. Did variable x changed at least 5% in the past week ?
2. show me variable x change in the past week

In this example, sentences 1 and 2 are totally different in the context of a chart but similar in the global context, yet most simple models like spaCy will rate them as very similar (0.9+) because they share many words.

What is the way to go to train a model, or to use a pre-trained one, to find similarity in a very precise numerical environment like this, where sentences share many words but have totally different meanings?

I used this spaCy model:

import spacy

# A model with word vectors (e.g. en_core_web_md) is required for similarity
nlp = spacy.load("en_core_web_md")

# user_prompt and the list of 100+ sentences are defined elsewhere
prompt_doc = nlp(user_prompt)
similarities = []

for sentence in sentences:
    sentence_doc = nlp(sentence)
    similarity = prompt_doc.similarity(sentence_doc)
    similarities.append(similarity)
    print("Sentence:", sentence)
    print("Similarity rating:", similarity)
    print()

The result for 100 sentences like the above was that all of them had around 0.8-0.9 similarity, which is very wrong.

Answer 1

Score: 1

Have you tried using Google's Universal Sentence Encoder?

Here's a code snippet which uses the encoder (I used Google Colab to run the code). For your example it returns a similarity of 0.6643186, much lower than the similarity you get with spaCy.

Some additional info: looking at the spaCy documentation about similarity, it seems that spaCy uses non-contextualized word vectors in conjunction with vector averaging to compute the similarities, which is the reason why it doesn't work for you: spaCy's implementation considers neither word order nor the context of a word. In contrast, Google's Universal Sentence Encoder will give you one single vector for the entire sentence, thereby considering both word order and word context.
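
For illustration, a minimal sketch of the averaging problem (assuming en_core_web_md, a spaCy pipeline that ships with static word vectors): reordering the same words leaves the averaged document vector, and therefore the similarity score, unchanged.

import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")  # assumed model; any pipeline with static vectors behaves the same

doc_a = nlp("show me variable x change in the past week")
doc_b = nlp("in the past week show me variable x change")  # same words, reordered

# Doc.vector is the average of the token vectors, so word order is lost
print(np.allclose(doc_a.vector, doc_b.vector))  # True
print(doc_a.similarity(doc_b))                  # ~1.0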

Let me know if this helps!

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)

def embed(sentences):
    # Maps each sentence to a single 512-dimensional vector
    return model(sentences)

sentence_1 = "Did variable x changed at least 5% in the past week ?"
sentence_2 = "show me variable x change in the past week"

embeddings = embed([sentence_1, sentence_2])
corr = np.inner(embeddings[0], embeddings[1])

print(corr)  # 0.6643186
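
Applied to the original use case, the closest of the 100+ stored sentences can then be found by embedding everything in one batch and ranking by cosine similarity. A minimal sketch reusing the embed function above (the candidate list is a placeholder for the real sentences); unit-normalizing the vectors first guarantees the inner product is a true cosine similarity regardless of vector length.

# Placeholder candidates; replace with the real list of 100+ sentences
user_prompt = "Did variable x changed at least 5% in the past week ?"
candidates = [
    "show me variable x change in the past week",
    "Did variable y changed at least 5% in the past week ?",
]

emb = embed([user_prompt] + candidates).numpy()
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize each row

scores = emb[1:] @ emb[0]        # cosine similarity of every candidate vs. the prompt
best = int(np.argmax(scores))
print(candidates[best], scores[best])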
