Find similarity in a very precise numerical environment
Question
I have a list of 100+ sentences, and I need to find which is the closest to a user prompt.
The thing is that we are dealing with very precise, nuanced prompts, because we analyze numeric data.
Example:
1. Did variable x changed at least 5% in the past week ?
2. show me variable x change in the past week
In this example, sentences 1 and 2 are totally different in the context of a chart but similar in the global context, yet most simple models like spaCy
will rate them as very similar (0.9+) because they share many of the same words.
What is the way to go to train a model, or to use a pretrained model, to find similarity in a very precise numerical environment like this, where sentences share many words but have totally different meanings?
I used this spaCy model:
import spacy

# user_prompt and sentences are defined elsewhere; the model name is assumed here
# (any spaCy pipeline that ships with static word vectors, e.g. en_core_web_md).
nlp = spacy.load("en_core_web_md")

prompt_doc = nlp(user_prompt)
similarities = []
for sentence in sentences:
    sentence_doc = nlp(sentence)
    similarity = prompt_doc.similarity(sentence_doc)
    similarities.append(similarity)
    print("Sentence:", sentence)
    print("Similarity rating:", similarity)
    print()
The result for 100 sentences like the above was that all of them had a similarity of around 0.8-0.9, which is very wrong.
Answer 1
Score: 1
Have you tried using Google's Universal Sentence Encoder?
Here's a code snippet which uses the encoder (I used Google Colab to run the code). For your example it returns a similarity of 0.6643186, much lower than the similarity you get with spaCy.
Some additional info: Looking at the spaCy documentation about similarity, it seems like spaCy uses non-contextualized word vectors in conjunction with vector averaging to compute the similarities, which is the reason why it doesn't work for you: spaCy's implementation considers neither word order nor the context of a word. In contrast, Google's Universal Sentence Encoder will give you one single vector for an entire sentence, thereby taking both word order and word context into account.
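To make the averaging point concrete, here is a minimal sketch (it assumes spaCy's en_core_web_md model, which ships with static word vectors): shuffling the word order leaves the averaged document vector, and therefore the similarity, essentially unchanged.

import spacy

nlp = spacy.load("en_core_web_md")  # any pipeline with static word vectors works

# Same words in a different order -> near-identical averaged vectors.
doc_a = nlp("show me variable x change in the past week")
doc_b = nlp("in the past week show me variable x change")
print(doc_a.similarity(doc_b))  # ~1.0, because averaging ignores word order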
Let me know if this helps!
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)

def embed(input):
    # Maps a list of sentences to one 512-dimensional vector per sentence.
    return model(input)

sentence_1 = "Did variable x changed at least 5% in the past week ?"
sentence_2 = "show me variable x change in the past week"
embeddings = embed([sentence_1, sentence_2])

# Inner product of the (approximately unit-length) sentence vectors.
corr = np.inner(embeddings[0], embeddings[1])
print(corr)  # 0.6643186
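To cover the original use case of 100+ candidate sentences, one possible approach (sketched below with placeholder data; sentences and user_prompt stand in for your own list and prompt) is to embed everything with the same encoder and rank the candidates by cosine similarity against the prompt:

import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Placeholder data -- replace with your real candidate list and prompt.
sentences = [
    "Did variable x changed at least 5% in the past week ?",
    "show me variable x change in the past week",
]
user_prompt = "did variable x move more than 5% over the last 7 days?"

# One encoder call for the prompt plus all candidates.
embeddings = np.asarray(model([user_prompt] + sentences))
prompt_vec, candidate_vecs = embeddings[0], embeddings[1:]

# Normalize so the inner product is exactly cosine similarity.
prompt_vec = prompt_vec / np.linalg.norm(prompt_vec)
candidate_vecs = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
scores = candidate_vecs @ prompt_vec

best = int(np.argmax(scores))
print("Closest sentence:", sentences[best])
print("Cosine similarity:", scores[best])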
Comments