Find similarity in a very precise numerical environment

Question


I have a list of 100+ sentences, and I need to find which one is closest to a user prompt.
The thing is that we are dealing with very precise, nuanced prompts, because we analyze numeric data.
Example:

1. Did variable x changed at least 5% in the past week ?
2. show me variable x change in the past week

In this example, sentences 1 and 2 are totally different in the context of a chart but similar in the global context, yet most simple models like spaCy will rate them as very similar (0.9+) because they have many words in common.

What is the way to go to train a model, or to use a pretrained one, to find similarity in a very precise numerical environment like this, where sentences share many words but have totally different meanings?

I used this spaCy model:

import spacy

# Assuming a pipeline with static word vectors, e.g. en_core_web_md;
# user_prompt (str) and sentences (list of str) are defined elsewhere
nlp = spacy.load("en_core_web_md")

prompt_doc = nlp(user_prompt)
similarities = []

for sentence in sentences:
    sentence_doc = nlp(sentence)
    similarity = prompt_doc.similarity(sentence_doc)
    similarities.append(similarity)
    print("Sentence:", sentence)
    print("Similarity rating:", similarity)
    print()

The result for 100 sentences like the above was that all of them scored around 0.8-0.9 similarity, which is very wrong.

Answer 1

Score: 1


Have you tried using Google's Universal Sentence Encoder?

Here's a code snippet which uses the encoder (I used Google Colab to run the code). For your example it returns a similarity of 0.6643186, much lower than the similarity you get with spaCy.

Some additional info: looking at the spaCy documentation about similarity, it seems that spaCy uses non-contextualized word vectors in conjunction with vector averaging to compute similarities, which is why it doesn't work for you: spaCy's implementation considers neither word order nor the context of a word. In contrast, Google's Universal Sentence Encoder gives you a single vector for the entire sentence, thereby taking both word order and word context into account.
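
As a quick illustration of that averaging behavior, here is a minimal sketch (the en_core_web_md model is my assumption, not something from the question): the doc vector is just the mean of the token vectors, so reordering the words does not change the similarity at all.

import numpy as np
import spacy

# Assumption: en_core_web_md (any pipeline with static word vectors works)
nlp = spacy.load("en_core_web_md")

doc = nlp("show me variable x change in the past week")

# Doc.vector is the average of the token vectors ...
manual_avg = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_avg))  # True

# ... so the same words in any order produce the same doc vector
shuffled = nlp("week past the in change x variable me show")
print(doc.similarity(shuffled))  # ~1.0: word order is ignored entirely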

Let me know if this helps!

import numpy as np
import tensorflow_hub as hub  # requires tensorflow to be installed

# Load the Universal Sentence Encoder from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)

def embed(texts):
    # Returns one 512-dimensional embedding per input sentence
    return model(texts)

sentence_1 = "Did variable x changed at least 5% in the past week ?"
sentence_2 = "show me variable x change in the past week"

embeddings = embed([sentence_1, sentence_2])

# USE embeddings are approximately unit-length, so the inner product
# is effectively the cosine similarity
corr = np.inner(embeddings[0], embeddings[1])

print(corr)  # 0.6643186
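
To cover the original use case of picking the closest of 100+ sentences, the same model can embed the prompt and all candidates in one batch and rank them by inner product. Here is a minimal sketch reusing embed from above; the sentences and user_prompt values are placeholders of my own, not from the question:

# Hypothetical candidate list; substitute the real 100+ sentences
sentences = [
    "Did variable x changed at least 5% in the past week ?",
    "show me variable x change in the past week",
    "what is the average of variable x this month",
]
user_prompt = "plot the weekly change of variable x"

# One batch: the first vector is the prompt, the rest are candidates
vectors = embed([user_prompt] + sentences)
prompt_vec, candidate_vecs = vectors[0], vectors[1:]

# Inner products against all candidates at once (vectors are ~unit-length)
scores = np.inner(candidate_vecs, prompt_vec)

best = int(np.argmax(scores))
print("Best match:", sentences[best], "with score", scores[best])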
