How can I find the cosine similarity between two song lyrics represented as strings?


Question


My friends and I are doing an NLP project on song recommendation.

Context: We originally planned to have the model produce a recommended playlist of songs whose lyrics are most similar to a random input corpus (from literature, etc.), but we didn't have a concrete idea of how to implement it.

Currently our task is to find lyrics similar to a random lyric fed in as a string. We are using a Sentence-BERT (SBERT) model and cosine similarity to measure the similarity between songs, and the output scores seem meaningful enough to find the most similar song lyrics.

Is there any other way that we can improve this approach?

We'd like to use a BERT model and are open to suggestions that can be used on top of BERT, but if there are other models that should be used instead of BERT, we'd be happy to learn. Thanks.

Answer 1

Score: 0


Computing cosine similarity

You can use the util.cos_sim(embeddings1, embeddings2) from the sentence-transformers package to compute the cosine similarity of two embeddings.
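For example, here is a minimal sketch (the checkpoint 'all-MiniLM-L6-v2' is just one illustrative SBERT model, and the lyrics are placeholders):

from sentence_transformers import SentenceTransformer, util

# Load a pretrained SBERT model (illustrative choice of checkpoint).
model = SentenceTransformer('all-MiniLM-L6-v2')

candidate_lyrics = [
    "placeholder lyric one",
    "placeholder lyric two",
]
query_lyric = "placeholder query lyric"

# Encode lyrics into dense sentence embeddings.
candidate_embeddings = model.encode(candidate_lyrics, convert_to_tensor=True)
query_embedding = model.encode(query_lyric, convert_to_tensor=True)

# Cosine similarity between the query and every candidate; shape (1, 2).
scores = util.cos_sim(query_embedding, candidate_embeddings)
print(scores)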

Alternatively, you can also use sklearn.metrics.pairwise.cosine_similarity(X, Y, dense_output=True) from the scikit-learn package.
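A corresponding sketch with scikit-learn, assuming the embeddings are already available as NumPy arrays (random vectors stand in for real embeddings here):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in embeddings: 1 query vector and 3 candidate vectors of dimension 384.
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 384))
Y = rng.normal(size=(3, 384))

scores = cosine_similarity(X, Y, dense_output=True)  # shape (1, 3)
print(scores.argmax(axis=1))  # index of the most similar candidate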

Improvements for representation and models

Since you want recommendations just on top of BERT, you can also consider RoBERTa, which uses a byte-pair encoding (BPE) tokenizer rather than BERT's WordPiece tokenizer. Consider the roberta-base model as a feature extractor from the Hugging Face transformers package.

from transformers import RobertaTokenizer, RobertaModel

# Load the pretrained tokenizer and model.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

text = "song lyrics in text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state has shape (batch, seq_len, hidden_size).
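The model returns one vector per token, so to compare whole lyrics with cosine similarity you still need a single sentence vector. One common choice (an assumption here, not something the model prescribes) is masked mean pooling over the last hidden state, continuing from the variables above:

# Mean-pool token embeddings, ignoring padding positions via the attention mask.
mask = encoded_input['attention_mask'].unsqueeze(-1)   # (1, seq_len, 1)
summed = (output.last_hidden_state * mask).sum(dim=1)  # (1, hidden_size)
sentence_embedding = summed / mask.sum(dim=1)          # (1, hidden_size)

# Two such vectors can then be scored with, e.g.,
# torch.nn.functional.cosine_similarity(emb_a, emb_b).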

Tokenizers work at different levels of text granularity, both syntactic and semantic, and they help generate quality vectors/embeddings. Each can yield different, and better, results if fine-tuned for the right task and model.

Some other tokenizers you can consider are:
character-level BPE, byte-level BPE, WordPiece (which BERT uses), SentencePiece, and the Unigram language-model tokenizer; a comparison sketch follows below.
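As a quick illustration (assuming the transformers package; the model names are the standard public checkpoints), you can compare how BERT's WordPiece and RoBERTa's byte-level BPE split the same text:

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

text = "song lyrics in text."
# WordPiece: continuation pieces are prefixed with '##'.
print(bert_tokenizer.tokenize(text))
# Byte-level BPE: a leading 'Ġ' marks a preceding space.
print(roberta_tokenizer.tokenize(text))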

Also consider exploring the official Hugging Face Tokenizers library guide here.
