How can I find the cosine similarity between two song lyrics represented as strings?


Question


My friends and I are doing an NLP project on song recommendation.

Context: We originally planned to have the model produce a recommended playlist of songs whose lyrics are most similar to a random input corpus (from literature, etc.), but we didn't have a concrete idea of how to implement it.

Currently our task is to find lyrics similar to a random lyric fed in as a string. We are using a Sentence-BERT (SBERT) model and cosine similarity to measure the similarity between songs, and the output scores seem meaningful enough to find the most similar song lyrics.

Is there any other way that we can improve this approach?

We'd like to use a BERT model and are open to suggestions that can be used on top of BERT, but if there are other models that should be used instead of BERT, we'd be happy to learn. Thanks.

Answer 1

Score: 0


Computing cosine similarity

You can use the util.cos_sim(embeddings1, embeddings2) from the sentence-transformers package to compute the cosine similarity of two embeddings.
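For example, here is a minimal sketch (the checkpoint 'all-MiniLM-L6-v2' is just one illustrative SBERT model, and the lyrics are placeholders):

from sentence_transformers import SentenceTransformer, util

# Load a pretrained SBERT model (illustrative choice of checkpoint).
model = SentenceTransformer('all-MiniLM-L6-v2')

candidate_lyrics = [
    "placeholder lyric one",
    "placeholder lyric two",
]
query_lyric = "placeholder query lyric"

# Encode lyrics into dense sentence embeddings.
candidate_embeddings = model.encode(candidate_lyrics, convert_to_tensor=True)
query_embedding = model.encode(query_lyric, convert_to_tensor=True)

# Cosine similarity between the query and every candidate; shape (1, 2).
scores = util.cos_sim(query_embedding, candidate_embeddings)
print(scores)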

Alternatively, you can also use sklearn.metrics.pairwise.cosine_similarity(X, Y, dense_output=True) from the scikit-learn package.
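A corresponding sketch with scikit-learn, assuming the embeddings are already available as NumPy arrays (random vectors stand in for real embeddings here):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in embeddings: 1 query vector and 3 candidate vectors of dimension 384.
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 384))
Y = rng.normal(size=(3, 384))

scores = cosine_similarity(X, Y, dense_output=True)  # shape (1, 3)
print(scores.argmax(axis=1))  # index of the most similar candidate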

Improvements for representation and models

Since you want recommendations just on top of BERT, you can also consider RoBERTa, which uses a byte-pair encoding (BPE) tokenizer rather than BERT's WordPiece tokenizer. Consider the roberta-base model as a feature extractor from the Hugging Face transformers package.

from transformers import RobertaTokenizer, RobertaModel

# Load the pretrained tokenizer and model.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

text = "song lyrics in text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state has shape (batch, seq_len, hidden_size).
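The model returns one vector per token, so to compare whole lyrics with cosine similarity you still need a single sentence vector. One common choice (an assumption here, not something the model prescribes) is masked mean pooling over the last hidden state, continuing from the variables above:

# Mean-pool token embeddings, ignoring padding positions via the attention mask.
mask = encoded_input['attention_mask'].unsqueeze(-1)   # (1, seq_len, 1)
summed = (output.last_hidden_state * mask).sum(dim=1)  # (1, hidden_size)
sentence_embedding = summed / mask.sum(dim=1)          # (1, hidden_size)

# Two such vectors can then be scored with, e.g.,
# torch.nn.functional.cosine_similarity(emb_a, emb_b).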

Tokenizers work at different levels of text granularity, both syntactic and semantic, and they help generate quality vectors/embeddings. Each can yield different, and better, results if fine-tuned for the right task and model.

Some other tokenizers you can consider are:
character-level BPE, byte-level BPE, WordPiece (which BERT uses), SentencePiece, and the Unigram language-model tokenizer; a comparison sketch follows below.
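As a quick illustration (assuming the transformers package; the model names are the standard public checkpoints), you can compare how BERT's WordPiece and RoBERTa's byte-level BPE split the same text:

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

text = "song lyrics in text."
# WordPiece: continuation pieces are prefixed with '##'.
print(bert_tokenizer.tokenize(text))
# Byte-level BPE: a leading 'Ġ' marks a preceding space.
print(roberta_tokenizer.tokenize(text))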

Also consider exploring the official Hugging Face Tokenizers library guide here.
