Relationship between embedding models and LLM's inference models in a RAG architecture
Question
I am trying to implement a RAG architecture in AWS with documents that are in Spanish.
My question is the following: does it matter if I generate the embeddings of the documents with a model trained on English-only or multilingual data? Or do I have to generate the embeddings with a model trained specifically on Spanish?
I am currently using the GPT-J-6b model to generate the embeddings and the Falcon-40b model to generate the response (inference), but when doing the similarity search I do not get good results.
The other question I have is: is it good practice to use the same model both to generate the embeddings and to run inference?
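For reference, below is a minimal sketch of the embedding and similarity-search step being described, assuming the embeddings are obtained by mean-pooling the model's last hidden states (a common way to use a causal LM such as GPT-J as an embedder; the actual pooling used in this setup is not stated above):

```python
# Minimal sketch: embed texts with a causal LM by mean-pooling hidden states,
# then compare with cosine similarity. Pooling strategy is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # model id as on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # loading the full 6B model is heavy

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    # Mean-pool over the sequence dimension to get one vector per text.
    return hidden.mean(dim=1).squeeze(0)

query = embed("¿Cuál es el plazo de entrega?")
doc = embed("El plazo de entrega es de 30 días hábiles.")
# Cosine similarity drives the retrieval step; low scores here are the symptom described.
score = torch.nn.functional.cosine_similarity(query, doc, dim=0)
print(float(score))
```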
Answer 1
Score: 1
GPT-J-6b is trained on The Pile, which is mainly English, except for the EuroParl part, which contains Spanish, though probably not from the same domain as your text. This makes GPT-J-6b not very appropriate for generating embeddings for Spanish text.
You should use a model trained on Spanish data, either Spanish-only or multilingual. Of course, the more the training data's domain differs from yours, the worse the matches you will get.
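For example, here is a minimal sketch of swapping in a multilingual embedder with the sentence-transformers library; the model named below is one multilingual model whose training data covers Spanish, used purely as an illustration, not a specific recommendation:

```python
# Minimal sketch: retrieval with a multilingual sentence-embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = [
    "El plazo de entrega es de 30 días hábiles.",
    "La garantía cubre defectos de fabricación.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query_embedding = model.encode("¿Cuánto tarda la entrega?", convert_to_tensor=True)
# Rank documents by cosine similarity against the query.
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)
```

A model like this is also far smaller than GPT-J-6b, so the retrieval side of the pipeline gets cheaper as well as more accurate on Spanish text.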
As for using the same model both to generate the embeddings and to run inference, it should not matter. They are applied to different parts of the architecture: the embedding model serves retrieval, while the LLM serves generation.
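A minimal sketch of that separation, where the embedder only handles retrieval and a different LLM only handles generation; `falcon_generate` is a hypothetical stand-in for however Falcon-40b is invoked (for example a SageMaker endpoint), not a real API:

```python
# Minimal sketch: a RAG step with separate embedding and generation models.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def answer(question: str, docs: list[str], falcon_generate) -> str:
    # 1) Retrieval: pick the most similar document with the embedding model.
    doc_vecs = embedder.encode(docs, convert_to_tensor=True)
    q_vec = embedder.encode(question, convert_to_tensor=True)
    best = int(util.cos_sim(q_vec, doc_vecs).argmax())
    # 2) Generation: a different model turns the retrieved context into a reply.
    prompt = f"Contexto: {docs[best]}\n\nPregunta: {question}\nRespuesta:"
    return falcon_generate(prompt)  # hypothetical call to the Falcon-40b endpoint
```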