Relationship between embedding models and LLM's inference models in a RAG architecture
Question
I am trying to implement a RAG architecture in AWS with documents that are in Spanish.
My question is the following: does it matter if I generate the embeddings of the documents with a model trained on English-only or multilingual data? Or do I have to generate the embeddings with a model trained specifically on Spanish?
I am currently using the GPT-J-6b model to generate the embeddings and the Falcon-40b model to generate the response (inference), but when doing the similarity search I do not get good results.
The other question I have is: is it good practice to use the same model both to generate the embeddings and to run inference?
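For reference, below is a minimal sketch of the embedding and similarity-search step being described, assuming the embeddings are obtained by mean-pooling the model's last hidden states (a common way to use a causal LM such as GPT-J as an embedder; the actual pooling used in this setup is not stated above):

```python
# Minimal sketch: embed texts with a causal LM by mean-pooling hidden states,
# then compare with cosine similarity. Pooling strategy is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # model id as on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # loading the full 6B model is heavy

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    # Mean-pool over the sequence dimension to get one vector per text.
    return hidden.mean(dim=1).squeeze(0)

query = embed("¿Cuál es el plazo de entrega?")
doc = embed("El plazo de entrega es de 30 días hábiles.")
# Cosine similarity drives the retrieval step; low scores here are the symptom described.
score = torch.nn.functional.cosine_similarity(query, doc, dim=0)
print(float(score))
```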
Answer 1
Score: 1
GPT-J-6b is trained on The Pile, which is mainly English, except for the EuroParl part, which contains Spanish, though probably not from the same domain as your text. This makes GPT-J-6b not very appropriate for generating embeddings for Spanish text.
You should use a model trained on Spanish data, either Spanish-only or multilingual. Of course, the more the training data's domain differs from yours, the worse the matches you will get.
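For example, here is a minimal sketch of swapping in a multilingual embedder with the sentence-transformers library; the model named below is one multilingual model whose training data covers Spanish, used purely as an illustration, not a specific recommendation:

```python
# Minimal sketch: retrieval with a multilingual sentence-embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = [
    "El plazo de entrega es de 30 días hábiles.",
    "La garantía cubre defectos de fabricación.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query_embedding = model.encode("¿Cuánto tarda la entrega?", convert_to_tensor=True)
# Rank documents by cosine similarity against the query.
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)
```

A model like this is also far smaller than GPT-J-6b, so the retrieval side of the pipeline gets cheaper as well as more accurate on Spanish text.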
As for using the same model both to generate the embeddings and to run inference, it should not matter. They are applied to different parts of the architecture: the embedding model serves retrieval, while the LLM serves generation.
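A minimal sketch of that separation, where the embedder only handles retrieval and a different LLM only handles generation; `falcon_generate` is a hypothetical stand-in for however Falcon-40b is invoked (for example a SageMaker endpoint), not a real API:

```python
# Minimal sketch: a RAG step with separate embedding and generation models.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def answer(question: str, docs: list[str], falcon_generate) -> str:
    # 1) Retrieval: pick the most similar document with the embedding model.
    doc_vecs = embedder.encode(docs, convert_to_tensor=True)
    q_vec = embedder.encode(question, convert_to_tensor=True)
    best = int(util.cos_sim(q_vec, doc_vecs).argmax())
    # 2) Generation: a different model turns the retrieved context into a reply.
    prompt = f"Contexto: {docs[best]}\n\nPregunta: {question}\nRespuesta:"
    return falcon_generate(prompt)  # hypothetical call to the Falcon-40b endpoint
```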