AutoModelForCausalLM for extracting text embeddings
Question
I have an application that uses AutoModelForCausalLM to answer questions. I need to use this same model to extract embeddings from text. I know that I can use SentenceTransformer, but that would mean loading the weights of the model twice. How would I use AutoModelForCausalLM to extract embeddings from text?
Answer 1
Score: 4
Warning:
As mentioned before in the comments, you need to check whether the produced sentence embeddings are meaningful. This check is required because the model you are using wasn't trained to produce meaningful sentence embeddings (see this StackOverflow answer for further information).
Putting that aside, the following code shows one way to retrieve sentence embeddings from databricks/dolly-v2-3b. It uses a weighted mean pooling approach, because your model is a decoder with left-to-right attention. The idea behind this approach is that tokens at the end of the sentence should contribute more than tokens at the beginning, since their representations are contextualized by the preceding tokens, while tokens at the beginning have far less context.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "databricks/dolly-v2-3b"

t = AutoTokenizer.from_pretrained(model_id)
m = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
m.eval()

texts = [
    "this is a test",
    "this is another test case with a different length",
]

t_input = t(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    last_hidden_state = m(**t_input, output_hidden_states=True).hidden_states[-1]

# Weight each non-padding token by its 1-based position, so that later tokens
# (which are contextualized by everything before them) contribute more.
weights_for_non_padding = t_input.attention_mask * torch.arange(start=1, end=last_hidden_state.shape[1] + 1).unsqueeze(0)

# Weighted mean pooling over the sequence dimension.
sum_embeddings = torch.sum(last_hidden_state * weights_for_non_padding.unsqueeze(-1), dim=1)
num_of_none_padding_tokens = torch.sum(weights_for_non_padding, dim=-1).unsqueeze(-1)
sentence_embeddings = sum_embeddings / num_of_none_padding_tokens

print(t_input.input_ids)
print(weights_for_non_padding)
print(num_of_none_padding_tokens)
print(sentence_embeddings.shape)
Output:
tensor([[2520, 310, 247, 1071, 0, 0, 0, 0, 0],
        [2520, 310, 1529, 1071, 1083, 342, 247, 1027, 2978]])
tensor([[1, 2, 3, 4, 0, 0, 0, 0, 0],
        [1, 2, 3, 4, 5, 6, 7, 8, 9]])
tensor([[10],
        [45]])
torch.Size([2, 2560])
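If you want a quick sanity check that the pooled vectors behave reasonably, you can compare them with cosine similarity. The minimal sketch below assumes the sentence_embeddings tensor produced by the code above and uses torch.nn.functional.cosine_similarity; semantically related sentences should score noticeably higher than unrelated ones if the embeddings are meaningful.

import torch.nn.functional as F

# Assumes `sentence_embeddings` from the snippet above, shape [batch, hidden] = [2, 2560].
# Cosine similarity between the two pooled sentence vectors.
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(similarity.item())

This ties back to the warning above: if unrelated sentences score about as high as related ones, the raw causal-LM embeddings may not be useful without further fine-tuning.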
Comments