How to get the embedding of any vocabulary token in GPT?

Question

I have a GPT model

model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device)

When I send my batch to it I can get the logits and the hidden states:

out = model(batch["input_ids"].to(device), output_hidden_states=True, return_dict=True)
print(out.keys())
>>> odict_keys(['logits', 'past_key_values', 'hidden_states'])

The logits have a shape of:

torch.Size([2, 1024, 42386]) # batch of size 2, sequence length = 1024, vocab size = 42386

Corresponding to (batch, seq_length, vocab_length). If I understand correctly, for each token in the sequence the logits are a vector of size vocab_length which, after passing through a softmax, tells the model which token from the vocabulary to use. I believe that each of these vocabulary tokens should have an embedding.
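
For concreteness, a minimal sketch of that reading of the logits, using the out from the snippet above (the variable names below are just for illustration):

import torch

# out.logits has shape [batch=2, seq_len=1024, vocab=42386]
logits = out.logits

# Softmax over the last (vocabulary) dimension turns each position's logits
# into a probability distribution over all 42386 vocabulary tokens.
probs = torch.softmax(logits, dim=-1)

# Most likely token id at every position, shape [2, 1024]
next_token_ids = probs.argmax(dim=-1)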

From my previous question I found out how to get the embedding of each token in the sequence (shape [2, 1024, 1024] in my setting). But how can I get the embedding of each token in the model's vocabulary? This should be of size [2, 1024, 42386, 1024] (BioGPT has a hidden size of 1024).
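
For reference, a minimal sketch of what I mean by the per-sequence-token embeddings, assuming the usual Hugging Face convention that out.hidden_states is a tuple whose last element is the final layer output:

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# [batch, seq_len, hidden] = [2, 1024, 1024]
sequence_embeddings = out.hidden_states[-1]
print(sequence_embeddings.shape)  # torch.Size([2, 1024, 1024])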

I'm mainly interested in just a few special tokens (e.g., indices 1,2,6,112 out of the 42386).

Answer 1

Score: 1

If I understand correctly, you want an embedding representing a single token from the vocabulary. There are two answers that I know of, depending on which embedding you want exactly.

1st solution

The first layer in the model is a torch.nn.Embedding, which under the hood is a linear layer with no bias, so it has a weight parameter of shape [V, D], where V is the vocabulary size (42386 for you) and D is the dimension of the embedding (1024). You can access the representation of token k with model.biogpt.embed_tokens.weight[k]. This is the 1024-dimensional vector that directly represents the k-th token.
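
For example, a minimal sketch of this lookup for the few token indices mentioned in the question (1, 2, 6, 112), assuming the model object from the question:

# Full embedding matrix of the model, shape [42386, 1024]
emb_matrix = model.biogpt.embed_tokens.weight

# Rows for the specific token ids of interest
special_ids = [1, 2, 6, 112]
special_embs = emb_matrix[special_ids]  # shape [4, 1024]
print(special_embs.shape)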

2nd solution

You can feed the model a sequence that you create yourself, containing just the token whose representation you want. This representation corresponds to the input of the first attention layer of the model. For example, to get the representation of the token with id 5:

inp = torch.tensor([[5]], dtype=torch.long).to(model.device)  # a one-token sequence containing only token id 5
output = model(inp, output_hidden_states=True)
print(output.hidden_states[0])  # the embedding-layer output, i.e. the input to the first attention layer

These two representations are not exactly the same, because the first one represents only the token itself, while the second represents the token within its sentence, which here is a sequence of a single token. It is up to you to decide which one suits what you want to do afterwards.
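
If it helps, here is a rough sketch that computes both representations for the token ids from the question and compares them; it assumes the model object from the question and that hidden_states[0] is the embedding-layer output, as described above:

import torch

token_ids = [1, 2, 6, 112]  # the special tokens mentioned in the question

# 1st solution: read the rows straight out of the embedding matrix.
static_embs = model.biogpt.embed_tokens.weight[token_ids]  # shape [4, 1024]

# 2nd solution: run each id as its own one-token sequence and take
# hidden_states[0], the embedding-layer output fed to the first attention layer.
inp = torch.tensor([[i] for i in token_ids], dtype=torch.long).to(model.device)
with torch.no_grad():
    output = model(inp, output_hidden_states=True)
contextual_embs = output.hidden_states[0].squeeze(1)  # shape [4, 1024]

# The two generally differ, since hidden_states[0] already includes positional
# information (and any embedding scaling the model applies).
print(torch.allclose(static_embs, contextual_embs))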

huangapple
  • Published on 2023-07-12 22:09:00
  • Please keep this link when reposting: https://go.coder-hub.com/76671494.html