How to get the embedding of any vocabulary token in GPT?
Question
I have a GPT model
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device)
When I send my batch to it I can get the logits and the hidden states:
out = model(batch["input_ids"].to(device), output_hidden_states=True, return_dict=True)
print(out.keys())
>>> odict_keys(['logits', 'past_key_values', 'hidden_states'])
The logits have shape
torch.Size([2, 1024, 42386]) # batch of size 2, sequence length = 1024, vocab size = 42386
corresponding to (batch, seq_length, vocab_length). If I understand correctly, for each token in the sequence the logits form a vector of size vocab_length which, after passing it through softmax, tells the model which token from the vocabulary to use. I believe that each of these vocabulary tokens should have an embedding.
From my previous question I found how to get the embeddings of each sequence token (shape [2, 1024, 1024] in my setting). But how can I get the embeddings of each token in the vocabulary of the model? This should have shape [2, 1024, 42386, 1024] (BioGPT has a hidden size of 1024).
I'm mainly interested in just a few special tokens (e.g., indices 1, 2, 6, 112 out of the 42386).
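For reference, a minimal sketch of how the per-sequence-token embeddings mentioned above can be read from the same forward pass; this assumes the last entry of hidden_states is the [2, 1024, 1024] representation meant in the previous question:
# Last hidden state: one 1024-dim vector per input token
seq_embeddings = out.hidden_states[-1]
print(seq_embeddings.shape)  # torch.Size([2, 1024, 1024])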
Answer 1
Score: 1
If I understand correctly, you want an embedding representing a single token from the vocabulary. There are two answers that I know of for that, depending on which embedding you want exactly.
1st solution
The first layer in the model is a torch.nn.Embedding, which is under the hood a linear layer with no bias, so it has a weight parameter of shape [V, D], where V is the vocab size (42386 for you) and D is the dimension of the embedding (1024). You can access the representation of a token k with model.biogpt.embed_tokens.weight[k]. This is the 1024-dimensional vector that directly represents the k-th token, as in the sketch below.
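A minimal sketch of reading those rows directly from the embedding matrix, assuming the model from the question and the special token indices 1, 2, 6, 112 mentioned there:
import torch
from transformers import BioGptForCausalLM

model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

# Input embedding matrix, shape [V, D] = [42386, 1024]
emb_matrix = model.biogpt.embed_tokens.weight

# Rows for the few vocabulary indices of interest
special_ids = torch.tensor([1, 2, 6, 112])
special_embeddings = emb_matrix[special_ids]
print(special_embeddings.shape)  # torch.Size([4, 1024])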
2nd solution
You can feed the model a sequence that contains just the token whose representation you want. This representation corresponds to the input of the first attention layer of the model. For example, to get the representation of the 5th token:
import torch

inp = torch.tensor([[5]]).long()  # a batch with a single one-token sequence
# (move inp to the model's device if the model is on GPU)
output = model(inp, output_hidden_states=True)
print(output.hidden_states[0])    # output of the embedding layer, shape [1, 1, 1024]
These two representations are not exactly the same, because the first one represents the token on its own, while the second represents the token within its sentence, here a sequence of a single token. It is up to you to decide which one suits what you want to do afterwards; the sketch below shows both side by side.
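For illustration, a small sketch comparing the two representations for one token id (k = 5 here is arbitrary); the values depend on the model, and the comparison at the end is only meant to show where each vector comes from:
import torch

k = 5
# Solution 1: row k of the input embedding matrix
direct = model.biogpt.embed_tokens.weight[k]           # shape [1024]

# Solution 2: embedding-layer output for a one-token sequence containing k
output = model(torch.tensor([[k]]).long(), output_hidden_states=True)
in_context = output.hidden_states[0][0, 0]             # shape [1024]

print(direct.shape, in_context.shape)
print(torch.allclose(direct, in_context))  # generally False, as explained above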