How to get the embedding of any vocabulary token in GPT?


Question

I have a GPT model:

model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device)

When I send my batch to it, I can get the logits and the hidden states:

out = model(batch["input_ids"].to(device), output_hidden_states=True, return_dict=True)
print(out.keys())
>>> odict_keys(['logits', 'past_key_values', 'hidden_states'])

The logits have a shape of

torch.Size([2, 1024, 42386]) # batch of size 2, sequence length = 1024, vocab size = 42386

This corresponds to (batch, seq_length, vocab_length). If I understand correctly, for each token in the sequence the logits are a vector of size vocab_length which, after passing through a softmax, tells the model which token from the vocabulary to use. I believe that each of these vocabulary tokens should have an embedding.
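For illustration, turning these logits into next-token probabilities is just a softmax over the vocabulary dimension; a minimal sketch, reusing the out object from above:

import torch

# softmax over the vocabulary dimension -> probabilities, shape [2, 1024, 42386]
probs = torch.softmax(out.logits, dim=-1)

# most likely vocabulary token at each position, shape [2, 1024]
predicted_ids = probs.argmax(dim=-1)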

From my previous question I found how to get the embeddings of each sequence token (shape [2, 1024, 1024] in my setting). But how can I get the embeddings of each token in the vocabulary of the model? This should be of size [2, 1024, 42386, 1024] (BioGPT has a hidden size of 1024).
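For context, these per-position sequence embeddings come from the returned hidden states; a sketch, assuming the output of the last layer is the one of interest:

# output of the last transformer layer: one 1024-dim vector per position
seq_embeddings = out.hidden_states[-1]
print(seq_embeddings.shape)  # torch.Size([2, 1024, 1024])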

I'm mainly interested in just a few special tokens (e.g., indices 1, 2, 6, 112 out of the 42386).


Answer 1

Score: 1


If I understand correctly, you want an embedding representing a single token from the vocabulary. There are two answers that I know of, depending on which embedding you want exactly.

1st solution

The first layer in the model is a torch.nn.Embedding, which under the hood is a linear layer with no bias, so it has a weight parameter of shape [V, D], where V is the vocab size (42386 for you) and D is the dimension of the embedding (1024). You can access the representation of token k with model.biogpt.embed_tokens.weight[k]. This is the 1024-dimensional vector that directly represents the k-th token.
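For the handful of token ids mentioned in the question (1, 2, 6 and 112), a minimal sketch of this lookup could look as follows (variable names are illustrative, not from the original answer):

import torch

# the special token ids of interest
token_ids = torch.tensor([1, 2, 6, 112], device=device)

# embedding matrix of shape [V, D] = [42386, 1024]
embedding_matrix = model.biogpt.embed_tokens.weight

# row k of the matrix is the static embedding of token k -> shape [4, 1024]
special_token_embeddings = embedding_matrix[token_ids]
print(special_token_embeddings.shape)  # torch.Size([4, 1024])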

2nd solution

You can feed the model a sequence containing just the token whose representation you want. This representation corresponds to the input of the first attention layer of the model. For example, to get the representation of token 5:

import torch

# a sequence consisting of the single token id 5, on the same device as the model
inp = torch.tensor([[5]], dtype=torch.long).to(device)
output = model(inp, output_hidden_states=True)
print(output.hidden_states[0])  # embedding-layer output, shape [1, 1, 1024]

These two representations are not exactly the same: the first one represents the token on its own, while the second represents the token in the context of its sentence, which here is a sequence of a single token. It is up to you to decide which one suits what you want to do afterwards.
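Along the same lines, the few special tokens from the question could be passed as a batch of single-token sequences in one call; a sketch, not taken from the original answer:

import torch

# one single-token sequence per special token id -> input of shape [4, 1]
special_ids = torch.tensor([[1], [2], [6], [112]], dtype=torch.long).to(device)
output = model(special_ids, output_hidden_states=True)

# hidden_states[0] is the input to the first attention layer, shape [4, 1, 1024]
per_token_states = output.hidden_states[0].squeeze(1)  # [4, 1024]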
