
Recovering input IDs from input embeddings using GPT-2

Question


Suppose I have the following text:

aim = 'Hello world! you are a wonderful place to be in.'

I want to use GPT-2 to produce the input_ids, then produce the embeddings, and finally recover the input_ids from those embeddings. To do this I do:

import torch
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

The input_ids can be defined as:

input_ids = tokenizer(aim)['input_ids']
#output: [15496, 995, 0, 345, 389, 257, 7932, 1295, 284, 307, 287, 13]

I can decode this to make sure it reproduces the aim:

tokenizer.decode(input_ids)
#output: 'Hello world! you are a wonderful place to be in.'

as expected. To produce the embeddings, I convert the input_ids to a tensor:

input_ids_tensor = torch.tensor([input_ids])

I can then produce my embeddings as:

# Generate the embeddings for input IDs 
with torch.no_grad():
    model_output = model(input_ids_tensor)
    last_hidden_states = model_output.last_hidden_state
    
# Extract the embeddings for the input IDs from the last hidden layer
input_embeddings = last_hidden_states[0,1:-1,:]

Now, as mentioned earlier, the aim is to use input_embeddings to recover the input_ids, so I do:

x = torch.unsqueeze(input_embeddings, 1) # to make the dim acceptable
with torch.no_grad():
    text = model(x.long())
    decoded_text = tokenizer.decode(text[0].argmax(dim=-1).tolist())

But doing this I get:

IndexError: index out of range in self

at the line text = model(x.long()). What am I doing wrong? How can I recover the input_ids using the embeddings I produced?

Answer 1

Score: 1


You should use GPT2LMHeadModel instead of GPT2Model, because GPT2Model doesn't have a prediction head.
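
To make the difference concrete, here is a minimal sketch; the shapes assume the base gpt2 checkpoint (hidden size 768, vocabulary size 50257):

import torch
from transformers import GPT2Model, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
ids = tokenizer.encode("Hello world!", return_tensors='pt')  # 3 tokens

with torch.no_grad():
    # GPT2Model stops at the hidden states; there is no vocabulary projection
    hidden = GPT2Model.from_pretrained('gpt2')(ids).last_hidden_state
    # GPT2LMHeadModel adds the LM head that maps each hidden state to vocab scores
    logits = GPT2LMHeadModel.from_pretrained('gpt2')(ids).logits

print(hidden.shape)  # torch.Size([1, 3, 768])
print(logits.shape)  # torch.Size([1, 3, 50257])

With the head in place, the full round trip from text to token IDs to predicted IDs looks like this: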

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Instantiate the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set the input text
text = "Hello, how are you?"

# Tokenize the input text
input_ids = tokenizer.encode(text, return_tensors='pt')

# Use the model's forward function to obtain logits
logits = model(input_ids).logits

# Obtain the predicted token IDs by getting the argmax of the logits along the token dimension
predicted_token_ids = torch.argmax(logits, dim=-1)

# Decode the predicted token IDs back to text
output_text = tokenizer.decode(predicted_token_ids[0], skip_special_tokens=True)

# Print the output text and token IDs
print("Output text: ", output_text)
print("Output token IDs: ", predicted_token_ids.tolist())

Output:

Output text:  , I about you doing

Output token IDs:  [[11, 314, 546, 345, 1804, 198]]

The output text looks weird because the model only predicts the token at step t given the tokens from step 1 to step t-1. For example,

Hello => ,
Hello, => I
Hello, how => about
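
To see this alignment concretely, here is a minimal sketch reusing logits and input_ids from the code above: the prediction at position t targets the token at position t+1, so dropping the last prediction and the first input token lines the two sequences up.

# Prediction at position t is for the token at position t+1, so drop the
# last prediction and the first input token to align the two sequences
predicted_next = torch.argmax(logits[:, :-1, :], dim=-1)
actual_next = input_ids[:, 1:]
print((predicted_next == actual_next).tolist())  # True wherever the greedy guess matched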

To generate text step by step, you should use the generate function: https://huggingface.co/docs/transformers/main_classes/text_generation
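
For instance, a minimal sketch with greedy decoding, reusing model, tokenizer, and input_ids from above (max_new_tokens, do_sample, and pad_token_id are standard generate arguments):

# Extend the prompt token by token instead of re-predicting the prompt itself
generated_ids = model.generate(input_ids,
                               max_new_tokens=20,
                               do_sample=False,  # greedy decoding
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))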
