Recovering input IDs from input embeddings using GPT-2

Question
Suppose I have the following text
aim = 'Hello world! you are a wonderful place to be in.'
I want to use GPT2 to produce the input_ids, then produce the embeddings, and then recover the input_ids from those embeddings. To do this I do:
import torch
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
The input_ids can be defined as:
input_ids = tokenizer(aim)['input_ids']
#output: [15496, 995, 0, 345, 389, 257, 7932, 1295, 284, 307, 287, 13]
I can decode this to make sure it reproduces the aim:
tokenizer.decode(input_ids)
#output: 'Hello world! you are a wonderful place to be in.'
as expected! To produce the embeddings I convert the input_ids to a tensor:
input_ids_tensor = torch.tensor([input_ids])
I can then produce my embeddings as:
# Generate the embeddings for the input IDs
with torch.no_grad():
    model_output = model(input_ids_tensor)
    last_hidden_states = model_output.last_hidden_state

# Extract the embeddings for the input IDs from the last hidden layer
input_embeddings = last_hidden_states[0,1:-1,:]
Now as mentioned earlier, the aim is to use input_embeddings and recover the input_ids, so I do:
x = torch.unsqueeze(input_embeddings, 1) # to make the dim acceptable
with torch.no_grad():
    text = model(x.long())
    decoded_text = tokenizer.decode(text[0].argmax(dim=-1).tolist())
But doing this I get:
IndexError: index out of range in self
at the line text = model(x.long()). I wonder what I am doing wrong. How can I recover the input_ids using the embeddings I produced?
Answer 1
Score: 1
You should use GPT2LMHeadModel instead of GPT2Model, because GPT2Model doesn't have a prediction head.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Instantiate the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Set the input text
text = "Hello, how are you?"
# Tokenize the input text
input_ids = tokenizer.encode(text, return_tensors='pt')
# Use the model's forward function to obtain logits
logits = model(input_ids).logits
# Obtain the predicted token IDs by getting the argmax of the logits along the token dimension
predicted_token_ids = torch.argmax(logits, dim=-1)
# Decode the predicted token IDs back to text
output_text = tokenizer.decode(predicted_token_ids[0], skip_special_tokens=True)
# Print the output text and token IDs
print("Output text: ", output_text)
print("Output token IDs: ", predicted_token_ids.tolist())
Output:
Output text: , I about you doing
Output token IDs: [[11, 314, 546, 345, 1804, 198]]
The output text seems weird because the model only predicts the next token at step t given the tokens from step 1 to step t-1. For example,
Hello => ,
Hello, => I
Hello, how => about
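To make that alignment concrete, here is a minimal sketch. It is not part of the original answer; it reuses the same model, tokenizer and text as above, and the loop and variable names (prefix, next_token) are only illustrative. It prints each growing prefix next to the token the model predicts for the following position.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

text = "Hello, how are you?"
input_ids = tokenizer.encode(text, return_tensors='pt')

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# At each position, the argmax over the vocabulary is the model's guess for the *next* token
predicted_token_ids = logits.argmax(dim=-1)[0]

for pos in range(input_ids.shape[1]):
    prefix = tokenizer.decode(input_ids[0, :pos + 1])
    next_token = tokenizer.decode([predicted_token_ids[pos].item()])
    print(f"{prefix!r} => {next_token!r}")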
To generate text step by step, you should use the generate function: https://huggingface.co/docs/transformers/main_classes/text_generation
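For reference, a rough sketch of what that could look like with greedy decoding; the generation arguments below (max_new_tokens, do_sample, pad_token_id) are illustrative choices, not something the answer prescribes.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

text = "Hello, how are you?"
input_ids = tokenizer.encode(text, return_tensors='pt')

# Greedy decoding of a few extra tokens appended to the prompt
output_ids = model.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # silences GPT-2's missing-pad-token warning
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Unlike the single forward pass above, generate feeds the model its own prediction at each step, which is why it produces a readable continuation rather than one shifted prediction per prompt position.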