Recovering input IDs from input embeddings using GPT-2

Question


Suppose I have the following text:

aim = 'Hello world! you are a wonderful place to be in.'

I want to use GPT-2 to produce the input_ids, then produce the embeddings, and then recover the input_ids from those embeddings. To do this I do:

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

The input_ids can be defined as:

input_ids = tokenizer(aim)['input_ids']
#output: [15496, 995, 0, 345, 389, 257, 7932, 1295, 284, 307, 287, 13]

I can decode this to make sure it reproduces the aim:

tokenizer.decode(input_ids)
#output: 'Hello world! you are a wonderful place to be in.'

as expected! To produce the embeddings I convert the input_ids to a tensor:

input_ids_tensor = torch.tensor([input_ids])

I can then produce my embeddings as:

# Generate the embeddings for input IDs 
with torch.no_grad():
    model_output = model(input_ids_tensor)
    last_hidden_states = model_output.last_hidden_state
    
# Extract the embeddings for the input IDs from the last hidden layer
input_embeddings = last_hidden_states[0,1:-1,:]

Now as mentioned earlier, the aim is to use input_embeddings and recover the input_ids, so I do:

x = torch.unsqueeze(input_embeddings, 1) # to make the dim acceptable
with torch.no_grad():
    text = model(x.long())
    decoded_text = tokenizer.decode(text[0].argmax(dim=-1).tolist())

But doing this I get:

IndexError: index out of range in self

at the line text = model(x.long()). What am I doing wrong? How can I recover the input_ids using the embeddings I produced?

Answer 1

Score: 1


You should use GPT2LMHeadModel instead of GPT2Model, because GPT2Model has no prediction head: it only returns hidden states, so there is nothing mapping them back to vocabulary logits.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Instantiate the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set the input text
text = "Hello, how are you?"

# Tokenize the input text
input_ids = tokenizer.encode(text, return_tensors='pt')

# Use the model's forward function to obtain logits
logits = model(input_ids).logits

# Obtain the predicted token IDs by getting the argmax of the logits along the token dimension
predicted_token_ids = torch.argmax(logits, dim=-1)

# Decode the predicted token IDs back to text
output_text = tokenizer.decode(predicted_token_ids[0], skip_special_tokens=True)

# Print the output text and token IDs
print("Output text: ", output_text)
print("Output token IDs: ", predicted_token_ids.tolist())

Output:

Output text:  , I about you doing

Output token IDs:  [[11, 314, 546, 345, 1804, 198]]

The output text seems weird because the model is only predicting the next token at step t given the tokens from step 1 to step t-1, so each prediction is shifted one position relative to the input. For example (a short sketch below makes this offset explicit):

Hello => ,
Hello, => I
Hello, how => about
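
A minimal sketch of that offset, assuming input_ids, predicted_token_ids, and tokenizer from the answer code above are still in scope:

# Pair each input token with the prediction made at the same position:
# the logits at position t are the model's guess for the token at t+1.
for inp, pred in zip(input_ids[0].tolist(), predicted_token_ids[0].tolist()):
    print(tokenizer.decode(inp), "=>", tokenizer.decode(pred))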

To generate text step by step, you should use the generate function: https://huggingface.co/docs/transformers/main_classes/text_generation
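
For instance, a minimal sketch of greedy generation, assuming the model, tokenizer, and input_ids defined above; max_new_tokens=20 is an arbitrary choice here, and pad_token_id is set to the EOS token only to avoid the usual GPT-2 warning:

# Greedily extend the prompt one token at a time.
generated_ids = model.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))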
