Huggingface GPT2 loss understanding

Question

(Also posted here https://discuss.huggingface.co/t/newbie-understanding-gpt2-loss/33590)

I am stuck on understanding the GPT2 loss. I want to give the model labels that match exactly what it will generate, so that I can see the loss go to zero.

I have an input text, `input_text = "Welcome to New York"`. The current model predicts the next word as `City`. The loss is never zero if I give `input_text` as the label. How do I simulate giving the label "Welcome to New York City" so that the internal neural net (irrespective of the model) gives a loss of zero or close to it?

To explain what I mean in more detail, here is the snippet.

Note: I have read in the forum and the documentation that the labels can be the same as the input text, that the model shifts the labels left, and that the loss is not calculated for the last token. Even so, I would expect the loss to become zero, and it does not.

> Labels for language modeling. Note that the labels are shifted inside the model,
> i.e. you can set `labels = input_ids`....

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name, model_max_length=1024, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token  # == <|endoftext|> = 50256
model = GPT2LMHeadModel.from_pretrained(model_name)

batch_size = 5
input_text = "<|endoftext|> Welcome to New York"
target_text = "Welcome to New York City"

# encode the inputs
encoding = tokenizer(input_text, padding=True, max_length=batch_size, truncation=True, return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# encode the targets
target_encoding = tokenizer(target_text, padding=True, max_length=batch_size, truncation=True, return_tensors="pt")
labels = target_encoding.input_ids
# replace padding token ids of the labels by -100 so they are ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100  # in our case there is no padding
print(f"input_ids={input_ids}")
print(f"attention_mask={attention_mask}")  # all ones
print(f"labels ={labels}")
# forward pass
outputs = model(input_ids=input_ids, labels=labels)
print(f"Model Loss {outputs.loss}")
# Test the model to check what it predicts next
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=1)
answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(f"Result '{answer}'")
```

Output

```
input_ids=tensor([[50256, 19134,   284,   968,  1971]]) # not sure what the EOS token (50256) in the input does to the model
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]]) # 2254 = City, which is what the model should predict
Model Loss 8.248174667358398
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result '<|endoftext|> Welcome to New York City'
```
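One way to read the tensors above is to list which (context, target) pairs the loss actually scores. The following is a minimal sketch in plain Python, assuming the behavior described in the quoted documentation: the logits at position i are compared with the label at position i + 1, and labels equal to -100 are skipped. The helper `scored_pairs` and the third, fully masked setup are hypothetical illustrations, not part of the `transformers` API.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def scored_pairs(input_ids, labels):
    """List (last context token, target token) pairs that contribute to the loss.

    The prediction at position i depends on input_ids[0..i]; only the last
    context token is shown to keep the output short.
    """
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]  # internal shift: position i is scored against label i + 1
        if target != IGNORE_INDEX:
            pairs.append((input_ids[i], target))
    return pairs


# First experiment: the labels were already shifted by hand.
pre_shifted = scored_pairs([50256, 19134, 284, 968, 1971],
                           [14618, 284, 968, 1971, 2254])

# Usual setup: the labels are identical to input_ids.
same_as_input = scored_pairs([14618, 284, 968, 1971],
                             [14618, 284, 968, 1971])

# Hypothetical setup: full sentence as input, every label masked except City (2254).
only_city = scored_pairs([14618, 284, 968, 1971, 2254],
                         [-100, -100, -100, -100, 2254])

print(pre_shifted)
print(same_as_input)
print(only_city)
```

Under these assumptions, the hand-shifted labels end up shifted twice, so each position is scored against the token two steps ahead, which is consistent with the much higher loss. With `labels = input_ids` for "Welcome to New York", `City` (2254) is never a target. Masking every label except the last one would isolate the loss for `City` alone.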

When I do it the usual way, with the labels equal to the input,

```python
input_text = "Welcome to New York"
target_text = input_text
```

I get a loss of about 3.26

```
input_ids=tensor([[14618,   284,   968,  1971]]) # 1971 = York
attention_mask=tensor([[1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971]])
Model Loss 3.2614505290985107
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'
```

Is it that `outputs = model(input_ids=input_ids, labels=labels)` is generating more than one token?

Update

Based on the answer by Jindfitch. I am putting this here because the SO moderators deleted it when I tried to add it as an answer.

> You can try to fine-tune the model to be absolutely sure that City will follow with 100% probability

I trained GPT2 on this particular text (training only the last 2 layers and freezing the others), took the checkpoint with the lowest loss, and tested again. Sure enough, the loss was much lower: Model Loss 0.01076329406350851

For anyone else who would like to follow along, the training code is linked below.

Note: with such a small text and the way I have set it up, I am not fully sure the training is proper, as the training loss jumped around a bit (that is, it increased after some epochs, in this case Epoch 8).

```
2023-03-12 16:03:20,579 [INFO] Epoch 7 complete. Loss: 0.18975284695625305 saving ./test/gpt2-epoch-8-2023-03-12 16:02:19.289492
2023-03-12 16:03:20,985 [INFO] Epoch 9 of 10
2023-03-12 16:03:27,655 [INFO] Epoch 8 complete. Loss: 0.3775772750377655 saving ./test/gpt2-epoch-9-2023-03-12 16:02:19.289492
2023-03-12 16:03:27,655 [INFO] Epoch 10 of 10
2023-03-12 16:03:34,140 [INFO] Epoch 9 complete. Loss: 6.827305332990363e-05 saving ./test/gpt2-epoch-10-2023-03-12 16:02:19.289492
```

Training script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/gpt2_train_model.py

Training Output log https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/training/training_2023-03-12%2016%3A02%3A19.289492.log

Training data - "Welcome to New York City " (with a space at the end)
https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/data/small.txt

Eval script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/older/gpt2_loss_learn.py

I removed the token corresponding to 'City' from `input_ids` before passing them to `generate`:

```python
# remove the last token from the input ids as well as the attention mask
input_ids = input_ids[:, :-1]  # input_text = "Welcome to New York"
attention_mask = attention_mask[:, :-1]
print(f"input_ids={input_ids}")
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=1)
```

Eval script output

```
python3 ./older/gpt2_loss_learn.py
input_ids=tensor([[14618,   284,   968,  1971,  2254]])
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]])
Model Loss 0.01076329406350851
input_ids=tensor([[14618,   284,   968,  1971]])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'
```

A much more illustrative example https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/LLM_Loss_Understanding.ipynb

Answer 1

Score: 4

The default loss function is the negative log-likelihood (cross-entropy). The actual model output is not the token `City` but a categorical distribution over the entire vocabulary of roughly 50k tokens. Depending on the generation strategy, you either sample from this distribution or take the most probable token.
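To make the difference between a distribution and a single token concrete, here is a small sketch with a made-up four-token vocabulary and made-up logits. Both are assumptions for illustration only; GPT-2 itself produces one such distribution over its full vocabulary at every position.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, target_index):
    """Negative log-likelihood of one target token under the softmax distribution."""
    return -math.log(softmax(logits)[target_index])


vocab = ["City", "Times", "State", "is"]  # hypothetical toy vocabulary
logits = [2.0, 1.0, 0.5, 0.0]             # made-up scores for the next token

probs = softmax(logits)

# Greedy decoding picks the most probable token.
greedy = vocab[max(range(len(vocab)), key=lambda i: probs[i])]

# Sampling draws a token in proportion to its probability.
sampled = random.choices(vocab, weights=probs, k=1)[0]

# The loss for the target "City" is positive although greedy decoding selects it.
loss_city = nll(logits, vocab.index("City"))

print(probs, greedy, sampled, loss_city)
```

Even though greedy decoding returns `City` here, the loss for `City` stays well above zero because the other tokens keep part of the probability mass.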

The target token, even when it is the most probable one (as `City` apparently is here), gets only part of the probability mass, and its loss is minus the logarithm of that probability. The reported loss is the average of this value over all predicted positions. A loss close to zero would mean that every target token gets a probability close to one. However, the distribution also assigns mass to many plausible but less likely follow-ups. A loss of 3.26 corresponds to a probability of exp(-3.26), approximately 3.8%. That seems small, but with a 50k vocabulary it is roughly 2000 times more probable than a uniform random guess.
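The arithmetic can be checked directly. This sketch assumes the GPT-2 vocabulary size of 50257 and plugs in the three loss values reported in the question; the helper name `loss_to_probability` is made up for this example.

```python
import math

VOCAB_SIZE = 50257  # GPT-2 vocabulary size (assumed here)


def loss_to_probability(loss):
    """Convert a mean negative log-likelihood into a per-token probability.

    Because the loss is an average over tokens, the result is the
    geometric mean of the probabilities assigned to the targets.
    """
    return math.exp(-loss)


uniform_probability = 1.0 / VOCAB_SIZE
uniform_loss = math.log(VOCAB_SIZE)  # loss of a model that guesses uniformly

p_usual = loss_to_probability(3.2614505290985107)       # labels = input_ids
p_shifted = loss_to_probability(8.248174667358398)      # labels shifted by hand
p_finetuned = loss_to_probability(0.01076329406350851)  # after fine-tuning

ratio_vs_uniform = p_usual / uniform_probability

print(f"usual:      {p_usual:.4f}")
print(f"shifted:    {p_shifted:.6f}")
print(f"fine-tuned: {p_finetuned:.4f}")
print(f"uniform loss: {uniform_loss:.2f}, ratio vs uniform: {ratio_vs_uniform:.0f}")
```

Seen this way, even the loss of about 8.25 is still below the uniform-guess loss of roughly 10.8, and the fine-tuned loss of about 0.011 corresponds to a per-token probability of about 99%.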

You can try to fine-tune the model to be absolutely sure that `City` will follow with 100% probability, but doing so would probably break its other language modeling capabilities.
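As a toy illustration of what driving the loss toward zero does to the rest of the distribution, the sketch below runs plain gradient descent on four free logits instead of on a real network. The learning rate, step count, and vocabulary size are arbitrary assumptions, and this is not the training script linked in the question.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


TARGET = 0            # index of the token we always want, e.g. "City"
LEARNING_RATE = 1.0   # arbitrary choice for this toy example
STEPS = 200

logits = [0.0, 0.0, 0.0, 0.0]  # start from a uniform distribution over 4 tokens
loss_history = []

for _ in range(STEPS):
    probs = softmax(logits)
    loss_history.append(-math.log(probs[TARGET]))
    # Gradient of the negative log-likelihood w.r.t. the logits: probs - one_hot(target).
    grads = [p - (1.0 if i == TARGET else 0.0) for i, p in enumerate(probs)]
    logits = [z - LEARNING_RATE * g for z, g in zip(logits, grads)]

final_probs = softmax(logits)

print(f"first loss: {loss_history[0]:.4f}, last loss: {loss_history[-1]:.4f}")
print(f"final probabilities: {final_probs}")
```

The loss approaches zero only as the target's probability approaches one, which pushes every alternative toward zero. In a real language model those alternatives are the other plausible continuations.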

huangapple
  • Posted on 2023-03-12 10:34:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75710776.html