
How can I fine-tune mBART-50 for machine translation in the transformers Python library so that it learns a new word?

Question


I am trying to fine-tune mBART-50 (paper, pre-trained model on Hugging Face) for machine translation in the transformers Python library. To test the fine-tuning, I am simply trying to teach mBART-50 a new word that I made up.

I use the following code. Over 95% of the code is from the Hugging Face documentation:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

print('Model loading started')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
print('Model loading done')

src_text = " billozarion "
tgt_text =  " plorization "

model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

print('Fine-tuning started')
for i in range(1000):
    #pass
    model(**model_inputs, labels=labels) # forward pass
print('Fine-tuning ended')
    
# Testing whether the model learned the new word. Translate French to English
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "fr_XX"
article_fr = src_text
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)

However, the new word wasn't learned. The output is "billozarion" instead of "plorization". Why?

I'm strictly following the Hugging Face documentation, unless I missed something. The # forward pass comment does make me concerned, as one would need a backward pass to update the gradients. Maybe this means that the documentation is incorrect; however, I can't test that hypothesis, as I don't know how to add the backward pass.


Environment used to run the code: Ubuntu 20.04.5 LTS with an NVIDIA A100 40GB GPU (I also tested with an NVIDIA T4 Tensor Core GPU), CUDA 12.0, and the following conda environment:

conda create --name mbart-python39 python=3.9
conda activate mbart-python39 
pip install transformers==4.28.1
pip install chardet==5.1.0
pip install sentencepiece==0.1.99
pip install protobuf==3.20

Answer 1

Score: 1


One could add the following to fine-tune mBART-50:

from transformers.optimization import AdamW

# Set up the optimizer and training settings
optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()

print('Fine-tuning started')
for i in range(100):
    optimizer.zero_grad()
    output = model(**model_inputs, labels=labels) # forward pass
    loss = output.loss
    loss.backward()
    optimizer.step()
print('Fine-tuning ended')
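If you want to confirm that the weights are actually being updated, you can log the loss inside the loop; on a single repeated example it should drop very quickly. A minimal sketch of the same loop with logging added (it reuses model, model_inputs, labels and optimizer from the snippet above):

for i in range(100):
    optimizer.zero_grad()
    output = model(**model_inputs, labels=labels)  # forward pass
    loss = output.loss
    loss.backward()                                # backward pass computes the gradients
    optimizer.step()                               # optimizer update of the weights
    if i % 10 == 0:
        print(f'step {i}: loss = {loss.item():.4f}')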

Full code:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from transformers.optimization import AdamW
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


print('Model loading started')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
print('Model loading done')

src_text = " billozarion "
tgt_text =  " plorizatizzzon "

model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

# Set up the optimizer and training settings
optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()

print('Fine-tuning started')
for i in range(100):
    optimizer.zero_grad()
    output = model(**model_inputs, labels=labels) # forward pass
    loss = output.loss
    loss.backward()
    optimizer.step()
print('Fine-tuning ended')
    
# translate French to English
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "fr_XX"
article_fr = src_text
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)

It outputs the correct made-up translation "plorizatizzzon".
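This is not part of the original snippet, but if you want to keep the fine-tuned weights beyond the current Python session, the usual save_pretrained / from_pretrained calls apply. A small sketch (the directory name is arbitrary):

# Save the fine-tuned model and the tokenizer next to it (directory name is arbitrary).
model.save_pretrained('mbart-finetuned-madeup-word')
tokenizer.save_pretrained('mbart-finetuned-madeup-word')

# Later, reload them with:
# model = MBartForConditionalGeneration.from_pretrained('mbart-finetuned-madeup-word')
# tokenizer = MBart50TokenizerFast.from_pretrained('mbart-finetuned-madeup-word')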

I reported the documentation issue on https://github.com/huggingface/transformers/issues/23185


https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation contains two more advanced scripts for fine-tuning mBART and T5 (thanks to sgugger for pointing me to them). Here is how to use the run_translation.py script to fine-tune mBART:

Create a new conda environment:

conda create --name mbart-source-transformers-python39 python=3.9
conda activate mbart-source-transformers-python39 
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install git+https://github.com/huggingface/transformers
pip install datasets evaluate accelerate sacrebleu
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install sentencepiece==0.1.99
pip install protobuf==3.20
pip install --force-reinstall charset-normalizer==3.1.0

Command:

python examples/pytorch/translation/run_translation.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train \
    --do_eval \
    --source_lang fr_XX \
    --target_lang en_XX \
    --source_prefix "translate French to English: " \
    --train_file finetuning-translation-train.json \
    --validation_file finetuning-translation-validation.json  \
    --test_file finetuning-translation-test.json \
    --output_dir tmp/tst-translation4 \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --do_predict \
    --predict_with_generate

(Note: the README seems to have omitted --do_predict.)

with finetuning-translation-train.json, finetuning-translation-validation.json, and finetuning-translation-test.json formatted as follows, in the JSON Lines format:

{"translation": {"en": "20 year-old male tennis player.", "fr": "Joueur de tennis de 12 ans"}}
{"translation": {"en": "2 soldiers in an old military Jeep", "fr": "2 soldats dans une vielle Jeep militaire"}}

(Note: one must use double quotes in the .json files; single quotes, e.g. 'en', will make the script crash.)
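If you prefer to generate these files from Python, the standard json module always emits double quotes, so the format requirement above is satisfied automatically. A minimal sketch that writes the two example pairs shown above to the training file:

import json

# Example sentence pairs, in the same {"en": ..., "fr": ...} layout as above.
pairs = [
    {'en': '20 year-old male tennis player.', 'fr': 'Joueur de tennis de 12 ans'},
    {'en': '2 soldiers in an old military Jeep', 'fr': '2 soldats dans une vielle Jeep militaire'},
]

# One JSON object per line (JSON Lines); json.dumps always produces double quotes.
with open('finetuning-translation-train.json', 'w', encoding='utf-8') as f:
    for pair in pairs:
        f.write(json.dumps({'translation': pair}, ensure_ascii=False) + '\n')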

I ran the code on Ubuntu 20.04.5 LTS with an NVIDIA T4 Tensor Core GPU (16GB memory) and CUDA 12.0. The mBART-50 model takes around 15GB of GPU memory.
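To translate with the model trained by the script, one can load it back from --output_dir. This assumes run_translation.py wrote the final checkpoint and tokenizer to tmp/tst-translation4 (adjust the path if your files ended up elsewhere); a minimal sketch:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned checkpoint saved by run_translation.py (path is the --output_dir used above).
model = MBartForConditionalGeneration.from_pretrained('tmp/tst-translation4')
tokenizer = MBart50TokenizerFast.from_pretrained('tmp/tst-translation4', src_lang='fr_XX')

encoded_fr = tokenizer('Joueur de tennis de 12 ans', return_tensors='pt')
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id['en_XX'])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))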
