How can I fine-tune mBART-50 for machine translation in the transformers Python library so that it learns a new word?

Question

I am trying to fine-tune mBART-50 (paper, pre-trained model on Hugging Face) for machine translation in the transformers Python library. To test the fine-tuning, I simply want to teach mBART-50 a new word that I made up.

I use the following code. Over 95% of the code is from the Hugging Face documentation:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

print('Model loading started')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
print('Model loading done')

src_text = " billozarion "
tgt_text =  " plorization "

model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

print('Fine-tuning started')
for i in range(1000):
    #pass
    model(**model_inputs, labels=labels) # forward pass
print('Fine-tuning ended')
    
# Testing whether the model learned the new word. Translate French to English
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "fr_XX"
article_fr = src_text
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)

However, the new word wasn't learned. The output is "billozarion" instead of "plorization". Why?

I'm strictly following the Hugging Face documentation, unless I missed something. The # forward pass comment does concern me, since a backward pass would be needed to compute the gradients and update the weights. Maybe this means the documentation is incorrect; however, I can't test that hypothesis because I don't know how to add the backward pass.


Environment that I used to run the code: Ubuntu 20.04.5 LTS with an NVIDIA A100 40GB GPU (I also tested with an NVIDIA T4 Tensor Core GPU) and CUDA 12.0 with the following conda environment:

conda create --name mbart-python39 python=3.9
conda activate mbart-python39 
pip install transformers==4.28.1
pip install chardet==5.1.0
pip install sentencepiece==0.1.99
pip install protobuf==3.20

Answer 1

Score: 1

One could add the following to fine-tune mBART-50:

from transformers.optimization import AdamW

# Set up the optimizer and training settings
optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()

print('Fine-tuning started')
for i in range(100):
    optimizer.zero_grad()
    output = model(**model_inputs, labels=labels) # forward pass
    loss = output.loss
    loss.backward()
    optimizer.step()
print('Fine-tuning ended')
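
If you want to check that the loss actually decreases on this toy example, you can print it during training. This is not part of the original answer; it is a small variant of the loop above, assuming the model, model_inputs and labels objects defined earlier are still in scope:

from transformers.optimization import AdamW

# Same optimizer and training mode as in the snippet above
optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()

for i in range(100):
    optimizer.zero_grad()
    output = model(**model_inputs, labels=labels)  # forward pass
    loss = output.loss
    loss.backward()                                # backward pass: computes the gradients
    optimizer.step()                               # updates the weights
    if i % 10 == 0:
        # On a single training pair the loss should shrink quickly towards 0
        print(f'step {i}: loss = {loss.item():.4f}')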

Full code:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from transformers.optimization import AdamW
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


print('Model loading started')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
print('Model loading done')

src_text = " billozarion "
tgt_text =  " plorizatizzzon "

model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

# Set up the optimizer and training settings
optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()

print('Fine-tuning started')
for i in range(100):
    optimizer.zero_grad()
    output = model(**model_inputs, labels=labels) # forward pass
    loss = output.loss
    loss.backward()
    optimizer.step()
print('Fine-tuning ended')
    
# translate French to English
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "fr_XX"
article_fr = src_text
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)

It outputs the correct made-up translation "plorizatizzzon".
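
If you want the learned word to persist beyond the current Python session, the fine-tuned weights can be written to disk and reloaded later. This is not part of the original answer; it is a minimal sketch, assuming the model and tokenizer objects from the script above are still in scope, and "mbart50-finetuned" is an arbitrary directory name:

# Save the fine-tuned weights and the tokenizer to a local directory
model.save_pretrained("mbart50-finetuned")
tokenizer.save_pretrained("mbart50-finetuned")

# Later (e.g. in a new session), reload them and translate again
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("mbart50-finetuned")
tokenizer = MBart50TokenizerFast.from_pretrained("mbart50-finetuned")
tokenizer.src_lang = "fr_XX"
encoded_fr = tokenizer(" billozarion ", return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))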

I reported the documentation issue on https://github.com/huggingface/transformers/issues/23185


https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation contains two more advanced scripts to fine-tune mBART and T5 (thanks sgugger for pointing me to it). Here is how to use the script to fine-tune mBART:

Create a new conda environment:

conda create --name mbart-source-transformers-python39 python=3.9
conda activate mbart-source-transformers-python39 
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install git+https://github.com/huggingface/transformers
pip install datasets evaluate accelerate sacrebleu
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install sentencepiece==0.1.99
pip install protobuf==3.20
pip install --force-reinstall charset-normalizer==3.1.0

Command:

python examples/pytorch/translation/run_translation.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train \
    --do_eval \
    --source_lang fr_XX \
    --target_lang en_XX \
    --source_prefix "translate French to English: " \
    --train_file finetuning-translation-train.json \
    --validation_file finetuning-translation-validation.json  \
    --test_file finetuning-translation-test.json \
    --output_dir tmp/tst-translation4 \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --do_predict \
    --predict_with_generate

(Note: the readme seems to omit --do_predict)

with finetuning-translation-train.json, finetuning-translation-validation.json and finetuning-translation-test.json formatted as follows, using the JSON Lines format:

{"translation": {"en": "20 year-old male tennis player.", "fr": "Joueur de tennis de 12 ans"}}
{"translation": {"en": "2 soldiers in an old military Jeep", "fr": "2 soldats dans une vielle Jeep militaire"}}

(Note: one must use double quotes in the .json files. Single quotes e.g. 'en' will make the script crash.)
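
If you prefer to generate these files programmatically, Python's json module always emits double quotes, which avoids the crash mentioned above. A minimal sketch (the sentence pairs below are just the placeholder examples from above; replace them with your own parallel data):

import json

# Placeholder sentence pairs; substitute your own training data
pairs = [
    {"en": "20 year-old male tennis player.", "fr": "Joueur de tennis de 12 ans"},
    {"en": "2 soldiers in an old military Jeep", "fr": "2 soldats dans une vielle Jeep militaire"},
]

# Write one JSON object per line (JSON Lines); json.dumps guarantees double quotes
with open("finetuning-translation-train.json", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps({"translation": pair}, ensure_ascii=False) + "\n")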

I ran the code on Ubuntu 20.04.5 LTS with an NVIDIA T4 Tensor Core GPU (16GB memory) and CUDA 12.0. The mBART-50 model takes around 15GB of GPU memory.
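
Once training finishes, the script should have written the final model and tokenizer to the --output_dir (assuming training completed normally), so the checkpoint can be loaded like any other local Hugging Face model. A minimal sketch, using the tmp/tst-translation4 directory from the command above:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the checkpoint that run_translation.py saved to --output_dir
model = MBartForConditionalGeneration.from_pretrained("tmp/tst-translation4")
tokenizer = MBart50TokenizerFast.from_pretrained("tmp/tst-translation4")
tokenizer.src_lang = "fr_XX"

encoded_fr = tokenizer("Joueur de tennis de 12 ans", return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))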
