Huggingface Transformers (PyTorch) - Custom training loop doubles speed?
Question
I've found something quite strange when using Huggingface Transformers with a custom training loop in PyTorch.
But first, some context: I'm currently trying to fine-tune a pretrained GPT2 small (GPT2LMHeadModel; the ~170M param version) on multiple nodes, using Huggingface Accelerate. I'm using Huggingface's datasets library for training.
Of course, the first step in moving to Accelerate is to write a custom PyTorch training loop, which I did with the help of the official Huggingface tutorial. Naturally, I decided to test the model with this new training loop before adding Accelerate, to make sure it actually worked.
Here's the relevant code from my original model, as well as the corresponding code from the new training loop:
Note: BATCH_SIZE is equal to 2 in both versions. All code not shown is exactly the same between the two.
Original:
data = data['train']
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
train_args = TrainingArguments(
    output_dir=OUTPUT_DIRECTORY,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=BATCH_SIZE,
    save_steps=10_000,
    save_total_limit=1,  # How many "checkpoints" to save at a time
    prediction_loss_only=True,
    remove_unused_columns=False,
    optim="adamw_torch"
)
trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=dc,
    train_dataset=data
)
trainer.train()
Custom Train Loop:
data = data['train']
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
optimizer = AdamW(model.parameters(), lr=5e-5)
train_dl = DataLoader(
    data, shuffle=True, batch_size=BATCH_SIZE, collate_fn=dc
)
epochs = 1
training_steps = epochs * len(train_dl)
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=training_steps
)
progress_bar = tqdm(range(training_steps))
model.train()
for epoch in range(epochs):
    for batch in train_dl:
        # Run a batch through the model
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
I tested it on a single node with two GPUs, 16GB each. It worked... but suspiciously well.
- My original model averaged about 1-2 iterations/s.
- My custom loop on the other hand averaged about 3-4 iterations/s.
This is absolutely bizarre. How is it possible that simply swapping in my own training loop, which is just a few lines of code, is not only faster than the official one provided by Huggingface, but nearly TWICE as fast? Did I write the training loop incorrectly? Am I completely missing something here?
Answer 1
Score: 1
In your training loop, you call optimizer.step() right after each batch's backward pass, with no gradient accumulation.
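For reference, gradient accumulation means summing gradients over several small batches and stepping the optimizer only once. The sketch below is a toy illustration in plain Python, using a one-parameter least-squares model; it is not how Trainer implements it, and every name in it is made up for the example. It shows that accumulating over equally sized micro-batches, with each gradient scaled by 1/accum_steps, gives the same update as a single step on the combined batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd_step(w, batch, lr):
    """One plain SGD update computed on a whole batch."""
    return w - lr * grad_mse(w, batch)


def sgd_step_accumulated(w, micro_batches, lr):
    """One SGD update whose gradient is accumulated over several micro-batches."""
    accum_steps = len(micro_batches)
    grad = 0.0
    for mb in micro_batches:
        # Scale each micro-batch gradient so the sum equals the mean over all samples
        grad += grad_mse(w, mb) / accum_steps
    # The weight changes only once, after every micro-batch has been seen
    return w - lr * grad


# Hypothetical toy data: (x, y) pairs roughly following y = 2x
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
micro_batches = [data[:2], data[2:]]  # two micro-batches of size 2

w_full = sgd_step(0.0, data, lr=0.01)
w_accum = sgd_step_accumulated(0.0, micro_batches, lr=0.01)
print(w_full, w_accum)  # the two updates agree up to floating-point error
```

The accumulated version does the same arithmetic per sample; it only delays the weight update, which is why it changes the effective batch size rather than the amount of work per sample.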
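Trainer supports gradient accumulation through the gradient_accumulation_steps argument of TrainingArguments. With the default value of 1, the weights are updated after every batch, just as in your loop; with a value above 1, gradients are accumulated over several batches before each weight update. Accumulation gives a larger effective batch size, but each reported step then takes longer because it covers more batches.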
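Iterations per second are only comparable when each iteration processes the same number of samples. As a rough sketch, samples per second can be derived from the progress-bar rate as shown below. The helper names are hypothetical, and the device counts in the example are assumptions about the two runs rather than something the question confirms.

```python
def effective_batch_size(per_device_batch, num_devices=1, grad_accum_steps=1):
    """Number of samples consumed per optimizer step."""
    return per_device_batch * num_devices * grad_accum_steps


def samples_per_second(steps_per_second, per_device_batch,
                       num_devices=1, grad_accum_steps=1):
    """Convert a steps/s reading into samples/s for a fair comparison."""
    return steps_per_second * effective_batch_size(
        per_device_batch, num_devices, grad_accum_steps
    )


# Illustrative inputs only: midpoints of the ranges quoted in the question.
# ASSUMPTION: the Trainer run split each step across both GPUs,
# while the custom loop ran on a single device.
trainer_rate = samples_per_second(1.5, per_device_batch=2, num_devices=2)
custom_rate = samples_per_second(3.5, per_device_batch=2, num_devices=1)
print(trainer_rate, custom_rate)
```

Under those assumed settings the two runs would be handling a similar number of samples per second, so it may be worth checking how many GPUs each version actually used before concluding that one loop is faster.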