How is the number of steps calculated in HuggingFace trainer?
Question
I have a train dataset of size 4107.
DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 4107
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 498
    })
})
In my training arguments, the batch size is 8 and number of epochs is 2.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="code_gen_epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    eval_steps=100,
    logging_steps=100,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=3.0e-4,
    # save_steps=200,
    # fp16=True,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
When I start the training, I can see that the number of steps is 128.
My assumption was that the number of steps should be approximately 4107 / 8 ≈ 512 for 1 epoch, and 512 + 512 = 1024 for 2 epochs.
I don't understand how it came to be 128.
Answer 1
Score: 5
Since you're specifying gradient_accumulation_steps=8, the effective number of steps is also divided by 8. This is because an optimizer step is not taken on every batch, but only after the gradients of a certain number of batches have been accumulated.
Hence, the resulting number of steps per epoch is 4107 instances ÷ 8 batch size ÷ 8 gradient accumulation ≈ 64, which is where the ≈ 128 steps for your 2 epochs come from. With gradient accumulation disabled (gradient_accumulation_steps=1), you would get ≈ 513 steps per epoch (4107 ÷ 8 ÷ 1), close to the 512 you expected.
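For a quick sanity check, here is a minimal sketch of that arithmetic. The variable names are my own; it only mirrors roughly how the Trainer derives its step count, and the exact rounding can vary between transformers versions:

import math

# Values from the question.
num_train_examples = 4107
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_train_epochs = 2

# The train dataloader yields ceil(4107 / 8) = 514 batches per epoch.
batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)

# An optimizer step happens only every gradient_accumulation_steps batches,
# so there are 514 // 8 = 64 update steps per epoch.
update_steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_steps = update_steps_per_epoch * num_train_epochs
print(update_steps_per_epoch, total_steps)  # 64 128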