Hugging Face Transformers trainer: per_device_train_batch_size vs auto_find_batch_size
Question
A Hugging Face Transformers Trainer can receive a per_device_train_batch_size argument, or an auto_find_batch_size argument.

However, they seem to have different effects. One thing to consider is that per_device_train_batch_size defaults to 8: it is always set, and you can't disable it.

I have also observed that if I run into OOM errors, lowering per_device_train_batch_size can solve the issue, but auto_find_batch_size doesn't solve the problem. This is quite counter-intuitive, since it should find a batch size that is small enough (I can do it manually).

So: what does auto_find_batch_size do, exactly?
Answer 1
Score: 2
The auto_find_batch_size argument is an optional argument which can be used in addition to the per_device_train_batch_size argument.
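For concreteness, the two arguments are set together on TrainingArguments; a minimal sketch (the output_dir value is just a placeholder):

```python
from transformers import TrainingArguments

# auto_find_batch_size is a boolean toggle; per_device_train_batch_size
# supplies the starting batch size that the search will decay from.
args = TrainingArguments(
    output_dir="out",                # placeholder output directory
    per_device_train_batch_size=8,   # initial (and default) batch size
    auto_find_batch_size=True,       # halve and retry on CUDA OOM
)
```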
As you point out, lowering the batch size is one way to resolve out-of-memory errors. The auto_find_batch_size argument automates the lowering process. Enabling it will use find_executable_batch_size from accelerate, which:

> operates with exponential decay, decreasing the batch size in half after each failed run
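To make the decay concrete, here is a toy re-implementation of that retry logic (not accelerate's actual code, just a sketch of the same idea): wrap the training function in a decorator that catches OOM errors, halves the batch size, and tries again.

```python
import functools

def find_executable_batch_size_sketch(starting_batch_size=128):
    """Toy version of accelerate's decorator: halve the batch size and
    retry whenever the wrapped function raises an out-of-memory error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size >= 1:
                try:
                    return fn(batch_size, *args, **kwargs)
                except RuntimeError as e:
                    if "out of memory" not in str(e):
                        raise          # only OOM triggers the decay
                    batch_size //= 2   # exponential decay: halve and retry
            raise RuntimeError("No executable batch size found")
        return wrapper
    return decorator

attempts = []

@find_executable_batch_size_sketch(starting_batch_size=8)
def train(batch_size):
    attempts.append(batch_size)
    if batch_size > 2:                 # pretend sizes above 2 don't fit
        raise RuntimeError("CUDA out of memory")
    return batch_size
```

Calling train() here tries batch sizes 8, 4, then 2, and succeeds at 2.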
The per_device_train_batch_size is used as the initial batch size to start off with. So if you use the default of 8, it starts training with a batch size of 8 (on a single device), & if it fails, it will restart the training procedure with a batch size of 4.