Why is the evaluation set draining GPU memory in PyTorch Hugging Face?
Question
I am using quite a large GPU, around 80 GB. The training epochs run fine, but for some reason when evaluating (the training and validation sets have more or less the same length), I am running out of memory and getting this error:
File "/home.../transformers/trainer_pt_utils.py", line 75, in torch_pad_and_concatenate
return torch.cat((tensor1, tensor2), dim=0)
RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total
capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by
PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to
avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
The training and validation data was created like this:

train_texts, train_labels = read_dataset('basic_train.tsv')
val_texts, val_labels = read_dataset('basic_val.tsv')

train_encodings = tokenizer(train_texts, truncation=False, padding=True)
val_encodings = tokenizer(val_texts, truncation=False, padding=True)

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        ...
        return item

train_dataset = Dataset(train_encodings, train_labels)
val_dataset = Dataset(val_encodings, val_labels)
My training code looks like this:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=5e-5,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=2e-5,
    eval_steps=100,
    save_steps=30000,
    evaluation_strategy='steps'
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
metric = load_metric('accuracy')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
def collate_fn_t5(batch):
    input_ids = torch.stack([example['input_ids'] for example in batch])
    attention_mask = torch.stack([example['attention_mask'] for example in batch])
    labels = torch.stack([example['input_ids'] for example in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # evaluation dataset
    compute_metrics=compute_metrics,
    data_collator=collate_fn_t5,
)
trainer.train()
eval_results = trainer.evaluate()
Answer 1
Score: 1
From

> RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total
> capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by PyTorch)

the numbers most probably break down like this. There is

- 79.35 GiB of total capacity

Then on the GPU:

- 36.51 GiB allocated, most probably the model loaded onto GPU RAM
- 44.82 GiB reserved, which should be the 36.51 GiB allocated plus PyTorch allocator overheads

And you need:

- 33.84 GiB for the evaluation batch
- but only 32.48 GiB is free

(A quick way to check these numbers on your own machine is sketched below.)
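These are standard torch.cuda calls (torch.cuda.mem_get_info needs a reasonably recent PyTorch), so you can print the same four quantities at any point in your script:

import torch

# Free/total memory as the CUDA driver reports it, in bytes.
free, total = torch.cuda.mem_get_info()

# Memory held by live tensors vs. memory reserved by PyTorch's caching allocator.
allocated = torch.cuda.memory_allocated()
reserved = torch.cuda.memory_reserved()

print(f"total:     {total / 2**30:.2f} GiB")
print(f"free:      {free / 2**30:.2f} GiB")
print(f"allocated: {allocated / 2**30:.2f} GiB")
print(f"reserved:  {reserved / 2**30:.2f} GiB")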
So there are a few options. You can try reducing per_device_eval_batch_size, from 7 all the way down to 1, to see what works, e.g.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    ...)
If that doesn't work, perhaps it's the default accumulation: unless eval_accumulation_steps is set, the Trainer accumulates all prediction tensors on the GPU and only moves them to the CPU after the whole evaluation loop finishes. See https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_accumulation_steps
You can try:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    ...)
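The reason this helps is that, without eval_accumulation_steps, the Trainer concatenates every batch's logits on the GPU (that is the torch_pad_and_concatenate call in your traceback). A complementary trick is to shrink the logits before they are accumulated at all, using the Trainer's preprocess_logits_for_metrics hook (available in recent transformers releases); a minimal sketch, reusing the names from the question:

import torch

def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple (logits, past_key_values, ...); keep the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Reduce (batch, seq_len, vocab_size) floats to (batch, seq_len) token ids
    # on the GPU, so only small integer tensors get accumulated.
    return logits.argmax(dim=-1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    data_collator=collate_fn_t5,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)

With this in place, compute_metrics receives token ids instead of raw logits, so the np.argmax call inside it should be removed.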
Sometimes it's also that prediction does not generate by default. I'm not sure why that would happen, but I think when it's just predicting under model.eval() or torch.no_grad() with predict_with_generate set to False, there is some extra overhead. That's just my speculation, though: https://discuss.huggingface.co/t/cuda-out-of-memory-only-during-validation-not-training/18378 (Note that predict_with_generate only exists on Seq2SeqTrainingArguments and takes effect with Seq2SeqTrainer, which the sketches below use.)
If so, you can try:
training_args = Seq2SeqTrainingArguments(  # predict_with_generate requires Seq2SeqTrainingArguments
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
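For predict_with_generate to actually take effect, the matching Seq2SeqTrainer has to be used as well, since the plain Trainer ignores the flag; a minimal sketch of that pairing, reusing the objects from the question:

from transformers import Seq2SeqTrainer

# Seq2SeqTrainer overrides the evaluation step to call model.generate()
# when predict_with_generate=True, instead of a plain forward pass.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,  # the Seq2SeqTrainingArguments above
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    data_collator=collate_fn_t5,
)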
Or you could try auto_find_batch_size (this needs the accelerate package installed), i.e.
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    predict_with_generate=True,
    auto_find_batch_size=True,
    ...)
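Under the hood, auto_find_batch_size wraps the loop with accelerate's find_executable_batch_size, which halves the batch size and retries whenever a CUDA out-of-memory error is raised. If you only want that behaviour for evaluation, you can use the utility directly; a rough sketch (run_eval is a hypothetical helper, not part of either library):

from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=8)
def run_eval(batch_size):
    # Hypothetical helper: rebuild the Trainer with the given eval batch size;
    # retried automatically with batch_size halved whenever CUDA OOMs.
    training_args.per_device_eval_batch_size = batch_size
    trainer = Trainer(
        model=model,
        args=training_args,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        data_collator=collate_fn_t5,
    )
    return trainer.evaluate()

eval_results = run_eval()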
A few more memory tricks:
# At the imports part of your code.
# See https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html
import torch
torch.cuda.set_per_process_memory_fraction(0.9)
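The error message itself also points at max_split_size_mb: with 44.82 GiB reserved but only 36.51 GiB allocated, fragmentation of the caching allocator is plausible. The allocator is configured through the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be set before the first CUDA allocation (512 here is just a starting value to experiment with):

import os

# Must be set before CUDA is initialised, i.e. before any tensor
# touches the GPU (safest: before importing torch at all).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch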
Then, if it's still not working, try the algorithmic tricks below.
From https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    fp16=True,
    optim="adafactor",
    gradient_checkpointing=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
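Finally, if it's the standalone trainer.evaluate() call after training that fails, it can also help to hand PyTorch's cached but unused blocks back to the driver first; torch.cuda.empty_cache() does not free live tensors, only the allocator's spare cache:

import gc
import torch

trainer.train()

# Collect Python garbage, then release PyTorch's unused cached blocks
# so the evaluation loop starts from a cleaner memory state.
gc.collect()
torch.cuda.empty_cache()

eval_results = trainer.evaluate()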
Comments