Why is the evaluation set draining the memory in PyTorch Hugging Face?

Question

I am using a quite large GPU, around 80 GB. The training epochs run fine, but for some reason when evaluating (the training and validation sets have more or less the same length), I run out of memory and get this error:

File "/home.../transformers/trainer_pt_utils.py", line 75, in torch_pad_and_concatenate
return torch.cat((tensor1, tensor2), dim=0)
RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total 
capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by 
PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to 
avoid fragmentation.  See documentation for Memory Management and 
 PYTORCH_CUDA_ALLOC_CONF

The training and validation data was created like this:

train_texts, train_labels = read_dataset('basic_train.tsv') 

val_texts, val_labels = read_dataset('basic_val.tsv')  

train_encodings = tokenizer(train_texts, truncation=False, padding=True) 
val_encodings = tokenizer(val_texts, truncation=False, padding=True)

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        ...
        return item

train_dataset = Dataset(train_encodings, train_labels) 
val_dataset = Dataset(val_encodings, val_labels) 
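The elided __getitem__ (and __len__) follow the standard pattern for Hugging Face tokenizer encodings, roughly like this (a sketch of the usual implementation, not the exact code):

    def __getitem__(self, idx):
        # Wrap each tokenizer field (input_ids, attention_mask, ...) for one example.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)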

My training code looks like this:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=5e-5,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=2e-5,
    eval_steps=100,
    save_steps=30000,
    evaluation_strategy='steps',
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

def collate_fn_t5(batch):
    input_ids = torch.stack([example['input_ids'] for example in batch])
    attention_mask = torch.stack([example['attention_mask'] for example in batch])
    labels = torch.stack([example['input_ids'] for example in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # evaluation dataset
    compute_metrics=compute_metrics,
    data_collator=collate_fn_t5,
)

trainer.train()

eval_results = trainer.evaluate()

Answer 1

Score: 1

From

RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total
capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by PyTorch)

the numbers most probably break down like this:

  • 79.35 GiB total GPU capacity

Then, in GPU memory:

  • 36.51 GiB already allocated, most probably the model (and its activations) loaded onto the GPU
  • 44.82 GiB reserved, which should be the 36.51 GiB allocated plus PyTorch's caching-allocator overhead

And the evaluation step needs:

  • 33.84 GiB for the tensor it tries to allocate,
  • but only 32.48 GiB is free
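To watch these numbers yourself at runtime, PyTorch exposes the same counters (a small diagnostic sketch; the 0 assumes the job runs on GPU 0):

    import torch

    device = 0  # assuming a single-GPU job on GPU 0
    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)  # memory held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # memory held by the caching allocator
    print(f"total: {total / gib:.2f} GiB, "
          f"allocated: {allocated / gib:.2f} GiB, "
          f"reserved: {reserved / gib:.2f} GiB")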

So I guess there are a few options. You can try reducing per_device_eval_batch_size, going from 7 all the way down to 1 to see what works, e.g.

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    ...)

If that doesn't work, it may be the default prediction accumulation: the Trainer keeps every batch's predictions on the GPU until the evaluation loop finishes (torch_pad_and_concatenate in your traceback is exactly that concatenation step), and with T5 each batch of predictions is a logits tensor of shape (batch_size, sequence_length, vocab_size ≈ 32k), so they add up quickly. Setting eval_accumulation_steps=N offloads the accumulated predictions to the CPU every N steps instead; see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_accumulation_steps

You can try:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    ...)

Sometimes it's also that predict does not generate by default. I'm not sure why that would matter, but I think that when evaluation just runs a plain forward pass under model.eval() / torch.no_grad() with predict_with_generate left as False, there is some extra overhead. That's just my speculation, though: https://discuss.huggingface.co/t/cuda-out-of-memory-only-during-validation-not-training/18378

(Note that predict_with_generate is an argument of Seq2SeqTrainingArguments, not the plain TrainingArguments, so the examples below use Seq2SeqTrainingArguments; you would also swap Trainer for Seq2SeqTrainer.)

If so, you can try:

# predict_with_generate only exists on Seq2SeqTrainingArguments,
# used together with Seq2SeqTrainer.
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)

Or you could try auto_find_batch_size, i.e.

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    predict_with_generate=True,
    auto_find_batch_size=True,  # needs the accelerate package installed
    ...)

A few more memory tricks:

# At the imports part of your code.
# See https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html
import torch
torch.cuda.set_per_process_memory_fraction(0.9)
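The error message itself also suggests max_split_size_mb. You can set it through the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce fragmentation (the 512 below is just an arbitrary starting value to experiment with):

    # Must run before CUDA is initialized, e.g. at the very top of the script.
    # Equivalent shell form: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
    import os
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'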

Then if it's still not working, try the algorithmic tricks.

From https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,

    # algorithmic memory savers
    fp16=True,
    optim="adafactor",
    gradient_checkpointing=True,

    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
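One last small thing you can try (my own extra suggestion, not from the linked docs): release the allocator blocks cached during training before the standalone evaluate() call:

    trainer.train()

    # Drop cached-but-unused allocator blocks left over from training.
    # This cannot free live tensors, but it returns reserved memory to CUDA.
    import torch
    torch.cuda.empty_cache()

    eval_results = trainer.evaluate()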