Why is the evaluation set draining the memory in PyTorch Hugging Face?

Question

I am using a quite large GPU, around 80 GB. The training epochs run fine, but for some reason when evaluating (the training and validation sets have more or less the same length), I run out of memory and get this error:

File "/home.../transformers/trainer_pt_utils.py", line 75, in torch_pad_and_concatenate
return torch.cat((tensor1, tensor2), dim=0)
RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total 
capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by 
PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to 
avoid fragmentation.  See documentation for Memory Management and 
 PYTORCH_CUDA_ALLOC_CONF

The training and validation data was created like this:

train_texts, train_labels = read_dataset('basic_train.tsv') 

val_texts, val_labels = read_dataset('basic_val.tsv')  

train_encodings = tokenizer(train_texts, truncation=False, padding=True) 
val_encodings = tokenizer(val_texts, truncation=False, padding=True)

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        ...
        return item

train_dataset = Dataset(train_encodings, train_labels) 
val_dataset = Dataset(val_encodings, val_labels) 
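The elided __getitem__ (and __len__) follow the standard pattern for Hugging Face tokenizer encodings, roughly like this (a sketch of the usual implementation, not the exact code):

    def __getitem__(self, idx):
        # Wrap each tokenizer field (input_ids, attention_mask, ...) for one example.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)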

My training code looks like this:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=5e-5,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=2e-5,
    eval_steps=100,
    save_steps=30000,
    evaluation_strategy='steps',
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

def collate_fn_t5(batch):
    input_ids = torch.stack([example['input_ids'] for example in batch])
    attention_mask = torch.stack([example['attention_mask'] for example in batch])
    labels = torch.stack([example['input_ids'] for example in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # evaluation dataset
    compute_metrics=compute_metrics,
    data_collator=collate_fn_t5,
)

trainer.train()

eval_results = trainer.evaluate()

Answer 1

Score: 1

From

RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total
capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by PyTorch)

the numbers most probably break down like this:

  • 79.35 GiB total GPU capacity

Then, in GPU memory:

  • 36.51 GiB already allocated, most probably the model (and its activations) loaded onto the GPU
  • 44.82 GiB reserved, which should be the 36.51 GiB allocated plus PyTorch's caching-allocator overhead

And the evaluation step needs:

  • 33.84 GiB for the tensor it tries to allocate,
  • but only 32.48 GiB is free
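To watch these numbers yourself at runtime, PyTorch exposes the same counters (a small diagnostic sketch; the 0 assumes the job runs on GPU 0):

    import torch

    device = 0  # assuming a single-GPU job on GPU 0
    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)  # memory held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # memory held by the caching allocator
    print(f"total: {total / gib:.2f} GiB, "
          f"allocated: {allocated / gib:.2f} GiB, "
          f"reserved: {reserved / gib:.2f} GiB")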

So I guess there are a few options. You can try reducing per_device_eval_batch_size, going from 7 all the way down to 1 to see what works, e.g.

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    ...)

If that doesn't work, it may be the default prediction accumulation: the Trainer keeps every batch's predictions on the GPU until the evaluation loop finishes (torch_pad_and_concatenate in your traceback is exactly that concatenation step), and with T5 each batch of predictions is a logits tensor of shape (batch_size, sequence_length, vocab_size ≈ 32k), so they add up quickly. Setting eval_accumulation_steps=N offloads the accumulated predictions to the CPU every N steps instead; see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_accumulation_steps

You can try:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    ...)

Sometimes it's also that predict does not generate by default. I'm not sure why that would matter, but I think that when evaluation just runs a plain forward pass under model.eval() / torch.no_grad() with predict_with_generate left as False, there is some extra overhead. That's just my speculation, though: https://discuss.huggingface.co/t/cuda-out-of-memory-only-during-validation-not-training/18378

(Note that predict_with_generate is an argument of Seq2SeqTrainingArguments, not the plain TrainingArguments, so the examples below use Seq2SeqTrainingArguments; you would also swap Trainer for Seq2SeqTrainer.)

If so, you can try:

# predict_with_generate only exists on Seq2SeqTrainingArguments,
# used together with Seq2SeqTrainer.
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)

Or you could try auto_find_batch_size, i.e.

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    predict_with_generate=True,
    auto_find_batch_size=True,  # needs the accelerate package installed
    ...)

A few more memory tricks:

# At the imports part of your code.
# See https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html
import torch
torch.cuda.set_per_process_memory_fraction(0.9)
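The error message itself also suggests max_split_size_mb. You can set it through the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce fragmentation (the 512 below is just an arbitrary starting value to experiment with):

    # Must run before CUDA is initialized, e.g. at the very top of the script.
    # Equivalent shell form: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
    import os
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'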

Then if it's still not working, try the algorithmic tricks.

From https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,

    # algorithmic memory savers
    fp16=True,
    optim="adafactor",
    gradient_checkpointing=True,

    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
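One last small thing you can try (my own extra suggestion, not from the linked docs): release the allocator blocks cached during training before the standalone evaluate() call:

    trainer.train()

    # Drop cached-but-unused allocator blocks left over from training.
    # This cannot free live tensors, but it returns reserved memory to CUDA.
    import torch
    torch.cuda.empty_cache()

    eval_results = trainer.evaluate()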