CPU Out of memory when training a model with pytorch lightning
Question
I am trying to train a BERT model on my data using the `Trainer` class from `pytorch-lightning`. However, I am running into an out-of-memory exception in **CPU memory**.
Here is the code:
```python
import torch
import pytorch_lightning as pl
from transformers.data.data_collator import DataCollatorForLanguageModeling
from datasets import load_dataset

dataset = load_dataset('text', data_path)  # The dataset is only 33GB, while my GPU has 350GB of RAM.


class BertDataModule(pl.LightningDataModule):
    def __init__(self, dataset, train_split, batch_size, data_collator):
        super().__init__()
        self.train_dataset = None
        self.dataset = dataset
        self.collator = data_collator
        self.batch_size = batch_size
        self.train_split = train_split

    def train_dataloader(self):
        return torch.utils.data.DataLoader(self.train_dataset, batch_size=self.batch_size,
                                           collate_fn=self.collator, num_workers=30)

# I defined similar methods for val_dataloader, test_dataloader, and predict_dataloader.

bert_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
bert_data_module = BertDataModule(dataset=my_dataset['train'], train_split=0.98,
                                  batch_size=32, data_collator=bert_collator)
bert_model = BertModel(...)

trainer = pl.Trainer(devices=1, max_epochs=50, logger=comment_logger, accelerator='gpu',
                     precision=16, val_check_interval=50000, callbacks=[checkpoint_callbacks])

trainer.fit(bert_model, datamodule=bert_data_module)  # Crash due to CPU OOM here
```
My code crashes with an out-of-memory (OOM) error, even though my data is only 33GB and the CPU memory of my OpenShift pod is 350GB.
Do you have any idea why the CPU memory keeps increasing during training?
Thank you very much.
# Answer 1
**Score**: 1
During training you need to store the model, all the gradients for all parameters, and all the activations from the forward pass, so it can take quite a lot of memory if you have a big model, a big batch, and large individual samples. Data also usually takes much more space in RAM than on disk, where it is typically stored compressed. However, memory consumption should not keep increasing during training; it should stay roughly constant. When exactly does it crash?
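Not part of the original answer, but one way to find out when the growth happens is to log the training process's resident CPU memory from a small Lightning callback. The sketch below is only illustrative: it assumes `psutil` is installed, and `CpuMemoryLogger` / `every_n_batches` are made-up names for this example, not an existing API.
```python
import os

import psutil
import pytorch_lightning as pl


class CpuMemoryLogger(pl.Callback):
    """Print the resident set size (RSS) of the main process every N training batches."""

    def __init__(self, every_n_batches: int = 500):
        self.every_n_batches = every_n_batches
        self.process = psutil.Process(os.getpid())

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0:
            rss_gb = self.process.memory_info().rss / 1024 ** 3
            # If this value climbs steadily across batches, the growth is in host (CPU) memory.
            print(f"batch {batch_idx}: CPU RSS = {rss_gb:.1f} GB")


# Hypothetical usage with the Trainer from the question:
# trainer = pl.Trainer(..., callbacks=[checkpoint_callbacks, CpuMemoryLogger()])
```
Note that with `num_workers=30` each DataLoader worker is a separate process, so memory held by the workers does not show up in the main process's RSS; summing `memory_info().rss` over `self.process.children(recursive=True)` as well would give a fuller picture.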
Also, do you have a GPU or a CPU? Your title says CPU, but your post mentions a 350GB GPU (by the way, I'm rather skeptical, since to my knowledge no GPU with that much memory currently exists). If it crashes on the CPU side, it means you simply can't load the entire dataset into RAM. If it crashes on the GPU, then your batch plus model can't fit on your GPU during training.
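Not something the answer prescribes, but if the crash is on the CPU side, one way to test the "dataset does not fit in RAM" hypothesis is to stream the text corpus instead of materialising it. The sketch below reuses the question's `data_path` and `bert_tokenizer` names and assumes a reasonably recent Hugging Face `datasets` version; exact arguments may need adjusting.
```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that reads the text files lazily,
# so the 33 GB corpus is never held in CPU RAM all at once.
streamed = load_dataset("text", data_files=data_path, streaming=True, split="train")


def tokenize(batch):
    # Tokenization also happens on the fly, batch by batch.
    return bert_tokenizer(batch["text"], truncation=True, max_length=512)


streamed = streamed.map(tokenize, batched=True, remove_columns=["text"])
```
If CPU memory stays flat with the streamed dataset, the growth was coming from the data pipeline; if it still climbs, the cause is elsewhere in the training loop.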