CPU out of memory when training a model with PyTorch Lightning


Question

I am trying to train a BERT model on my data using the `Trainer` class from `pytorch-lightning`. However, I am running into an out-of-memory exception in **CPU memory**.

Here is the code:

```python
import torch
import pytorch_lightning as pl
from transformers.data.data_collator import DataCollatorForLanguageModeling
from datasets import load_dataset

# The dataset is only 33 GB, while my GPU has 350 GB of RAM.
dataset = load_dataset('text', data_files=data_path)

class BertDataModule(pl.LightningDataModule):
    def __init__(self, dataset, train_split, batch_size, data_collator):
        super().__init__()
        self.train_dataset = None
        self.dataset = dataset
        self.collator = data_collator
        self.batch_size = batch_size
        self.train_split = train_split

    def train_dataloader(self):
        return torch.utils.data.DataLoader(self.train_dataset, batch_size=self.batch_size,
                                           collate_fn=self.collator, num_workers=30)
    # I defined similar methods for val_dataloader, test_dataloader, and predict_dataloader.

bert_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
bert_data_module = BertDataModule(dataset=dataset['train'], train_split=0.98,
                                  batch_size=32, data_collator=bert_collator)
bert_model = BertModel(...)

trainer = pl.Trainer(devices=1, max_epochs=50, logger=comment_logger, accelerator='gpu',
                     precision=16, val_check_interval=50000, callbacks=[checkpoint_callbacks])

trainer.fit(bert_model, datamodule=bert_data_module)  # CPU OOM crash happens here
```

My code crashes with an out-of-memory (OOM) error, even though my data is only 33 GB and the CPU memory of my OpenShift pod is 350 GB.

Do you have any idea what could be causing CPU memory to keep increasing during training?

Thank you very much.
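One way to pin down where host memory grows is to log the training process's resident set size (RSS) per batch. Below is a minimal, hypothetical sketch, assuming `psutil` is installed; `CpuMemoryMonitor` is not part of the original code:

```python
import os

import psutil
import pytorch_lightning as pl


class CpuMemoryMonitor(pl.Callback):
    """Log the main process's CPU RSS every `log_every_n` training batches."""

    def __init__(self, log_every_n: int = 500):
        self.log_every_n = log_every_n
        self.process = psutil.Process(os.getpid())

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.log_every_n == 0:
            # Note: this only measures the main process, not DataLoader worker processes.
            rss_gb = self.process.memory_info().rss / 1024 ** 3
            print(f"step {trainer.global_step}: CPU RSS = {rss_gb:.1f} GB")


# Usage: add it next to the existing callbacks, e.g.
# trainer = pl.Trainer(..., callbacks=[checkpoint_callbacks, CpuMemoryMonitor()])
```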




# Answer 1
**Score:** 1

During training, you need to store the model, the gradients for all parameters, and all the activations from the forward pass, so it can take quite a lot of memory if you have a big model, a big batch, and each sample is itself large. Also, data usually takes much more space in RAM than on disk, since it is typically stored compressed on the hard disk. However, memory consumption should not keep increasing during training; it should stay roughly constant. When exactly does it crash?
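As a rough back-of-the-envelope illustration (using a hypothetical BERT-base-sized model and plain Adam, not necessarily the asker's actual setup), the fixed part of that footprint can be estimated like this:

```python
# Memory for weights, gradients and Adam optimizer states of a ~110M-parameter
# model in fp32 (hypothetical numbers for illustration).
n_params = 110_000_000          # roughly BERT-base
bytes_per_param = 4             # fp32
weights = n_params * bytes_per_param
grads = n_params * bytes_per_param
adam_states = 2 * n_params * bytes_per_param   # exp_avg + exp_avg_sq

total_gb = (weights + grads + adam_states) / 1024 ** 3
print(f"~{total_gb:.1f} GB before counting activations")   # ~1.6 GB
```

Activations come on top of that and scale with the batch size and sequence length.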

Also, is it the GPU or the CPU that runs out of memory? Your title says CPU, but your post mentions a 350 GB GPU (I'm rather skeptical, since to my knowledge no GPU with that much memory currently exists). If it crashes on the CPU side, it means you simply can't load the entire dataset into RAM. If it crashes on the GPU side, then your batch plus model can't fit on your GPU during training.
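If it is indeed host RAM that fills up, one possible mitigation is to avoid materializing the corpus in memory at all. A minimal sketch, assuming the Hugging Face `datasets` "text" loader and a placeholder file path:

```python
from datasets import load_dataset

data_files = "path/to/corpus/*.txt"   # placeholder, not from the original post

# The default Arrow-backed dataset is memory-mapped from disk; keep_in_memory=False
# ensures the whole table is not pulled into RAM.
dataset = load_dataset("text", data_files=data_files, keep_in_memory=False)

# Alternatively, stream the corpus so nothing is fully materialized:
streamed = load_dataset("text", data_files=data_files, streaming=True)
```

A streamed dataset is an `IterableDataset`, so shuffling works through a buffer and there is no random access, but nothing larger than that buffer is ever held in RAM.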



