CPU Out of memory when training a model with pytorch lightning
Question
I am trying to train a BERT model on my data using the `Trainer` class from `pytorch-lightning`. However, I am running into an out-of-memory exception in **CPU memory**.
Here is the code:
```python
import torch
import pytorch_lightning as pl
from transformers.data.data_collator import DataCollatorForLanguageModeling
from datasets import load_dataset

dataset = load_dataset('text', data_path)  # The dataset is only 33GB, while my GPU has 350GB of RAM.


class BertDataModule(pl.LightningDataModule):
    def __init__(self, dataset, train_split, batch_size, data_collator):
        super().__init__()
        self.train_dataset = None
        self.dataset = dataset
        self.collator = data_collator
        self.batch_size = batch_size
        self.train_split = train_split

    def train_dataloader(self):
        return torch.utils.data.DataLoader(self.train_dataset, batch_size=self.batch_size,
                                           collate_fn=self.collator, num_workers=30)

# I defined similar methods for val_dataloader, test_dataloader, and predict_dataloader.

bert_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
bert_data_module = BertDataModule(dataset=my_dataset['train'], train_split=0.98,
                                  batch_size=32, data_collator=bert_collator)
bert_model = BertModel(...)

trainer = pl.Trainer(devices=1, max_epochs=50, logger=comment_logger, accelerator='gpu',
                     precision=16, val_check_interval=50000, callbacks=[checkpoint_callbacks])

trainer.fit(bert_model, datamodule=bert_data_module)  # Crash due to CPU OOM here
```
My code crashes with an out-of-memory (OOM) error, even though my data is only 33GB and the CPU memory of my OpenShift pod is 350GB.
Do you have any idea why the CPU memory keeps increasing during training?
Thank you very much.
# Answer 1
**Score**: 1
During training you need to store the model, all the gradients for all parameters, and all the activations from the forward pass, so it can take quite a lot of memory if you have a big model, a big batch, and large individual samples. Data also usually takes much more space in RAM than on disk, where it is typically stored compressed. However, memory consumption should not keep increasing during training; it should stay roughly constant. When exactly does it crash?
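Not part of the original answer, but one way to find out when the growth happens is to log the training process's resident CPU memory from a small Lightning callback. The sketch below is only illustrative: it assumes `psutil` is installed, and `CpuMemoryLogger` / `every_n_batches` are made-up names for this example, not an existing API.
```python
import os

import psutil
import pytorch_lightning as pl


class CpuMemoryLogger(pl.Callback):
    """Print the resident set size (RSS) of the main process every N training batches."""

    def __init__(self, every_n_batches: int = 500):
        self.every_n_batches = every_n_batches
        self.process = psutil.Process(os.getpid())

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0:
            rss_gb = self.process.memory_info().rss / 1024 ** 3
            # If this value climbs steadily across batches, the growth is in host (CPU) memory.
            print(f"batch {batch_idx}: CPU RSS = {rss_gb:.1f} GB")


# Hypothetical usage with the Trainer from the question:
# trainer = pl.Trainer(..., callbacks=[checkpoint_callbacks, CpuMemoryLogger()])
```
Note that with `num_workers=30` each DataLoader worker is a separate process, so memory held by the workers does not show up in the main process's RSS; summing `memory_info().rss` over `self.process.children(recursive=True)` as well would give a fuller picture.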
Also, do you have a GPU or a CPU? Your title says CPU, but your post mentions a 350GB GPU (by the way, I'm rather skeptical, since to my knowledge no GPU with that much memory currently exists). If it crashes on the CPU side, it means you simply can't load the entire dataset into RAM. If it crashes on the GPU, then your batch plus model can't fit on your GPU during training.
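Not something the answer prescribes, but if the crash is on the CPU side, one way to test the "dataset does not fit in RAM" hypothesis is to stream the text corpus instead of materialising it. The sketch below reuses the question's `data_path` and `bert_tokenizer` names and assumes a reasonably recent Hugging Face `datasets` version; exact arguments may need adjusting.
```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that reads the text files lazily,
# so the 33 GB corpus is never held in CPU RAM all at once.
streamed = load_dataset("text", data_files=data_path, streaming=True, split="train")


def tokenize(batch):
    # Tokenization also happens on the fly, batch by batch.
    return bert_tokenizer(batch["text"], truncation=True, max_length=512)


streamed = streamed.map(tokenize, batched=True, remove_columns=["text"])
```
If CPU memory stays flat with the streamed dataset, the growth was coming from the data pipeline; if it still climbs, the cause is elsewhere in the training loop.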