OutOfMemoryError: CUDA out of memory despite available GPU memory
Question
I’m encountering an issue with GPU memory allocation while training a GPT-2 model on a GPU with 24 GB of VRAM. Despite having a substantial amount of available memory, I’m receiving the following error:
> OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU
> 0; 23.68 GiB total capacity; 18.17 GiB already allocated; 64.62 MiB
> free; 18.60 GiB reserved in total by PyTorch) If reserved memory is >>
> allocated memory try setting max_split_size_mb to avoid fragmentation.
> See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
Here are the specifications of my setup and the model training:
GPU: NVIDIA GPU with 24 GB VRAM
Model: GPT-2, approximately 3 GB in size, with roughly 800 million parameters of 32 bits each
Training Data: 36,000 training examples with vector length of 600
Training Configuration: 5 epochs, batch size of 16, and fp16 enabled
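For context, here is a minimal sketch of what such a run might look like with the Hugging Face Trainer. The original post does not include its training code, so the checkpoint name (gpt2-large), the dummy dataset, and the Trainer usage below are assumptions for illustration only:

# sketch_train.py -- hypothetical reconstruction of the setup above;
# the checkpoint, dataset class and Trainer usage are assumptions,
# not code from the original post.
import torch
from torch.utils.data import Dataset
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

class DummyDataset(Dataset):
    """Stand-in for the 36,000 training examples of length 600."""
    def __len__(self):
        return 36_000
    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (600,))          # GPT-2 vocabulary size
        return {"input_ids": ids, "labels": ids.clone()}

model = GPT2LMHeadModel.from_pretrained("gpt2-large")  # ~0.8B params, ~3 GB in fp32

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=5,               # 5 epochs
    per_device_train_batch_size=16,   # batch size of 16
    fp16=True,                        # mixed precision enabled
)

Trainer(model=model, args=args, train_dataset=DummyDataset()).train()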
These are my calculations:
Model Size:
> GPT-2 model: ~3 GB
Gradients:
> Gradients are typically of the same size as the model’s parameters.
Batch Size and Training Examples:
> Batch Size: 16
>
> Training Examples: 36,000
>
> Vector Length: 600
>
> Memory Allocation per Batch:
>
> Model: 3 GB (unchanged per batch)
>
> Gradients: 3 GB (unchanged per batch)
>
> Input Data: 16 x 600 (vector length) x 4 bytes (assuming each value is
> a 32-bit float) = 37.5 KB per batch
>
> Output Data: 16 x 600 (vector length) x 4 bytes (assuming each value
> is a 32-bit float) = 37.5 KB per batch
Based on the above calculations, the memory allocation per batch for my scenario would be approximately:
Model: 3 GB
Gradients: 3 GB
Input and Output Data: 75 KB
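As a sanity check, the arithmetic above can be reproduced in a few lines; the parameter count of ~800 million is an assumption, and the figures are the ones listed in the question, not measured values:

# Back-of-the-envelope check of the per-batch figures listed above.
params = 800_000_000                    # ~800M parameters (assumed)
model_bytes = params * 4                # 32-bit floats -> ~3 GiB
grad_bytes = model_bytes                # gradients mirror the weights
io_bytes = 2 * 16 * 600 * 4             # input + output: 2 x batch x length x 4 B

print(f"model:     {model_bytes / 2**30:.2f} GiB")   # ~2.98 GiB
print(f"gradients: {grad_bytes / 2**30:.2f} GiB")    # ~2.98 GiB
print(f"in/out:    {io_bytes / 2**10:.1f} KiB")      # 75.0 KiB
# Note: this tally ignores optimizer state and activations, which
# the training process also keeps on the GPU.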
I would appreciate any insights or suggestions on how to resolve this issue. Thank you in advance for your assistance!
Answer 1
Score: 1
Usually this issue is caused by processes that are using CUDA without releasing their memory.
If no such process should still be running, the most effective approach is to identify the leftover ones and kill them.
From the command line, run:
nvidia-smi
If you have not installed it, you can do it with the following command:
sudo apt-get install -y nvidia-smi
It will print something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:18:00.0 Off | 0 |
| N/A 32C P0 37W / 250W | 11480MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 33W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:86:00.0 Off | 0 |
| N/A 53C P0 41W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:AF:00.0 Off | 0 |
| N/A 31C P0 35W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 29142 C /usr/bin/python3 11477MiB |
| 1 N/A N/A 29142 C /usr/bin/python3 10197MiB |
| 2 N/A N/A 29142 C /usr/bin/python3 10197MiB |
| 3 N/A N/A 29142 C /usr/bin/python3 10197MiB |
+-----------------------------------------------------------------------------+
At the bottom of the output, you will find the processes that are using the GPU(s), along with their PIDs. Assuming you are on Linux, you can kill them with the following command, replacing ProcessPID
with the actual PID of your process (again, make sure the process has finished and is no longer needed):
kill ProcessPID
If this does not work, try:
kill -9 ProcessPID
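If the memory is held by your own Python process rather than by a stale one, it can also help to inspect and release PyTorch's cached memory directly; the error message itself points at PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch; the max_split_size_mb value of 128 is an arbitrary example, not a recommendation from this answer:

# Inspect and release GPU memory from inside the training process.
import os, gc

# Must be set before the CUDA caching allocator is initialised
# (i.e. before the first tensor is moved to the GPU), or exported
# in the shell before launching the script.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

print(torch.cuda.memory_summary())            # detailed allocator report
print(torch.cuda.memory_allocated() / 2**30)  # GiB actually used by tensors
print(torch.cuda.memory_reserved() / 2**30)   # GiB cached by the allocator

# Drop references to large objects, then release cached blocks.
# ('model' and 'optimizer' are placeholders for your own variables.)
# del model, optimizer
gc.collect()
torch.cuda.empty_cache()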