
OutOfMemoryError: CUDA out of memory despite available GPU memory

Question


I’m encountering an issue with GPU memory allocation while training a GPT-2 model on a GPU with 24 GB of VRAM. Despite having a substantial amount of available memory, I’m receiving the following error:

> OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU
> 0; 23.68 GiB total capacity; 18.17 GiB already allocated; 64.62 MiB
> free; 18.60 GiB reserved in total by PyTorch) If reserved memory is >>
> allocated memory try setting max_split_size_mb to avoid fragmentation.
> See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
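
The max_split_size_mb setting mentioned in the error is passed to PyTorch's caching allocator through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of how it can be set (assuming it is applied before the first CUDA allocation; the value 128 is only illustrative, not a recommendation):

import os

# Set before torch initializes CUDA so the caching allocator picks it up;
# 128 MB is an illustrative value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set

print(torch.cuda.get_device_name(0))  # first CUDA call happens with the setting in place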

Here are the specifications of my setup and the model training:

GPU: NVIDIA GPU with 24 GB VRAM
Model: GPT-2, approximately 3 GB in size (~800M parameters stored as 32-bit floats)
Training Data: 36,000 training examples with a vector length of 600
Training Configuration: 5 epochs, batch size of 16, fp16 enabled

These are my calculations:

Model Size:

> GPT-2 model: ~3 GB

Gradients:

> Gradients are typically of the same size as the model’s parameters.

Batch Size and Training Examples:

> Batch Size: 16
>
> Training Examples: 36,000
>
> Vector Length: 600

Memory Allocation per Batch:

> Model: 3 GB (unchanged per batch)
>
> Gradients: 3 GB (unchanged per batch)
>
> Input Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
>
> Output Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch

Based on the above calculations, the memory allocation per batch for my scenario would be approximately:

Model: 3 GB

Gradients: 3 GB

Input and Output Data: 75 KB
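
Written out as a quick Python check (same figures as above; note that this estimate does not include optimizer state or the activations saved for the backward pass):

# Rough per-batch arithmetic from the figures above (illustrative only)
batch_size = 16
vector_length = 600
bytes_per_value = 4  # 32-bit float

model_bytes = 3 * 1024**3      # ~3 GB of parameters
gradient_bytes = model_bytes   # gradients roughly mirror the parameter size
io_bytes = 2 * batch_size * vector_length * bytes_per_value  # input + output

print(f"input + output per batch: {io_bytes / 1024:.1f} KB")                           # ~75.0 KB
print(f"parameters + gradients:   {(model_bytes + gradient_bytes) / 1024**3:.1f} GB")  # 6.0 GB
# Not counted here: optimizer state (e.g. Adam keeps extra per-parameter buffers)
# and activations, which grow with batch size and sequence length.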

I would appreciate any insights or suggestions on how to resolve this issue. Thank you in advance for your assistance!

Answer 1

Score: 1


Usually this issue is caused by processes that are using CUDA without releasing their memory.
If none of those processes still need to be running, the most effective fix is to identify them and kill them.

From command line, run:

nvidia-smi

If you have not installed it, you can do it with the following command:

sudo apt-get install -y nvidia-smi

It will print something like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   32C    P0    37W / 250W |  11480MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    33W / 250W |  10200MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   53C    P0    41W / 250W |  10200MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   31C    P0    35W / 250W |  10200MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     29142      C   /usr/bin/python3                11477MiB |
|    1   N/A  N/A     29142      C   /usr/bin/python3                10197MiB |
|    2   N/A  N/A     29142      C   /usr/bin/python3                10197MiB |
|    3   N/A  N/A     29142      C   /usr/bin/python3                10197MiB |
+-----------------------------------------------------------------------------+

At the bottom of the output you will find the processes that are using the GPU(s), along with their PIDs. Assuming you are using Linux, you can kill them with the following command, replacing ProcessPID with the actual PID of your process (first make sure the process has really finished its work and is safe to kill):

kill ProcessPID

If this does not work, try:

kill -9 ProcessPID
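
The same lookup can also be scripted. A minimal sketch using only the Python standard library and nvidia-smi's query interface (kill the reported PIDs only after confirming they are stale):

import subprocess

# List compute processes currently holding GPU memory.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    pid, name, mem = [field.strip() for field in line.split(",")]
    print(f"PID {pid}: {name} is using {mem}")
    # Once confirmed stale, a process can be terminated with
    # os.kill(int(pid), signal.SIGTERM), or with kill / kill -9 from the shell.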
