TensorFlow using all of one GPU but little of the other

Question

After running into OOM errors using the TensorFlow Object Detection API with an NVIDIA 3080 (10 GB), I bought a 4090 (24 GB). I am currently running both cards together, but I noticed that in runs with a large batch size I use almost all of the 3080's memory and only varying amounts of the 4090's. Ideally I'd like to fully use both cards so that I can push the batch size as high as possible. I can't find a way to change the strategy so that the GPUs take different loads: the mirrored strategies seem to give each GPU the same amount of data to process at each split. Is there a way to give one GPU more data and the other less?

My machine and environment specs are as follows:

OS: Ubuntu 22.04
GPUs: [0: 4090, 1: 3080]
Python: 3.10.9
cudatoolkit: 11.2.2 (installed through anaconda)
cudnn: 8.1.0.77 (installed through anaconda)

[Image: GPU memory usage during training]

I'm fairly new to this, so any help is appreciated. If I've left out any useful information, please let me know and I'll edit the post accordingly. Thanks in advance.

I've tried switching the distribution strategy from MultiWorkerMirroredStrategy to MirroredStrategy and to experimental.CentralStorageStrategy, with no real change. I was hoping that CentralStorageStrategy would let the CPU distribute the data more effectively.
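To make the even split concrete, here is a minimal arithmetic sketch in plain Python (no TensorFlow required). It assumes, as described above, that a synchronous mirrored strategy hands every GPU the same share of the global batch, and that the global batch divides evenly. The function names and the per-card limits are hypothetical and chosen only for illustration; they are not measurements from this machine.

```python
# Illustrative arithmetic only; the limits below are made-up numbers.

def per_replica_batch(global_batch: int, num_replicas: int) -> int:
    """Share each replica receives under an even split (assumes divisibility)."""
    return global_batch // num_replicas


def max_global_batch(max_per_gpu: dict[str, int]) -> int:
    """Largest global batch an even split allows: the smallest card sets the cap."""
    return min(max_per_gpu.values()) * len(max_per_gpu)


# Hypothetical: the largest per-GPU batch that fits in memory on each card.
limits = {"4090 (24 GB)": 24, "3080 (10 GB)": 10}

global_batch = max_global_batch(limits)
share = per_replica_batch(global_batch, len(limits))

print(f"max global batch: {global_batch}, per GPU: {share}")
for name, limit in limits.items():
    print(f"{name}: processes {share} samples, could fit {limit}")
```

With these assumed limits the 3080 is full at 10 samples per step while the 4090 also receives only 10, which would match the pattern of one card being nearly exhausted and the other partly idle.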

Derek

Answer 1

Score: 0

I ended up solving this by making conda-forge the default channel and installing ncurses from it. These are the commands I used:

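# Put conda-forge at the top of the channel list
conda config --add channels conda-forge
# Always prefer packages from the highest-priority channel
conda config --set channel_priority strict
# Install ncurses from conda-forge
conda install -c conda-forge ncurses
# Optional: list the ncurses builds available on conda-forge
conda search ncurses --channel conda-forge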

Hope this saves someone some time!

DB

huangapple
  • Published on March 21, 2023, 00:39:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75792992.html