RuntimeError: CUDA error: no kernel image is available for execution on the device (rastervision)

Question

Hi, I am trying to run a Raster Vision pipeline on an NVIDIA GeForce RTX 3050 GPU.

  • Ubuntu 22.04
  • PyTorch version: 1.12.0+cu116
  • CUDA: 12

But when I run the Docker container like this:
sudo docker run --rm --runtime=nvidia --gpus all -it -v ${RV_QUICKSTART_CODE_DIR}:/opt/src/code -v ${RV_QUICKSTART_OUT_DIR}:/opt/data/output quay.io/azavea/raster-vision:pytorch-0.20 /bin/bash

The model does not train and outputs this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

PS: running nvidia-smi prints the GPU's details, so the GPU is recognized.
I would very much appreciate some help with this issue.
Thanks!

This is the output I get:

Skipping 'analyze' command...
python -m rastervision.pipeline.cli run_command /opt/data/output/pipeline-config.json train
Running train command...
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Building datasets ...
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Physical CPUs: 12
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Logical CPUs: 16
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Total memory: 15.30 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of /opt/data volume: 445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of / volume: 445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Python version: 3.9.16 (main, Jan 11 2023, 16:05:54)
[GCC 11.2.0]
/bin/sh: 1: nvcc: not found
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO -
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Thu Mar 9 08:53:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P3    14W /  30W |    262MiB /  4096MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Devices:
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA GeForce RTX 3050 Ti Laptop GPU, 525.89.02, 4096 MiB, 262 MiB, 3639 MiB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - PyTorch version: 1.12.1+cu102
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA available: True
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA version: 10.2
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDNN version: 7605
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Number of CUDA devices: 1
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Active CUDA Device: GPU 0
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - model=SemanticSegmentationModelConfig(backbone=<Backbone.resnet50: 'resnet50'>, pretrained=True, init_weights=None, load_strict=True, external_def=None) solver=SolverConfig(lr=0.0001, num_epochs=1, test_num_epochs=2, test_batch_sz=4, overfit_num_steps=1, sync_interval=1, batch_sz=2, one_cycle=True, multi_stage=[], class_loss_weights=None, ignore_class_index=None, external_loss_def=None) data=SemanticSegmentationGeoDataConfig(scene_dataset='<1 train_scenes, 1 validation_scenes, 0 test_scenes>', window_opts="method=<GeoDataWindowMethod.random: 'random'> size=300 stride=None padding=None pad_direction='end' size_lims=(300, 301) h_lims=None w_lims=None max_windows=10 max_sample_attempts=100 efficient_aoi_sampling=True") predict_mode=False test_mode=False overfit_mode=False eval_train=False save_model_bundle=True log_tensorboard=True run_tensorboard=False output_uri='/opt/data/output/train'
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Using device: cuda
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - train_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - valid_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - test_ds: 0 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Plotting sample training batch.
2023-03-09 08:53:30:rastervision.pytorch_learner.learner: INFO - Plotting sample validation batch.
2023-03-09 08:53:31:rastervision.pytorch_learner.learner: INFO - epoch: 0
Training: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 251, in <module>
    _main()
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 247, in _main
    main()
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 236, in run_command
    _run_command(
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 218, in _run_command
    command_fn()
  File "/opt/src/rastervision_core/rastervision/core/rv_pipeline/rv_pipeline.py", line 154, in train
    backend.train(source_bundle_uri=self.config.source_bundle_uri)
  File "/opt/src/rastervision_pytorch_backend/rastervision/pytorch_backend/pytorch_learner_backend.py", line 120, in train
    learner.main()
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 267, in main
    self.train()
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1265, in train
    train_metrics = self.train_epoch(
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1188, in train_epoch
    output = self.train_step(batch, batch_ind)
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/semantic_segmentation_learner.py", line 26, in train_step
    out = self.post_forward(self.model(x))
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/segmentation/_utils.py", line 23, in forward
    features = self.backbone(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/_utils.py", line 69, in forward
    x = module(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 148, in forward
    self.num_batches_tracked.add_(1) # type: ignore[has-type]
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
make: *** [/opt/data/output/Makefile:6: 0] Error 1

Answer 1

Score: 8

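This error arises when the CUDA code was not compiled for your GPU architecture. Here, the PyTorch build shipped in the Raster Vision Docker image (1.12.1+cu102 according to your log, not the 1.12.0+cu116 listed in the question) does not include CUDA code compiled for sm_86 (Ampere GeForce). CUDA 10.2 predates Ampere, so a cu102 build cannot contain sm_86 kernels.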
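
One way to confirm this diagnosis from inside the container is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below relies on `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; the helper `build_supports_device` is a hypothetical name introduced here, and it encodes the usual CUDA compatibility rules as an assumption (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any equal or newer device).

```python
def build_supports_device(capability, arch_list):
    """Return True if any compiled architecture can run on the device.

    capability: (major, minor) tuple, e.g. (8, 6) for sm_86.
    arch_list:  strings such as "sm_70" or "compute_70".
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = divmod(int(number), 10)
        # Native binaries need the same major version and a minor
        # version that is not newer than the device's.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX is forward compatible with any equal or newer device.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("device capability:", capability)
        print("compiled architectures:", arch_list)
        print("supported:", build_supports_device(capability, arch_list))
    else:
        print("PyTorch with a visible CUDA device is required for the live check.")
```

If the last line reports `supported: False` on your GPU, the installed build has no kernels the card can run, which would match the error above.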

As a workaround, you can force the reinstallation of a PyTorch build that contains code for sm_86. After starting the container with docker run, run the following command inside it:

pip install --force-reinstall torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/
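
After reinstalling, it may be worth confirming that the interpreter inside the container actually picks up the new build. This is a minimal sketch, assuming the version string carries a local tag such as `+cu113` (as in the `1.12.1+cu102` shown in the log); `cuda_tag_to_version` is a hypothetical helper, not part of PyTorch.

```python
import re


def cuda_tag_to_version(torch_version):
    """Map a version string like "1.12.1+cu113" to "11.3"; None if untagged."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version, the rest is the major version.
    return f"{int(digits[:-1])}.{digits[-1]}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print("torch build:", torch.__version__,
              "-> CUDA", cuda_tag_to_version(torch.__version__))
        if torch.cuda.is_available():
            # A tiny GPU operation; it should fail with the same
            # RuntimeError if the build still lacks kernels for the device.
            print((torch.ones(1, device="cuda") + 1).item())
```

Note that the change lives only in the running container; since the docker run command in the question uses --rm, the reinstall has to be repeated for each new container unless you build a derived image.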

  • Posted by huangapple on 2023-03-09 17:01:53
  • Original link: https://go.coder-hub.com/75682385.html