RuntimeError: CUDA error: no kernel image is available for execution on the device (rastervision)
Question
Hi, I am trying to run a rastervision pipeline on an NVIDIA GeForce RTX 3050 GPU.
- Ubuntu 22.04
- PyTorch version: 1.12.0+cu116
- CUDA: 12
But when I run the Docker container like this:
sudo docker run --rm --runtime=nvidia --gpus all -it -v ${RV_QUICKSTART_CODE_DIR}:/opt/src/code -v ${RV_QUICKSTART_OUT_DIR}:/opt/data/output quay.io/azavea/raster-vision:pytorch-0.20 /bin/bash
The model does not train and outputs this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
PS: Running nvidia-smi prints the GPU's characteristics, so the GPU is recognized.
I would very much appreciate some help with this issue.
Thanks!
This is the output I get:
`Skipping 'analyze' command...
python -m rastervision.pipeline.cli run_command /opt/data/output/pipeline-config.json train
Running train command...
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Building datasets ...
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Physical CPUs: 12
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Logical CPUs: 16
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Total memory: 15.30 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of /opt/data volume: 445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of / volume: 445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Python version: 3.9.16 (main, Jan 11 2023, 16:05:54)
[GCC 11.2.0]
/bin/sh: 1: nvcc: not found
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO -
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Thu Mar 9 08:53:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 37C P3 14W / 30W | 262MiB / 4096MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Devices:
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA GeForce RTX 3050 Ti Laptop GPU, 525.89.02, 4096 MiB, 262 MiB, 3639 MiB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - PyTorch version: 1.12.1+cu102
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA available: True
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA version: 10.2
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDNN version: 7605
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Number of CUDA devices: 1
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Active CUDA Device: GPU 0
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - model=SemanticSegmentationModelConfig(backbone=<Backbone.resnet50: 'resnet50'>, pretrained=True, init_weights=None, load_strict=True, external_def=None) solver=SolverConfig(lr=0.0001, num_epochs=1, test_num_epochs=2, test_batch_sz=4, overfit_num_steps=1, sync_interval=1, batch_sz=2, one_cycle=True, multi_stage=[], class_loss_weights=None, ignore_class_index=None, external_loss_def=None) data=SemanticSegmentationGeoDataConfig(scene_dataset='<1 train_scenes, 1 validation_scenes, 0 test_scenes>', window_opts="method=<GeoDataWindowMethod.random: 'random'> size=300 stride=None padding=None pad_direction='end' size_lims=(300, 301) h_lims=None w_lims=None max_windows=10 max_sample_attempts=100 efficient_aoi_sampling=True") predict_mode=False test_mode=False overfit_mode=False eval_train=False save_model_bundle=True log_tensorboard=True run_tensorboard=False output_uri='/opt/data/output/train'
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Using device: cuda
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - train_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - valid_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - test_ds: 0 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Plotting sample training batch.
2023-03-09 08:53:30:rastervision.pytorch_learner.learner: INFO - Plotting sample validation batch.
2023-03-09 08:53:31:rastervision.pytorch_learner.learner: INFO - epoch: 0
Training: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 251, in <module>
_main()
File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 247, in _main
main()
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 236, in run_command
_run_command(
File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 218, in _run_command
command_fn()
File "/opt/src/rastervision_core/rastervision/core/rv_pipeline/rv_pipeline.py", line 154, in train
backend.train(source_bundle_uri=self.config.source_bundle_uri)
File "/opt/src/rastervision_pytorch_backend/rastervision/pytorch_backend/pytorch_learner_backend.py", line 120, in train
learner.main()
File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 267, in main
self.train()
File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1265, in train
train_metrics = self.train_epoch(
File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1188, in train_epoch
output = self.train_step(batch, batch_ind)
File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/semantic_segmentation_learner.py", line 26, in train_step
out = self.post_forward(self.model(x))
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torchvision/models/segmentation/_utils.py", line 23, in forward
features = self.backbone(x)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torchvision/models/_utils.py", line 69, in forward
x = module(x)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 148, in forward
self.num_batches_tracked.add_(1) # type: ignore[has-type]
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
make: *** [/opt/data/output/Makefile:6: 0] Error 1`
Answer 1
Score: 8
This error arises when the CUDA code was not compiled for your GPU's architecture. Here, the PyTorch build used by the Rastervision Docker image (1.12.1+cu102, according to the log in the question) does not include CUDA code compiled for sm_86 (Ampere GeForce).
As a workaround, you can force-reinstall a PyTorch build that contains code for sm_86. Once you have started your container with docker run, run the following command inside it:
pip install --force-reinstall torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/
Because the container in the question is started with --rm, the reinstalled package is discarded when the container exits, so the command has to be repeated in each new container.
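To check whether an installed PyTorch build covers a given GPU, one can compare the device's compute capability with the architectures the build was compiled for. The following is a minimal sketch and not part of Rastervision: the helper names parse_arch, device_is_covered and report are hypothetical, and the coverage rule is a simplification of CUDA's compatibility rules (an sm_XY binary is assumed to run on devices with the same major version and an equal or higher minor version, and compute_XY PTX is assumed to be JIT-compilable for any equal or newer capability). It relies on torch.cuda.get_device_capability and torch.cuda.get_arch_list.

```python
import re

# Matches tags such as "sm_86" or "compute_70"; a trailing letter (e.g. "sm_90a") is tolerated.
_ARCH_RE = re.compile(r"^(sm|compute)_(\d+)[a-z]?$")


def parse_arch(tag):
    """Split an architecture tag into (kind, major, minor), or return None if unrecognized."""
    match = _ARCH_RE.match(tag)
    if match is None or len(match.group(2)) < 2:
        return None
    digits = match.group(2)
    # The last digit is the minor version, everything before it is the major version.
    return match.group(1), int(digits[:-1]), int(digits[-1])


def device_is_covered(capability, arch_list):
    """Return True if a device with this (major, minor) capability can run the build.

    Simplified rule (an assumption of this sketch):
    - "sm" binaries run on devices with the same major and an equal or higher minor;
    - "compute" PTX can be JIT-compiled for any equal or newer capability.
    """
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, arch_major, arch_minor = parsed
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def report():
    """Print the capability of GPU 0 and whether the installed PyTorch build covers it."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("PyTorch cannot see a CUDA device.")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("Device capability:", capability)
    print("Architectures in this build:", arch_list)
    print("Covered:", device_is_covered(capability, arch_list))


if __name__ == "__main__":
    report()
```

Run inside the container before and after the reinstall, the "Covered" line gives a quick indication of whether the build includes kernels usable on the GPU. An RTX 3050 Ti reports capability (8, 6), so a build whose list has no sm_8x entry and no PTX would be reported as not covered.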