Getting an error while training a YOLACT model with a ResNet18 backbone

Question

I was training a YOLACT model with a ResNet18 backbone and everything was going fine, but then I suddenly got an error and the training was aborted. The error appeared after about 3-4 hours of training.

I modified the YOLACT backbone configuration to use ResNet18 (see the sketch below).
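
For reference, the change followed the pattern that YOLACT's data/config.py uses for its existing ResNet backbones. Treat the snippet as a sketch rather than my exact diff: the weights filename, the selected_layers value, and the assumption that ResNetBackbone can be built from ResNet18-style layer counts (it defaults to Bottleneck blocks) all need to be checked against your copy of the repo.

    # Sketch: goes in data/config.py next to the existing backbone definitions.
    # Modelled on resnet50_backbone; values marked "placeholder" are assumptions.
    resnet18_backbone = resnet101_backbone.copy({
        'name': 'ResNet18',
        'path': 'resnet18.pth',             # placeholder pretrained-weights filename
        'type': ResNetBackbone,
        'args': ([2, 2, 2, 2],),            # ResNet18 layer counts (assumes the class accepts them)
        'transform': resnet_transform,
    })

    # Sketch: training config that swaps in the backbone above.
    yolact_resnet18_config = yolact_base_config.copy({
        'name': 'yolact_resnet18',
        'backbone': resnet18_backbone.copy({
            'selected_layers': list(range(1, 4)),   # placeholder; match your FPN settings
        }),
    })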

Using: Ubuntu 10.04, PyTorch 1.12.1+cu113, Python 3.9.12, GPU: 2 x NVIDIA RTX A6000

[ 2] 32930 || B: 3.996 | C: 4.955 | M: 4.408 | S: 1.036 | T: 14.395 || ETA: 1 day, 4:21:46 || timer: 0.127
[ 2] 32940 || B: 4.020 | C: 5.072 | M: 4.458 | S: 1.104 | T: 14.655 || ETA: 1 day, 4:23:08 || timer: 0.132
[ 2] 32950 || B: 4.063 | C: 5.246 | M: 4.552 | S: 1.199 | T: 15.060 || ETA: 1 day, 4:23:34 || timer: 0.151
[ 2] 32960 || B: 4.057 | C: 5.435 | M: 4.608 | S: 1.271 | T: 15.371 || ETA: 1 day, 4:23:31 || timer: 0.158
[ 2] 32970 || B: 4.071 | C: 5.574 | M: 4.675 | S: 1.306 | T: 15.626 || ETA: 1 day, 4:24:23 || timer: 0.166
[ 2] 32980 || B: 4.033 | C: 5.746 | M: 4.758 | S: 1.381 | T: 15.918 || ETA: 1 day, 4:26:38 || timer: 0.140
[ 2] 32990 || B: 4.031 | C: 5.817 | M: 4.741 | S: 1.411 | T: 15.999 || ETA: 1 day, 4:25:21 || timer: 0.139
[ 2] 33000 || B: 4.055 | C: 5.763 | M: 4.799 | S: 1.412 | T: 16.028 || ETA: 1 day, 4:25:51 || timer: 0.128

Traceback (most recent call last):
  File "/home/gangwa/miniconda3/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/gangwa/miniconda3/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 364, in reduce_storage
    shared_cache[cache_key] = StorageWeakRef(storage)
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 65, in __setitem__
    self.free_dead_references()
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 70, in free_dead_references
    if storage_ref.expired():
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 35, in expired
    return torch.Storage._expired(self.cdata)  # type: ignore[attr-defined]
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/storage.py", line 757, in _expired
    return eval(cls.__module__)._UntypedStorage._expired(*args, **kwargs)
AttributeError: module 'torch.cuda' has no attribute '_UntypedStorage'

[ 2] 33010 || B: 4.178 | C: 5.768 | M: 4.934 | S: 1.417 | T: 16.296 || ETA: 1 day, 4:49:09 || timer: 0.126

Traceback (most recent call last):
  File "/home/gangwa/miniconda3/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/gangwa/miniconda3/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 364, in reduce_storage
    shared_cache[cache_key] = StorageWeakRef(storage)
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 65, in __setitem__
    self.free_dead_references()
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 70, in free_dead_references
    if storage_ref.expired():
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 35, in expired
    return torch.Storage._expired(self.cdata)  # type: ignore[attr-defined]
  File "/home/gangwa/miniconda3/lib/python3.9/site-packages/torch/storage.py", line 757, in _expired
    return eval(cls.__module__)._UntypedStorage._expired(*args, **kwargs)
AttributeError: module 'torch.cuda' has no attribute '_UntypedStorage'

Does anyone have any idea why I got this after around 3-4 hours of training?


Answer 1

Score: 1


I ran into a similar problem previously. I never found the root cause, but the following workarounds fixed it for me (a sketch of the first one follows the list):

  1. Change num_workers to 0 in the DataLoader. Source
  2. Upgrade PyTorch from 1.12.1 to 1.13. Source
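
For the first fix, here is a minimal sketch of the DataLoader call with workers disabled; the dataset, batch size, and collate function are placeholders for whatever your training script already builds:

    from torch.utils.data import DataLoader

    # With num_workers=0 batches are loaded in the main process, so the
    # worker-side pickling path in torch.multiprocessing (the code shown
    # in the traceback) is never exercised.
    train_loader = DataLoader(
        train_dataset,                  # placeholder: your existing dataset object
        batch_size=8,                   # placeholder: keep your current batch size
        shuffle=True,
        num_workers=0,                  # 0 disables worker processes
        collate_fn=detection_collate,   # placeholder: your existing collate function
        pin_memory=True,
    )

For the second fix, install a PyTorch 1.13 build that matches your CUDA setup (the official install selector gives the exact pip/conda command); the storage classes that appear in the traceback were reorganized after 1.12, which is presumably why upgrading avoids the error.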
