PyTorch nn.DataParallel: RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Question

I am using the nn.DataParallel class to utilize multiple GPUs on a single machine. I have followed some Stack Overflow questions and answers but still get a simple error, and I have no idea why.

Followed Questions

  1. https://stackoverflow.com/questions/61778066/runtimeerror-input-type-torch-cuda-floattensor-and-weight-type-torch-floatte

  2. https://stackoverflow.com/questions/59013109/runtimeerror-input-type-torch-floattensor-and-weight-type-torch-cuda-floatte

Code

    # Utilize multiple GPUS
    if 'cuda' in device:
        print(device)
        print("using data parallel")
        net = torch.nn.DataParallel(model_ft) # make parallel
        cudnn.benchmark = True

    # Transfer the model to GPU
    #model_ft = model_ft.to(device)

    # # Print model summary
    # print('Model Summary:-\n')
    # for num, (name, param) in enumerate(model_ft.named_parameters()):
    #     print(num, name, param.requires_grad)
    # summary(model_ft, input_size=(3, size, size))
    # print(model_ft)

    # Loss function
    criterion = nn.CrossEntropyLoss()

    # Optimizer
    optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

    # Learning rate decay
    exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

    # Model training routine
    print("\nTraining:-\n")
    def train_model(model, criterion, optimizer, scheduler, num_epochs=30):
        since = time.time()
        best_model_wts = copy.deepcopy(model.state_dict())
        best_acc = 0.0
        # Tensorboard summary
        writer = SummaryWriter()
        for epoch in range(num_epochs):
            print('Epoch {}/{}'.format(epoch, num_epochs - 1))
            print('-' * 10)
            # Each epoch has a training and validation phase
            for phase in ['train', 'valid']:
                if phase == 'train':
                    model.train() # Set model to training mode
                else:
                    model.eval() # Set model to evaluate mode
                running_loss = 0.0
                running_corrects = 0
                # Iterate over data.
                for inputs, labels in dataloaders[phase]:
                    inputs = inputs
                    labels = labels
                    inputs = inputs.to(device, non_blocking=True)
                    labels = labels.to(device, non_blocking=True)
                    # zero the parameter gradients
                    optimizer.zero_grad()
                    # forward
                    # track history if only in train
                    with torch.set_grad_enabled(phase == 'train'):
                        outputs = model(inputs)
                        _, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)
                        # backward + optimize only if in training phase
                        if phase == 'train':
                            loss.backward()
                            optimizer.step()
                    # statistics
                    running_loss += loss.item() * inputs.size(0)
                    running_corrects += torch.sum(preds == labels.data)
                if phase == 'train':
                    scheduler.step()
                epoch_loss = running_loss / dataset_sizes[phase]
                epoch_acc = running_corrects.double() / dataset_sizes[phase]
                print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                    phase, epoch_loss, epoch_acc))
                # Record training loss and accuracy for each phase
                if phase == 'train':
                    writer.add_scalar('Train/Loss', epoch_loss, epoch)
                    writer.add_scalar('Train/Accuracy', epoch_acc, epoch)
                    writer.flush()
                else:
                    writer.add_scalar('Valid/Loss', epoch_loss, epoch)
                    writer.add_scalar('Valid/Accuracy', epoch_acc, epoch)
                    writer.flush()
                # deep copy the model
                if phase == 'valid' and epoch_acc > best_acc:
                    best_acc = epoch_acc
                    best_model_wts = copy.deepcopy(model.state_dict())
            print()
        time_elapsed = time.time() - since
        print('Training complete in {:.0f}m {:.0f}s'.format(
            time_elapsed // 60, time_elapsed % 60))
        print('Best val Acc: {:4f}'.format(best_acc))
        # load best model weights
        model.load_state_dict(best_model_wts)
        return model

    # Train the model
    model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                           num_epochs=num_epochs)
    # Save the entire model
    print("\nSaving the model...")
    torch.save(model_ft, PATH)

Traceback

    Traceback (most recent call last):
      File "/home2/coremax/Documents/pytorch-image-classification/train.py", line 263, in <module>
        model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
      File "/home2/coremax/Documents/pytorch-image-classification/train.py", line 214, in train_model
        outputs = model(inputs)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/timm/models/resnet.py", line 730, in forward
        x = self.forward_features(x)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/timm/models/resnet.py", line 709, in forward_features
        x = self.conv1(x)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
        input = module(input)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
        return F.conv2d(input, weight, bias, self.stride,
    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Answer 1

Score: 1

As the error message shows, the input and the model's weights are not the same type: the input is a torch.cuda.FloatTensor and the weights are a torch.FloatTensor. In other words, the input is on the GPU while the model's weights are still on the CPU. You can fix this by moving the model to the GPU before training. The correct line, model_ft = model_ft.to(device), is already present near the top of your code but commented out. Uncommenting it should fix the problem.
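A minimal sketch of that ordering, assuming a small stand-in network in place of the timm ResNet from the traceback (the layer sizes, the random input batch and the helper param_device are hypothetical, chosen only for illustration). It moves the weights to the device first, wraps the model in nn.DataParallel only when more than one GPU is visible, and checks that the inputs and weights share a device before the forward pass. It falls back to the CPU when CUDA is unavailable.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the asker's model_ft (a timm ResNet in the traceback).
model_ft = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# 1. Move the weights to the target device first.
model_ft = model_ft.to(device)

# 2. Wrap for multi-GPU use and keep using the wrapped object afterwards.
if "cuda" in device and torch.cuda.device_count() > 1:
    model_ft = nn.DataParallel(model_ft)


def param_device(model):
    """Return the device that holds the model's first parameter."""
    return next(model.parameters()).device


# 3. Send the inputs to the same device as the weights.
inputs = torch.randn(4, 3, 32, 32).to(device)  # made-up batch of 4 RGB images
assert inputs.device == param_device(model_ft), "inputs and weights are on different devices"

outputs = model_ft(inputs)
print(outputs.shape)
```

One further detail visible in the question's code: the DataParallel wrapper is assigned to net, but train_model is called with model_ft. Uncommenting the .to(device) line should remove the error, yet training would then probably run on a single GPU unless the wrapped model is the one passed to train_model, as the sketch does by reassigning model_ft.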

huangapple
  • Posted on May 10, 2023, 12:34:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76214922.html