May 10, 2023, 12:34:15

Pytorch nn.DataParallel: RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Question

I am using the nn.DataParallel class to utilize multiple GPUs on a single machine. I have followed several Stack Overflow questions and answers, but I still get the error shown in the traceback and cannot see what causes it.

Followed Questions

Code

# Utilize multiple GPUs
if 'cuda' in device:
    print(device)
    print("using data parallel")
    net = torch.nn.DataParallel(model_ft) # make parallel
    cudnn.benchmark = True

# Transfer the model to GPU
#model_ft = model_ft.to(device)

# # Print model summary
# print('Model Summary:-\n')
# for num, (name, param) in enumerate(model_ft.named_parameters()):
#     print(num, name, param.requires_grad)
# summary(model_ft, input_size=(3, size, size))
# print(model_ft)

# Loss function
criterion = nn.CrossEntropyLoss()

# Optimizer 
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# Learning rate decay
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

# Model training routine 
print("\nTraining:-\n")


def train_model(model, criterion, optimizer, scheduler, num_epochs=30):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    # Tensorboard summary
    writer = SummaryWriter()

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs
                labels = labels

                inputs = inputs.to(device, non_blocking=True)
                labels = labels.to(device, non_blocking=True)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # Record training loss and accuracy for each phase
            if phase == 'train':
                writer.add_scalar('Train/Loss', epoch_loss, epoch)
                writer.add_scalar('Train/Accuracy', epoch_acc, epoch)
                writer.flush()
            else:
                writer.add_scalar('Valid/Loss', epoch_loss, epoch)
                writer.add_scalar('Valid/Accuracy', epoch_acc, epoch)
                writer.flush()

            # deep copy the model
            if phase == 'valid' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model


# Train the model
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                       num_epochs=num_epochs)
# Save the entire model
print("\nSaving the model...")
torch.save(model_ft, PATH)

Traceback

Traceback (most recent call last):
  File "/home2/coremax/Documents/pytorch-image-classification/train.py", line 263, in <module>
    model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
  File "/home2/coremax/Documents/pytorch-image-classification/train.py", line 214, in train_model
    outputs = model(inputs)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/timm/models/resnet.py", line 730, in forward
    x = self.forward_features(x)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/timm/models/resnet.py", line 709, in forward_features
    x = self.conv1(x)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Answer 1

Score: 1

As the error message shows, the input and the model's weights are of different types: the input is a torch.cuda.FloatTensor while the weights are a torch.FloatTensor. In other words, the input is on the GPU while the model's weights are still on the CPU. You can fix this by moving the model to the GPU before training. The correct line, model_ft = model_ft.to(device), is already present near the top of your code but commented out; uncommenting it should fix the problem.
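A minimal sketch of that fix, under some assumptions: the small nn.Sequential network below is only a stand-in for model_ft, prepare_model is a hypothetical helper name, and device is built as a torch.device rather than the string used in the question. It also reassigns the DataParallel wrapper back to model_ft, since the code in the question stores the wrapper in net but passes model_ft to train_model, so the wrapper would otherwise go unused.

```python
import torch
import torch.nn as nn


def prepare_model(model: nn.Module, device: torch.device) -> nn.Module:
    """Hypothetical helper: move the weights to `device`, then wrap for multi-GPU use."""
    # Put the weights on the same device the inputs will be sent to.
    model = model.to(device)
    # Only wrap when more than one GPU is actually visible.
    if device.type == "cuda" and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model


# Falls back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in for model_ft (assumption: any nn.Module is handled the same way).
model_ft = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)

# Keep using the returned object everywhere (optimizer, training loop).
model_ft = prepare_model(model_ft, device)

# Inputs and weights are now on the same device, so the forward pass succeeds.
inputs = torch.randn(4, 3, 32, 32).to(device)
outputs = model_ft(inputs)
print(outputs.shape, next(model_ft.parameters()).device)
```

In the question's script, the equivalent change would be to move the model before creating the optimizer and to pass the wrapped model to train_model.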
