
Iterating over PyTorch DataLoader slower than direct dataset access

Question

I am using PyTorch to train a machine learning model, and I have encountered a significant issue: iterating over the DataLoader is noticeably slower than directly accessing the dataset. My main goal is to speed up data loading during training, since waiting for the DataLoader to fetch data takes up a considerable part of the training time.

For example, when I iterate over the DataLoader like this:

for inputs, labels in tqdm(dataloader):
    pass

It takes more than 15 seconds to complete.

However, when I iterate directly over the dataset:

for inputs, labels in tqdm(zip(dataloader.dataset.data, dataloader.dataset.targets)):
    pass

It completes in less than 1 second.

I have already disabled shuffling, and I've experimented with adjusting the num_workers parameter, but it didn't significantly reduce the time difference.
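
For reference, the tuning I tried looks roughly like the sketch below; the specific values and the extra flags (pin_memory, persistent_workers) are illustrative rather than the exact configuration I benchmarked:

trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=64,            # illustrative; the reproducible example below uses batch_size=1
    shuffle=False,            # shuffling already disabled
    num_workers=4,            # fetch samples in background worker processes
    pin_memory=True,          # faster host-to-GPU copies when training on a GPU
    persistent_workers=True,  # keep workers alive between epochs (requires num_workers > 0)
)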

The issue I am facing is not related to resource constraints, as my CPU and memory utilization are well below their maximum capacities. The waiting time during training occurs specifically when using the PyTorch DataLoader, and I'm seeking solutions to speed up the data loading process for more efficient training.

Moreover, disk I/O is not the bottleneck either: the slowdown shows up inside the PyTorch DataLoader itself, which takes far longer than expected despite sufficient I/O throughput.

Reproducible example:

import torch
from tqdm import tqdm
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

trainset = datasets.MNIST('MNINST', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=1, shuffle=False)
# Iterating through the DataLoader (slow)
for data, targets in tqdm(trainloader):
    pass

# Iterating directly over the raw dataset tensors (fast)
for data, targets in tqdm(zip(trainloader.dataset.data, trainloader.dataset.targets)):
    pass

[screenshot: timing results]

Edit:
The impact becomes more pronounced as the batch_size increases. For example, I replicated the DataLoader behavior with shuffle=True and batch_size=64, taking MarGenDo's remarks into account:

import torch
from tqdm import tqdm
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

batch_size = 64
trainset = datasets.MNIST('MNINST', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)

# Standard DataLoader loop
for data, targets in tqdm(trainloader):
    pass

# Manual replication: shuffle, batch, collate and normalize the raw tensors
indices = torch.randperm(len(trainset))
for i in tqdm(range(0, len(indices), batch_size)):
    data = []
    targets = []

    for j in range(i, i + batch_size):
        if j < len(indices):
            data.append(trainset.data[indices[j]])
            targets.append(trainset.targets[indices[j]])

    data = torch.utils.data.default_collate(data)
    targets = torch.utils.data.default_collate(targets)

    # Convert to float, scale, and normalize (mirrors ToTensor + Normalize)
    tensor = (data.to(torch.float) / 256).unsqueeze(0)
    normalized = transforms.functional.normalize(tensor, (0.5,), (0.5,))

[screenshot: timing results with batch_size=64]

Edit 2: taking MarGenDo's second remark into account:

import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx], self.labels[idx]
        return sample


n = 100000
data = torch.randn(n, 3, 28, 28)     # synthetic image-like tensors, already float
labels = torch.randint(0, 10, (n,))  # random integer class labels

custom_dataset = CustomDataset(data, labels)

batch_size = 1
dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=False)

for inputs, labels in tqdm(dataloader):
    pass

for inputs, labels in tqdm(zip(dataloader.dataset.data, dataloader.dataset.labels)):
    pass

[screenshot: timing results for the custom dataset]


Answer 1

Score: 0

The significant time difference is caused by inefficient conversions between PIL images and torch tensors.

In the first case, the DataLoader internally iterates over the dataset (=trainset); each access yields a tuple of a PIL image and its target, and the PIL image is converted to a tensor and normalized by your transforms pipeline (ToTensor, Normalize).

In the second case, you are directly iterating over trainloader.dataset.data, which points to trainset.data. There is no conversion or normalization.

Comparing one sample from each case: the first is a normalized float tensor of the image, while the second is the raw image represented as uint8 (range 0-255).
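
You can verify this directly; the quick check below is only illustrative and assumes the trainset defined in the question:

img, target = trainset[0]     # goes through ToTensor + Normalize
print(img.dtype, img.shape)   # torch.float32, torch.Size([1, 28, 28])

raw = trainset.data[0]        # raw stored image, no transforms applied
print(raw.dtype, raw.shape)   # torch.uint8, torch.Size([28, 28])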


The following code replicates what the DataLoader is doing in the background when iterating over the dataset. Both of these loops take roughly the same time to run:

# Standard DataLoader loop
for data, target in tqdm(trainloader):
    pass

# Same loop without the DataLoader
trainset = datasets.MNIST('MNINST', download=True, train=True)  # remove transforms from the trainset
for data, target in tqdm(trainset):
    tensor = transforms.functional.to_tensor(data)
    normalized = transforms.functional.normalize(tensor, (0.5,), (0.5,))
    # `normalized` now contains the same tensor as `data` in the previous case

Now, to actually improve the efficiency, you could start from your approach of iterating over the dataset's raw data and normalize it manually:

for data, target in tqdm(zip(trainloader.dataset.data, trainloader.dataset.targets)):
    tensor = (data.to(torch.float) / 256).unsqueeze(0)
    normalized = transforms.functional.normalize(tensor, (0.5,), (0.5,))
    # `normalized` now contains the same tensor as `data` in the previous case

This reduces running time by more than half on my machine.
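
If you want to keep the DataLoader interface, another option along the same lines is to precompute the normalized tensors once and wrap them in a TensorDataset, so no per-sample transform runs while iterating. A minimal sketch, assuming the imports and trainset from the question (it keeps the /256 scaling used above; ToTensor itself divides by 255):

# One-off preprocessing: turn the stored uint8 images into normalized float tensors
all_data = (trainset.data.to(torch.float) / 256).unsqueeze(1)   # (60000, 1, 28, 28)
all_data = transforms.functional.normalize(all_data, (0.5,), (0.5,))

tensor_trainset = torch.utils.data.TensorDataset(all_data, trainset.targets)
tensor_trainloader = torch.utils.data.DataLoader(tensor_trainset, batch_size=64, shuffle=True)

for data, target in tqdm(tensor_trainloader):
    pass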
