Iterating over PyTorch DataLoader slower than direct dataset access
Question
I am training a machine learning model with PyTorch, and iterating over the DataLoader is noticeably slower than accessing the dataset directly. My main goal is to speed up data loading, since training spends considerably more time waiting for the DataLoader to fetch data.
For example, when I iterate over the DataLoader like this:
    for inputs, labels in tqdm(dataloader):
        pass
It takes more than 15 seconds to complete.
However, when I iterate directly over the dataset:
    for inputs, labels in tqdm(zip(dataloader.dataset.data, dataloader.dataset.targets)):
        pass
It completes in less than 1 second.
I have already disabled shuffling and tried different values of num_workers, but neither significantly reduced the time difference.
This does not look like a resource constraint: my CPU and memory utilization are well below their maximum capacities, and disk I/O is not the limiting factor either. The waiting happens specifically inside the PyTorch DataLoader, which takes longer than expected.
Reproducible example:
    import torch
    from tqdm import tqdm
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ])
    trainset = datasets.MNIST('MNINST', download=True, train=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=1, shuffle=False)

    # Iterate through the DataLoader (transforms are applied per sample)
    for data, targets in tqdm(trainloader):
        pass

    # Iterate over the raw tensors directly (no transforms applied)
    for data, targets in tqdm(zip(trainloader.dataset.data, trainloader.dataset.targets)):
        pass
Edit:
The impact becomes more pronounced as the batch_size increases. For example, I replicated a DataLoader with shuffle=True and batch_size=64, taking MarGenDo's remarks into account:
    import torch
    from tqdm import tqdm
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ])
    batch_size = 64
    trainset = datasets.MNIST('MNINST', download=True, train=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)

    # Standard DataLoader loop
    for data, targets in tqdm(trainloader):
        pass

    # Manual replication: shuffle, gather one batch at a time, collate, normalize
    indices = torch.randperm(len(trainset))
    for i in tqdm(range(0, len(indices), batch_size)):
        data = []
        targets = []
        for j in range(i, i + batch_size):
            if j < len(indices):
                data.append(trainset.data[indices[j]])
                targets.append(trainset.targets[indices[j]])
        data = torch.utils.data.default_collate(data)
        targets = torch.utils.data.default_collate(targets)
        # ToTensor scales uint8 by 255; unsqueeze(1) adds the channel dim -> (B, 1, 28, 28)
        tensor = (data.to(torch.float) / 255).unsqueeze(1)
        normalized = transforms.functional.normalize(tensor, (0.5,), (0.5,))
Edit 2: taking MarGenDo's second remark into account:
    import torch
    from torch.utils.data import Dataset, DataLoader
    from tqdm import tqdm

    class CustomDataset(Dataset):
        def __init__(self, data, labels):
            self.data = data
            self.labels = labels

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            sample = self.data[idx], self.labels[idx]
            return sample

    n = 100000
    data = torch.randn(n, 3, 28, 28)
    labels = torch.randint(0, 10, (n,))
    custom_dataset = CustomDataset(data, labels)
    batch_size = 1
    dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=False)

    # Iterate through the DataLoader
    for inputs, labels in tqdm(dataloader):
        pass

    # Iterate over the underlying tensors directly
    for inputs, labels in tqdm(zip(dataloader.dataset.data, dataloader.dataset.labels)):
        pass
Answer 1
Score: 0
The significant time difference is caused by inefficient conversions between PIL images and torch tensors.
In the first case, the DataLoader internally iterates over the dataset (= trainset), which yields tuples of PIL images and targets. Each PIL image is then converted to a tensor by the transforms pipeline you provided (ToTensor, Normalize).
In the second case, you iterate directly over trainloader.dataset.data, which points to trainset.data. No conversion or normalization takes place.
Comparing one sample from each case: the first is a normalized image tensor with a float dtype, while the second is stored as uint8 (range 0-255).
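As a quick check of that dtype and range difference, here is a minimal sketch that does not need the MNIST download. It assumes a raw sample looks like a 28x28 uint8 tensor (the synthetic `raw` below is a stand-in, not real data) and reproduces the arithmetic of ToTensor followed by Normalize((0.5,), (0.5,)) by hand:

```python
import torch

# Synthetic stand-in for one raw image as stored in `trainset.data`
# (assumption: uint8, shape 28x28, values 0-255).
raw = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# Numerically, ToTensor + Normalize((0.5,), (0.5,)) amount to:
# scale to [0, 1], add a channel dimension, then apply (x - mean) / std.
converted = (raw.to(torch.float32) / 255).unsqueeze(0)
converted = (converted - 0.5) / 0.5

print(raw.dtype, tuple(raw.shape))              # torch.uint8 (28, 28)
print(converted.dtype, tuple(converted.shape))  # torch.float32 (1, 28, 28)
print(converted.min().item() >= -1.0, converted.max().item() <= 1.0)  # True True
```

So the two loops in the question are not producing the same objects: one yields ready-to-train float tensors, the other yields the untouched storage.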
The following code replicates what the DataLoader does in the background when iterating over the dataset. Both loops take roughly the same time to run.
    # Standard DataLoader loop
    for data, target in tqdm(trainloader):
        pass

    # Same loop without the DataLoader
    trainset = datasets.MNIST('MNINST', download=True, train=True)  # Remove transforms from the trainset
    for data, target in tqdm(trainset):
        tensor = transforms.functional.to_tensor(data)
        normalized = transforms.functional.normalize(tensor, (0.5,), (0.5,))
        # `normalized` now contains the same tensor as `data` in the previous case
To actually improve efficiency, you could start from your approach of iterating over the dataset's raw data and normalize it manually:
    for data, target in tqdm(zip(trainloader.dataset.data, trainloader.dataset.targets)):
        # ToTensor scales uint8 values by 255, so divide by 255 to match it
        tensor = (data.to(torch.float) / 255).unsqueeze(0)
        normalized = transforms.functional.normalize(tensor, (0.5,), (0.5,))
        # `normalized` now contains the same tensor as `data` in the previous case
This reduces running time by more than half on my machine.
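Going one step further, when the whole dataset fits in memory and the transforms are deterministic (no random augmentation), the conversion can be done once for all samples and batches can be taken by indexing the tensors directly. The sketch below is one possible way to do that, not what DataLoader does internally. It uses synthetic tensors as a stand-in for trainset.data and trainset.targets, and `iterate_batches` is a hypothetical helper name, not a PyTorch API:

```python
import torch

# Synthetic stand-in for `trainset.data` / `trainset.targets`
# (assumption: uint8 images of shape (N, 28, 28) and integer labels).
n, batch_size = 1000, 64
raw_images = torch.randint(0, 256, (n, 28, 28), dtype=torch.uint8)
raw_targets = torch.randint(0, 10, (n,))

# Convert and normalize the whole dataset once, outside the training loop.
images = (raw_images.to(torch.float32) / 255).unsqueeze(1)  # (N, 1, 28, 28)
images = (images - 0.5) / 0.5


def iterate_batches(images, targets, batch_size, shuffle=True):
    """Yield (images, targets) batches by indexing the tensors directly."""
    num_samples = images.shape[0]
    order = torch.randperm(num_samples) if shuffle else torch.arange(num_samples)
    for start in range(0, num_samples, batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], targets[idx]


batches = list(iterate_batches(images, raw_targets, batch_size))
print(len(batches), tuple(batches[0][0].shape), tuple(batches[-1][0].shape))
# 16 (64, 1, 28, 28) (40, 1, 28, 28)
```

The trade-off is memory: the float32 copy is four times the size of the uint8 storage, and per-sample random transforms would have to be applied some other way.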