DataLoader error: RuntimeError: stack expects each tensor to be equal size, but got [1024] at entry 0 and [212] at entry 13
Question
I have a dataset consisting of a column named input_ids, which I'm loading with a DataLoader:
train_batch_size = 2
eval_dataloader = DataLoader(val_dataset, batch_size=train_batch_size)
The length of eval_dataloader is:
print(len(eval_dataloader))
>>> 1623
I'm getting the error when I run:
for step, batch in enumerate(eval_dataloader):
    print(step)
>>> 1,2... ,1621
Each sequence should have length 1024. If I change train_batch_size to 1, the error disappears.
I tried dropping the last batch with:
eval_dataloader = DataLoader(val_dataset, batch_size=train_batch_size, drop_last=True)
But the error still occurs with any batch size greater than 1.
The complete stack trace:
RuntimeError Traceback (most recent call last)
Cell In[34], line 2
1 eval_dataloader = DataLoader(val_dataset,shuffle=True,batch_size=2,drop_last=True)
----> 2 for step, batch in enumerate(eval_dataloader):
3 print(step, batch['input_ids'].shape)
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/dataloader.py:628, in _BaseDataLoaderIter.__next__(self)
625 if self._sampler_iter is None:
626 # TODO(https://github.com/pytorch/pytorch/issues/76750)
627 self._reset() # type: ignore[call-arg]
--> 628 data = self._next_data()
629 self._num_yielded += 1
630 if self._dataset_kind == _DatasetKind.Iterable and \
631 self._IterableDataset_len_called is not None and \
632 self._num_yielded > self._IterableDataset_len_called:
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/dataloader.py:671, in _SingleProcessDataLoaderIter._next_data(self)
669 def _next_data(self):
670 index = self._next_index() # may raise StopIteration
--> 671 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
672 if self._pin_memory:
673 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:61, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
59 else:
60 data = self.dataset[possibly_batched_index]
---> 61 return self.collate_fn(data)
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py:265, in default_collate(batch)
204 def default_collate(batch):
205 r"""
206 Function that takes in a batch of data and puts the elements within the batch
207 into a tensor with an additional outer dimension - batch size. The exact output type can be
(...)
263 >>> default_collate(batch) # Handle `CustomType` automatically
264 """
--> 265 return collate(batch, collate_fn_map=default_collate_fn_map)
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py:128, in collate(batch, collate_fn_map)
126 if isinstance(elem, collections.abc.Mapping):
127 try:
--> 128 return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
129 except TypeError:
130 # The mapping type may not support `__init__(iterable)`.
131 return {key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem}
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py:128, in <dictcomp>(.0)
126 if isinstance(elem, collections.abc.Mapping):
127 try:
--> 128 return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
129 except TypeError:
130 # The mapping type may not support `__init__(iterable)`.
131 return {key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem}
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py:120, in collate(batch, collate_fn_map)
118 if collate_fn_map is not None:
119 if elem_type in collate_fn_map:
--> 120 return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
122 for collate_type in collate_fn_map:
123 if isinstance(elem, collate_type):
File ~/anaconda3/envs/cilm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py:163, in collate_tensor_fn(batch, collate_fn_map)
161 storage = elem.storage()._new_shared(numel, device=elem.device)
162 out = elem.new(storage).resize_(len(batch), *list(elem.size()))
--> 163 return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [212] at entry 0 and [1024] at entry 1
I found other, somewhat similar questions on Stack Overflow and elsewhere, but they seem to concern the stack function in other settings (link, link, link, link).
A similar issue exists in the train_dataloader:
RuntimeError: stack expects each tensor to be equal size, but got [930] at entry 0 and [1024] at entry 1
Update
Solved it thanks to @chro and this reddit post: "To isolate the problem loop over the items in the dataloader with batch size 1 without shuffle and print the shape of the array you got. Then investigate the ones with different sizes".
It seems there was a sequence that wasn't of length 1024, but for some reason this only becomes visible when iterating with a batch size of 1. I'm not entirely sure how the dataset ended up holding tensors of varying lengths, but alas. To resolve the issue, I first filtered my dataset to remove the one sequence that was not of length 1024, and then created the DataLoader on the filtered dataset.
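The audit-and-filter step described in the update might look roughly like the sketch below. It is not the asker's actual code: the helper names length_histogram and indices_with_length are hypothetical, and the sketch assumes only that the dataset supports len() and integer indexing and returns a dict with an input_ids entry. A plain list stands in for val_dataset so the example runs without PyTorch.

```python
from collections import Counter


def length_histogram(dataset, key="input_ids"):
    """Count how many items have each sequence length (hypothetical helper)."""
    return Counter(len(dataset[i][key]) for i in range(len(dataset)))


def indices_with_length(dataset, expected_len, key="input_ids"):
    """Return indices of items whose sequence has exactly expected_len entries."""
    return [i for i in range(len(dataset)) if len(dataset[i][key]) == expected_len]


# Toy stand-in for val_dataset: one item is shorter than the rest.
toy_dataset = [
    {"input_ids": [0] * 1024},
    {"input_ids": [0] * 212},
    {"input_ids": [0] * 1024},
]

hist = length_histogram(toy_dataset)
keep = indices_with_length(toy_dataset, expected_len=1024)

print(hist)  # lengths seen and how often each occurs
print(keep)  # indices that are safe to batch together
```

With a real map-style PyTorch dataset, one option is to wrap the kept indices in torch.utils.data.Subset(val_dataset, keep) before building the DataLoader. Whether that suits a given dataset class is an assumption to check against your own setup.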
Answer 1
Score: 1
Could you debug it with the following (replace batch.shape with the code relevant to your data)?
eval_dataloader = DataLoader(val_dataset, batch_size=1)
for step, batch in enumerate(eval_dataloader):
    if batch.shape[1] != 1024:
        print(step, batch.shape)
My idea is to check the following:
- Does it fail on the same item in the dataset?
- What is the shape of the item it fails on?
I usually see this error when the DataLoader stacks several elements into a batch but some of them differ in size.
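To illustrate why the mismatch appears only with a batch size above 1, and why drop_last=True does not help, here is a small pure-Python model of sequential batching. It is a sketch rather than PyTorch's implementation: failing_batches is a hypothetical helper, and it works on a list of item lengths instead of real tensors. A batch is flagged when its items do not all share one length, which is the condition under which stacking fails.

```python
def failing_batches(lengths, batch_size, drop_last=False):
    """Return indices of batches whose items do not all share one length.

    Items are grouped in order (no shuffling), batch_size at a time.
    A batch holding a single item can never be flagged.
    """
    bad = []
    for batch_idx, start in enumerate(range(0, len(lengths), batch_size)):
        chunk = lengths[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # an incomplete final batch is discarded
        if len(set(chunk)) > 1:
            bad.append(batch_idx)
    return bad


# One short item hidden among 1024-length items.
lengths = [1024, 1024, 212, 1024, 1024]

print(failing_batches(lengths, batch_size=1))                  # []
print(failing_batches(lengths, batch_size=2))                  # [1]
print(failing_batches(lengths, batch_size=2, drop_last=True))  # [1]
```

In this model the short item sits in the middle of the data, so dropping the last batch leaves the failing batch untouched. With shuffle=True the failing batch index would change from run to run, which fits the advice to debug without shuffling.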
Please also post the complete stack trace for the problem.
Update:
To resolve the issue, first filter the dataset to remove the one sequence whose length differs from the others, then create the DataLoader on the filtered dataset.