How does one create a PyTorch DataLoader using an interleaved Hugging Face dataset?


Question

When I interleave datasets, take a tokenized batch, and feed the batch to a PyTorch DataLoader, I get errors:

```python
# -*- coding: utf-8 -*-
"""issues with dataloader and custom data sets

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1sbs95as_66mtK9VK_vbaE9gLE-Tjof1-
"""
!pip install datasets
!pip install pytorch
!pip install transformers

token = None
batch_size = 10

from datasets import load_dataset
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
probe_network = probe_network.to(device)

# -- Get batch from dataset
from datasets import load_dataset
# path, name = 'brando/debug1_af', 'debug1_af'
path, name = 'brando/debug0_af', 'debug0_af'
remove_columns = []
dataset = load_dataset(path, name, streaming=True, split="train", token=token).with_format("torch")
print(f'{dataset=}')
batch = dataset.take(batch_size)
# print(f'{next(iter(batch))=}')

# - Prepare functions to tokenize batch
def preprocess(examples):  # gets the raw text batch according to the specific names in table in data set & tokenize
    return tokenizer(examples["link"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
def map(batch):  # apply preprocess to all examples in batch represented as a dataset
    return batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = map(batch)
# print(f'{next(iter(tokenized_batch))=}')

from torch.utils.data import Dataset, DataLoader, SequentialSampler
dataset = tokenized_batch
print(f'{type(dataset)=}')
print(f'{dataset.__class__=}')
print(f'{isinstance(dataset, Dataset)=}')
# for i, d in enumerate(dataset):
#     assert isinstance(d, dict)
#     # dd = dataset[i]
#     # assert isinstance(dd, dict)

loader_opts = {}
classifier_opts = {}
# data_loader = DataLoader(dataset, shuffle=False, batch_size=loader_opts.get('batch_size', 1),
#                          num_workers=loader_opts.get('num_workers', 0), drop_last=False, sampler=SequentialSampler(range(512)))
data_loader = DataLoader(dataset, shuffle=False, batch_size=loader_opts.get('batch_size', 1),
                         num_workers=loader_opts.get('num_workers', 0), drop_last=False, sampler=None)

print(f'{iter(data_loader)=}')
print(f'{next(iter(data_loader))=}')
print('Done\a')
```

with error:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py in collate(batch, collate_fn_map)
    126     try:
--> 127         return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
    128     except TypeError:

9 frames

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py in collate(batch, collate_fn_map)
    148         return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]
    149
--> 150     raise TypeError(default_collate_err_msg_format.format(elem_type))
    151
    152

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>
```

Why? And why don't the single datasets c4 and wikitext give this error, but only interleaved datasets do?

Ideally I don't want to write my own collate_function.

Answer 1

Score: 0

For some reason, when the datasets are interleaved, the collate function gets confused: the interleaved stream has extra columns, and fields that exist in one source but not the other come back as None, so the default collate doesn't know how to merge the examples (see the minimal sketch after the snippet below). The way I fixed it is by only keeping the columns I want:

```python
# -- Get data set
# remove_columns = ['text', 'timestamp', 'url']
keep_col = ['text']
# keep the strings in dataset.column_names that intersect with the keep_col str list, one liner
print('-- interleaving datasets')
datasets = [load_dataset(path, name, streaming=True, split="train").with_format("torch") for path, name in zip(path, name)]
[print(f'{dataset.description=}') for dataset in datasets]
dataset = interleave_datasets(datasets, probabilities)
remove_columns = [col for col in dataset.column_names if col not in keep_col]
print(f'{dataset=}')
batch = dataset.take(batch_size)
```
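
For context, here is a minimal sketch of why the default collate chokes on the interleaved stream. The column names are assumptions based on what c4 and wikitext typically provide: wikitext rows have no `timestamp`/`url`, so those fields come back as `None` after interleaving.

```python
# Minimal sketch (assumed c4/wikitext-style column names): interleaving datasets
# with different columns leaves the missing fields as None, and PyTorch's default
# collate cannot handle a None value.
from torch.utils.data import default_collate  # exposed in recent PyTorch versions

wikitext_like_row = {"text": "a wikitext example", "timestamp": None, "url": None}
c4_like_row = {"text": "a c4 example", "timestamp": "2019-04-25", "url": "https://example.com"}

default_collate([wikitext_like_row, c4_like_row])
# TypeError: default_collate: batch must contain tensors, numpy arrays, numbers,
# dicts or lists; found <class 'NoneType'>
```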

But doing the tokenization inside the collate function also works, if you know which text field you want (assuming `"text"` because it is so common):

  1. def collate_tokenize(data):
  2. print(f'{data[0]=}')
  3. text_batch = [element["text"] for element in data]
  4. tokenized = tokenizer(text_batch, padding='longest', truncation=True, return_tensors='pt')
  5. return tokenized
  6. data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False, collate_fn=collate_tokenize)
  7. # data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False)
  8. # num_batches = len(list(data_loader))
  9. batch = next(iter(data_loader))
  10. print(f'{batch=}')
  11. print('Done!\a')
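
Note that with this `collate_fn` the elements handed to the `DataLoader` still need a raw `"text"` field. A sketch under that assumption, reusing `dataset` (the interleaved stream) and `batch_size` from above, would be to skip the tokenizing `.map()` entirely:

```python
# Sketch: let the collate_fn do the tokenization on the raw interleaved stream,
# so no column clean-up is needed (non-text columns are simply never touched).
raw_batch = dataset.take(batch_size)  # un-tokenized examples that still carry "text"
data_loader = DataLoader(raw_batch, shuffle=False, batch_size=8, num_workers=0,
                         drop_last=False, collate_fn=collate_tokenize)
print(f'{next(iter(data_loader))=}')
```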

Full code:

```python
def test_interleaved_data_set_2_data_loader():
    """ https://colab.research.google.com/drive/1QWDhA6Q64qijXYnwIGn63Aq9Eg5qt8tQ#scrollTo=Wjyy6QYimvIm """
    remove_columns = []
    # -- Get probe network
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    from datasets import interleave_datasets
    path, name = ['c4', 'wikitext'], ['en', 'wikitext-103-v1']
    probabilities = [1.0 / len(path)] * len(path)
    batch_size = 512

    # -- Get data set
    # remove_columns = ['text', 'timestamp', 'url']
    keep_col = ['text']
    # keep the strings in dataset.column_names that intersect with the keep_col str list, one liner
    print('-- interleaving datasets')
    datasets = [load_dataset(path, name, streaming=True, split="train").with_format("torch") for path, name in zip(path, name)]
    [print(f'{dataset.description=}') for dataset in datasets]
    dataset = interleave_datasets(datasets, probabilities)
    remove_columns = [col for col in dataset.column_names if col not in keep_col]
    print(f'{dataset=}')
    batch = dataset.take(batch_size)

    # - Prepare functions to tokenize batch
    def preprocess(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
    def map(batch):
        return batch.map(preprocess, batched=True, remove_columns=remove_columns)
    # tokenized_batch = batch.map(preprocess, batched=True, remove_columns=remove_columns)
    tokenized_batch = map(batch)
    print(f'{next(iter(tokenized_batch))=}')

    # -- Get data loader
    from torch.utils.data import DataLoader, Dataset
    # def collate_tokenize(data):
    #     print(f'{data[0]=}')
    #     text_batch = [element["text"] for element in data]
    #     tokenized = tokenizer(text_batch, padding='longest', truncation=True, return_tensors='pt')
    #     return tokenized
    # data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False, collate_fn=collate_tokenize)
    data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False)
    # num_batches = len(list(data_loader))
    batch = next(iter(data_loader))
    print(f'{batch=}')
    print('Done!\a')
```
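
As a rough sanity check (assuming the settings above: `max_length=128` in `preprocess` and `batch_size=8` in the `DataLoader`), the batch pulled at the end of the function should be a dict whose tensor entries have shape `[8, 128]`:

```python
# Sketch of a check you could append inside the function above, right after
# `batch = next(iter(data_loader))` (names taken from that function).
assert batch['input_ids'].shape == (8, 128), f"unexpected shape: {batch['input_ids'].shape}"
assert batch['attention_mask'].shape == (8, 128)
# If the 'text' column was kept (keep_col=['text']), batch['text'] should just be a
# plain Python list of 8 strings rather than a tensor.
```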
