How does one create a PyTorch DataLoader using an interleaved Hugging Face dataset?


Question

When I interleave datasets, take a tokenized batch, and feed the batch to a PyTorch DataLoader, I get errors:

```python
# -*- coding: utf-8 -*-
"""issues with dataloader and custom data sets

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1sbs95as_66mtK9VK_vbaE9gLE-Tjof1-
"""
!pip install datasets
!pip install pytorch
!pip install transformers

token = None
batch_size = 10

from datasets import load_dataset
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
probe_network = probe_network.to(device)

# -- Get batch from dataset
from datasets import load_dataset
# path, name = 'brando/debug1_af', 'debug1_af'
path, name = 'brando/debug0_af', 'debug0_af'
remove_columns = []
dataset = load_dataset(path, name, streaming=True, split="train", token=token).with_format("torch")
print(f'{dataset=}')
batch = dataset.take(batch_size)
# print(f'{next(iter(batch))=}')

# - Prepare functions to tokenize batch
def preprocess(examples):  # gets the raw text batch according to the specific names in table in data set & tokenize
    return tokenizer(examples["link"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
def map(batch):  # apply preprocess to all examples in batch represented as a dataset
    return batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = map(batch)
# print(f'{next(iter(tokenized_batch))=}')

from torch.utils.data import Dataset, DataLoader, SequentialSampler
dataset = tokenized_batch
print(f'{type(dataset)=}')
print(f'{dataset.__class__=}')
print(f'{isinstance(dataset, Dataset)=}')
# for i, d in enumerate(dataset):
#     assert isinstance(d, dict)
#     # dd = dataset[i]
#     # assert isinstance(dd, dict)

loader_opts = {}
classifier_opts = {}
# data_loader = DataLoader(dataset, shuffle=False, batch_size=loader_opts.get('batch_size', 1),
#                          num_workers=loader_opts.get('num_workers', 0), drop_last=False, sampler=SequentialSampler(range(512)))
data_loader = DataLoader(dataset, shuffle=False, batch_size=loader_opts.get('batch_size', 1),
                         num_workers=loader_opts.get('num_workers', 0), drop_last=False, sampler=None)

print(f'{iter(data_loader)=}')
print(f'{next(iter(data_loader))=}')
print('Done\a')
```

with error:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py in collate(batch, collate_fn_map)
    126     try:
--> 127         return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
    128     except TypeError:

9 frames

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py in collate(batch, collate_fn_map)
    148         return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]
    149
--> 150     raise TypeError(default_collate_err_msg_format.format(elem_type))
    151
    152

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>
```

Why? And why don't the single datasets c4 and wikitext give this error, but only interleaved datasets do?

Ideally I don't want to write my own collate_function.

Answer 1

Score: 0

For some reason, when the datasets are interleaved, the collate function gets confused: the interleaved stream has extra columns, and fields that exist in one source but not the other come back as None, so the default collate doesn't know how to merge the examples (see the minimal sketch after the snippet below). The way I fixed it is by only keeping the columns I want:

```python
# -- Get data set
# remove_columns = ['text', 'timestamp', 'url']
keep_col = ['text']
# keep the strings in dataset.column_names that intersect with the keep_col str list, one liner
print('-- interleaving datasets')
datasets = [load_dataset(path, name, streaming=True, split="train").with_format("torch") for path, name in zip(path, name)]
[print(f'{dataset.description=}') for dataset in datasets]
dataset = interleave_datasets(datasets, probabilities)
remove_columns = [col for col in dataset.column_names if col not in keep_col]
print(f'{dataset=}')
batch = dataset.take(batch_size)
```
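
For context, here is a minimal sketch of why the default collate chokes on the interleaved stream. The column names are assumptions based on what c4 and wikitext typically provide: wikitext rows have no `timestamp`/`url`, so those fields come back as `None` after interleaving.

```python
# Minimal sketch (assumed c4/wikitext-style column names): interleaving datasets
# with different columns leaves the missing fields as None, and PyTorch's default
# collate cannot handle a None value.
from torch.utils.data import default_collate  # exposed in recent PyTorch versions

wikitext_like_row = {"text": "a wikitext example", "timestamp": None, "url": None}
c4_like_row = {"text": "a c4 example", "timestamp": "2019-04-25", "url": "https://example.com"}

default_collate([wikitext_like_row, c4_like_row])
# TypeError: default_collate: batch must contain tensors, numpy arrays, numbers,
# dicts or lists; found <class 'NoneType'>
```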

But doing the tokenization inside the collate function also works, if you know which text field you want (assuming `"text"` because it is so common):

  1. def collate_tokenize(data):
  2. print(f'{data[0]=}')
  3. text_batch = [element["text"] for element in data]
  4. tokenized = tokenizer(text_batch, padding='longest', truncation=True, return_tensors='pt')
  5. return tokenized
  6. data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False, collate_fn=collate_tokenize)
  7. # data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False)
  8. # num_batches = len(list(data_loader))
  9. batch = next(iter(data_loader))
  10. print(f'{batch=}')
  11. print('Done!\a')
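
Note that with this `collate_fn` the elements handed to the `DataLoader` still need a raw `"text"` field. A sketch under that assumption, reusing `dataset` (the interleaved stream) and `batch_size` from above, would be to skip the tokenizing `.map()` entirely:

```python
# Sketch: let the collate_fn do the tokenization on the raw interleaved stream,
# so no column clean-up is needed (non-text columns are simply never touched).
raw_batch = dataset.take(batch_size)  # un-tokenized examples that still carry "text"
data_loader = DataLoader(raw_batch, shuffle=False, batch_size=8, num_workers=0,
                         drop_last=False, collate_fn=collate_tokenize)
print(f'{next(iter(data_loader))=}')
```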

Full code:

```python
def test_interleaved_data_set_2_data_loader():
    """ https://colab.research.google.com/drive/1QWDhA6Q64qijXYnwIGn63Aq9Eg5qt8tQ#scrollTo=Wjyy6QYimvIm """
    remove_columns = []
    # -- Get probe network
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    from datasets import interleave_datasets
    path, name = ['c4', 'wikitext'], ['en', 'wikitext-103-v1']
    probabilities = [1.0 / len(path)] * len(path)
    batch_size = 512

    # -- Get data set
    # remove_columns = ['text', 'timestamp', 'url']
    keep_col = ['text']
    # keep the strings in dataset.column_names that intersect with the keep_col str list, one liner
    print('-- interleaving datasets')
    datasets = [load_dataset(path, name, streaming=True, split="train").with_format("torch") for path, name in zip(path, name)]
    [print(f'{dataset.description=}') for dataset in datasets]
    dataset = interleave_datasets(datasets, probabilities)
    remove_columns = [col for col in dataset.column_names if col not in keep_col]
    print(f'{dataset=}')
    batch = dataset.take(batch_size)

    # - Prepare functions to tokenize batch
    def preprocess(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
    def map(batch):
        return batch.map(preprocess, batched=True, remove_columns=remove_columns)
    # tokenized_batch = batch.map(preprocess, batched=True, remove_columns=remove_columns)
    tokenized_batch = map(batch)
    print(f'{next(iter(tokenized_batch))=}')

    # -- Get data loader
    from torch.utils.data import DataLoader, Dataset
    # def collate_tokenize(data):
    #     print(f'{data[0]=}')
    #     text_batch = [element["text"] for element in data]
    #     tokenized = tokenizer(text_batch, padding='longest', truncation=True, return_tensors='pt')
    #     return tokenized
    # data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False, collate_fn=collate_tokenize)
    data_loader = DataLoader(tokenized_batch, shuffle=False, batch_size=8, num_workers=0, drop_last=False)
    # num_batches = len(list(data_loader))
    batch = next(iter(data_loader))
    print(f'{batch=}')
    print('Done!\a')
```
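
As a rough sanity check (assuming the settings above: `max_length=128` in `preprocess` and `batch_size=8` in the `DataLoader`), the batch pulled at the end of the function should be a dict whose tensor entries have shape `[8, 128]`:

```python
# Sketch of a check you could append inside the function above, right after
# `batch = next(iter(data_loader))` (names taken from that function).
assert batch['input_ids'].shape == (8, 128), f"unexpected shape: {batch['input_ids'].shape}"
assert batch['attention_mask'].shape == (8, 128)
# If the 'text' column was kept (keep_col=['text']), batch['text'] should just be a
# plain Python list of 8 strings rather than a tensor.
```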
