Tokenizing very large text datasets (cannot fit in RAM/GPU Memory) with Tensorflow

Question


How do we tokenize very large text datasets that don't fit into memory in TensorFlow? For image datasets, there is the ImageDataGenerator, which loads the data to the model per batch and preprocesses it. However, for text datasets, tokenization is performed before training the model. Can the dataset be split into batches for the tokenizer, or does TensorFlow already provide a batch tokenization function? Can this be done without importing external libraries?

I know that there are external libraries that do this; see, for example, https://github.com/huggingface/transformers/issues/3851
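For concreteness, here is a minimal sketch (not part of the original question) of the usual in-memory approach the question refers to. The file path, vocabulary size, and sequence length are placeholders; the point is that fit_on_texts and texts_to_sequences need the whole corpus, and every tokenized sequence, to sit in RAM at once, which is exactly what breaks for very large datasets.

import tensorflow as tf

# Placeholder path; this loads the entire corpus into RAM.
texts = open("large_corpus.txt").read().splitlines()

# Assumed vocabulary size of 20,000 tokens.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=20_000)
tokenizer.fit_on_texts(texts)                    # whole corpus must be in memory
sequences = tokenizer.texts_to_sequences(texts)  # so must every tokenized sequence
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=128)
# model.fit(padded, labels, ...)  # only now does training start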

Answer 1

Score: 1


In your case you need to define your own data processing pipeline using the tf.data module. With this module you can build your own customized tf.data.Dataset. These datasets support many features, such as parsing records into a specific format (using the map function) or batching.

Here is a complete example of how you could use the tf.data module for building your own pipeline: https://www.tensorflow.org/guide/data
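For illustration, below is a minimal sketch of such a pipeline, assuming a single text file with one example per line; the file name, batch sizes, vocabulary size, and sequence length are placeholders. tf.data.TextLineDataset reads the file lazily, and a TextVectorization layer applied inside map tokenizes each batch on the fly, so the full dataset never has to fit in memory.

import tensorflow as tf

# Assumed vocabulary size and fixed sequence length.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000,
    output_sequence_length=128,
)

# TextLineDataset streams the file line by line instead of loading it at once.
raw_ds = tf.data.TextLineDataset("large_corpus.txt")   # placeholder path

# Build the vocabulary in a streaming fashion over batches of raw strings.
vectorizer.adapt(raw_ds.batch(1024))

# Tokenize per batch inside the pipeline; nothing is materialized up front.
train_ds = (
    raw_ds
    .batch(32)                                             # batch raw strings first
    .map(vectorizer, num_parallel_calls=tf.data.AUTOTUNE)  # tokenize each batch
    .prefetch(tf.data.AUTOTUNE)                            # overlap preprocessing and training
)

# model.fit(train_ds, ...)  # feed the streaming dataset to Keras directly

The same pattern extends to labelled data: zip the text dataset with a label dataset, or parse both from the same record inside the map call, before batching.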
