Tokenizing very large text datasets (cannot fit in RAM/GPU Memory) with Tensorflow


Question


How do we tokenize very large text datasets that don't fit into memory in Tensorflow? For image datasets, there is the ImageDataGenerator, which loads data to the model batch by batch and preprocesses it. For text datasets, however, tokenization is performed before training the model. Can the dataset be split into batches for the tokenizer, or is there a Tensorflow batch tokenizer function that already exists? Can this be done without having to import external libraries?

I know there are external libraries that do this; see, for example, https://github.com/huggingface/transformers/issues/3851

Answer 1

Score: 1


In your case you need to define your own data processing pipeline using the tf.data module. With it you can build your own customized tf.data.Dataset; such datasets support many features, such as parsing records into a specific format (using the map function) and batching.

Here is a complete example of how you could use the tf.data module for building your own pipeline: https://www.tensorflow.org/guide/data
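Below is a minimal sketch of such a pipeline, assuming TF 2.x, a large plain-text corpus stored one sample per line in a hypothetical file named corpus.txt, and the built-in TextVectorization layer as the tokenizer. Both the vocabulary building (adapt) and the tokenization itself stream the data in batches through tf.data, so the full corpus never has to fit in RAM:

```python
import tensorflow as tf

# Stream raw lines from disk instead of loading the whole corpus into memory.
# "corpus.txt" is a placeholder path: one training sample per line.
raw_ds = tf.data.TextLineDataset("corpus.txt")

# Build the vocabulary with a TextVectorization layer. adapt() also consumes
# the dataset as a stream; you can adapt on a subset (e.g. raw_ds.take(100_000))
# if a single pass over the full corpus is too slow.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000,           # cap the vocabulary size
    output_sequence_length=128,  # pad/truncate every sample to a fixed length
)
vectorizer.adapt(raw_ds.batch(1024))

# Tokenize lazily, batch by batch, inside the tf.data pipeline.
train_ds = (
    raw_ds
    .batch(64)
    .map(lambda text: vectorizer(text), num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

# train_ds now yields batches of integer token IDs that can be fed to model.fit().
```

The same pattern applies if you swap in a different tokenizer: keep the raw text on disk, batch it with tf.data, and apply the tokenization step inside map so only the current batch is materialized.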
