Can I convert an `IterableDataset` to a `Dataset`?

Question

I want to load a large dataset, apply some transformations to some fields, sample a small subset of the results, and store it as files so that I can later load it directly from there.

Basically something like this:

ds = datasets.load_dataset("XYZ", name="ABC", split="train", streaming=True)
ds = ds.map(_transform_record)
ds.shuffle()[:N].save_to_disk(...)

IterableDataset doesn't have a save_to_disk() method. That makes sense, as it's backed by an iterator, but then I'd expect some way to convert an iterable dataset to a regular dataset (by iterating over all of it and storing it in memory or on disk, nothing too fancy).

I tried Dataset.from_generator() with the IterableDataset as the generator (iter(ds)), but that doesn't work because it tries to serialize the generator object.

Is there an easy way to do this, like to_iterable_dataset() but in reverse?

Answer 1

Score: 1


You must cache an IterableDataset to disk to load it as a Dataset. One way to do this is with Dataset.from_generator:

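from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    # Yield every example from the streaming dataset, one at a time.
    yield from iterable_ds

# iterable_ds is the IterableDataset to convert; the examples are written to an on-disk cache.
ds = Dataset.from_generator(partial(gen_from_iterable_dataset, iterable_ds), features=iterable_ds.features)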

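On why the dataset is wrapped in a function instead of passing iter(ds) directly: Dataset.from_generator expects a callable that returns a generator, and a generator object is neither callable nor serializable. The sketch below illustrates this with the standard library only. It assumes that pickle is a fair stand-in for the serializer the library uses internally, and gen_records and records are hypothetical names used for illustration.

```python
import pickle
from functools import partial


def gen_records(records):
    # A generator function: calling it returns a fresh generator object.
    yield from records


records = [{"id": 0}, {"id": 1}, {"id": 2}]

# A generator object (what iter(ds) gives you) cannot be pickled.
gen_obj = gen_records(records)
try:
    pickle.dumps(gen_obj)
    generator_pickled = True
except TypeError:
    generator_pickled = False

# A partial over the generator function is a callable that can be
# invoked repeatedly, producing a fresh pass over the data each time.
make_gen = partial(gen_records, records)
first_pass = list(make_gen())
second_pass = list(make_gen())

print(generator_pickled, callable(gen_obj), callable(make_gen))
```

The partial carries the function and its argument rather than a half-consumed iterator, which is what lets the library serialize it and call it when it needs the data.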
huangapple
  • Posted on May 11, 2023, 19:37:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76227219.html