Can I convert an `IterableDataset` to `Dataset`?
Question
I want to load a large dataset, apply some transformations to some fields, sample a small subset of the results, and store it as files so I can later just load it from there.
Basically something like this:
ds = datasets.load_dataset("XYZ", name="ABC", split="train", streaming=True)
ds = ds.map(_transform_record)
ds.shuffle()[:N].save_to_disk(...)
`IterableDataset` doesn't have a `save_to_disk()` method. That makes sense, since it's backed by an iterator, but then I'd expect some way to convert an iterable dataset into a regular dataset (by iterating over all of it and storing it in memory or on disk, nothing too fancy).
I tried to use `Dataset.from_generator()` with the `IterableDataset` as the generator (`iter(ds)`), but it doesn't work because it tries to serialize the generator object.
Is there an easy way to do this, like `to_iterable_dataset()` but in the other direction?
Answer 1
Score: 1
You must cache an `IterableDataset` to disk to load it as a `Dataset`. One way to do this is with `Dataset.from_generator`:
from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    # Re-yield every example from the streaming dataset
    yield from iterable_ds

# from_generator consumes the generator, caches the examples to disk,
# and returns a regular (map-style) Dataset
ds = Dataset.from_generator(partial(gen_from_iterable_dataset, iterable_ds), features=iterable_ds.features)
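The `partial` wrapper is what makes this work: as the question notes, `from_generator` tries to serialize what it's given, so it wants a picklable callable rather than a live generator object, and wrapping the `IterableDataset` in a plain function gets around that.
For the workflow in the question, a minimal sketch of how the pieces could fit together is shown below. The dataset name, `_transform_record`, and `N` are placeholders carried over from the question; the shuffle seed, buffer size, and output path are made-up values. `shuffle()`/`take()` are applied to the stream first so that only the sampled records ever get materialized:
from functools import partial

from datasets import Dataset, load_dataset

def gen_from_iterable_dataset(iterable_ds):
    # Re-yield every example from the (already sampled) stream
    yield from iterable_ds

# Placeholders from the question: dataset name/config and _transform_record
ds = load_dataset("XYZ", name="ABC", split="train", streaming=True)
ds = ds.map(_transform_record)

# Shuffle the stream with a fixed-size buffer, then keep only the first N
# examples of the shuffled stream
sample = ds.shuffle(seed=42, buffer_size=10_000).take(N)

# Materialize the sample into a regular Dataset and save it (hypothetical path)
Dataset.from_generator(
    partial(gen_from_iterable_dataset, sample),
    features=sample.features,
).save_to_disk("sampled_subset")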