How to slice an existing tf.data dataset into elements of a new dataset


Question

How do I create a tf.data dataset out of an existing tf.data dataset whose elements consist of two four-dimensional arrays? I have a dataset of images and corresponding segmentation masks, so I create a tf.data dataset from the image and mask paths and apply some preprocessing functions to it. After this step the images and masks have shapes [x, h, w, c] and [x, h, w, c], so when using dataset.as_numpy_iterator() I get two arrays of these shapes. Now I want to create a dataset whose elements will be two arrays of shape [h, w, c] and [h, w, c], where each slice along the first dimension becomes a separate element of the dataset. So if my dataset initially had 10 elements, it should now have 10 * x elements. But I am not able to slice the elements out of the existing dataset. This is what I have tried:

dataset = tf.data.Dataset.from_tensor_slices((imagepath, maskpath))
dataset = dataset.map(lambda imagepath, maskpath: tf.py_function(preprocessData, 
                                                inp=[imagepath, maskpath], 
                                                Tout=[tf.float64]*2))
datasetnew = tf.data.Dataset.from_tensor_slices(dataset)

The error I get is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_34/343231121.py in <module>
      3                                                 inp=[flairimg_val, msk_val],
      4                                                 Tout=[tf.float64]*2))
----> 5 datasetnew = tf.data.Dataset.from_tensor_slices(datasetval)
      6 # datasetval = datasetval.map(lambda flairimg_val, msk_val, path: get_2p5D_repre(flairimg_val, msk_val, path))
      7 # datasetval = datasetval.map(lambda flairimg_val, msk_val, path: try_return(flairimg_val, msk_val, path))

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py in from_tensor_slices(tensors)
    758       Dataset: A `Dataset`.
    759     """
--> 760     return TensorSliceDataset(tensors)
    761 
    762   class _GeneratorState(object):

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, element)
   3320     element = structure.normalize_element(element)
   3321     batched_spec = structure.type_spec_from_value(element)
-> 3322     self._tensors = structure.to_batched_tensor_list(batched_spec, element)
   3323     self._structure = nest.map_structure(
   3324         lambda component_spec: component_spec._unbatch(), batched_spec)  # pylint: disable=protected-access

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/util/structure.py in to_batched_tensor_list(element_spec, element)
    362   # pylint: disable=protected-access
    363   # pylint: disable=g-long-lambda
--> 364   return _to_tensor_list_helper(
    365       lambda state, spec, component: state + spec._to_batched_tensor_list(
    366           component), element_spec, element)

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/util/structure.py in _to_tensor_list_helper(encode_fn, element_spec, element)
    337     return encode_fn(state, spec, component)
    338 
--> 339   return functools.reduce(
    340       reduce_fn, zip(nest.flatten(element_spec), nest.flatten(element)), [])
    341 

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/util/structure.py in reduce_fn(state, value)
    335   def reduce_fn(state, value):
    336     spec, component = value
--> 337     return encode_fn(state, spec, component)
    338 
    339   return functools.reduce(

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/util/structure.py in <lambda>(state, spec, component)
    363   # pylint: disable=g-long-lambda
    364   return _to_tensor_list_helper(
--> 365       lambda state, spec, component: state + spec._to_batched_tensor_list(
    366           component), element_spec, element)

/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py in _to_batched_tensor_list(self, value)
   3492   def _to_batched_tensor_list(self, value):
   3493     if self._dataset_shape.ndims == 0:
-> 3494       raise ValueError("Unbatching a dataset is only supported for rank >= 1")
   3495     return self._to_tensor_list(value)
   3496 

ValueError: Unbatching a dataset is only supported for rank >= 1

I am not sure what "rank" means here for a dataset. How do I achieve this?

Here, "rank" refers to the number of dimensions of a value, and unbatching requires rank >= 1. tf.data.Dataset.from_tensor_slices slices ("unbatches") its input along the first dimension, but a Dataset object passed to it is treated as a scalar of rank 0, so there is no dimension to slice along. To achieve your goal, you could instead split the data into two separate datasets and then combine them with tf.data.Dataset.zip. Here is one possible approach:

image_dataset = tf.data.Dataset.from_tensor_slices(imagepath)
mask_dataset = tf.data.Dataset.from_tensor_slices(maskpath)

# Apply the same preprocessing to both datasets (this assumes
# preprocessData can be called with a single path at a time)
image_dataset = image_dataset.map(lambda imagepath: tf.py_function(preprocessData, inp=[imagepath], Tout=tf.float64))
mask_dataset = mask_dataset.map(lambda maskpath: tf.py_function(preprocessData, inp=[maskpath], Tout=tf.float64))

# Combine the datasets, then split each [x, h, w, c] pair along the
# first axis so that every slice becomes its own element
combined_dataset = tf.data.Dataset.zip((image_dataset, mask_dataset)).unbatch()

Zipping pairs the two datasets element by element, and unbatch then turns each slice along the first dimension into a separate element, so the elements of combined_dataset are pairs of arrays of shape [h, w, c]. If the original dataset had 10 elements, combined_dataset will now have 10 * x elements.
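
As a quick sanity check, you can pull a few elements and inspect their shapes (this assumes preprocessData returns a float64 tensor of shape [x, h, w, c] for each path):

# Each element should now be a single image/mask pair of rank 3
for image, mask in combined_dataset.take(2):
    print(image.shape, mask.shape)  # expected: (h, w, c) and (h, w, c)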


Answer 1

Score: 1


You're looking for the unbatch method:

> Splits elements of a dataset into multiple elements.
>
> For example, if elements of the dataset are shaped [B, a0, a1, ...],
> where B may vary for each input element, then for each element in the
> dataset, the unbatched dataset will contain B consecutive elements of
> shape [a0, a1, ...].
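
Applied to the pipeline from the question, a minimal sketch could look like this (preprocessData, imagepath and maskpath are the names from the question; the set_shape step is an assumption I am adding, since tf.py_function discards static shape information):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((imagepath, maskpath))
dataset = dataset.map(lambda img, msk: tf.py_function(preprocessData,
                                                      inp=[img, msk],
                                                      Tout=[tf.float64] * 2))

# tf.py_function leaves output shapes unknown, so restore the rank to let
# downstream stages know each element is a pair of 4-D [x, h, w, c] tensors.
def restore_rank(image, mask):
    image.set_shape([None, None, None, None])
    mask.set_shape([None, None, None, None])
    return image, mask

dataset = dataset.map(restore_rank)

# unbatch() splits each [x, h, w, c] pair along the first axis, yielding
# x consecutive ([h, w, c], [h, w, c]) elements per original element.
dataset = dataset.unbatch()

Note that unbatch operates on the existing dataset directly, so there is no need to route it back through tf.data.Dataset.from_tensor_slices.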
