Tensorflow.data.Dataset.rejection_resample modifies my dataset's element_spec
I am trying to use `tf.data.Dataset.rejection_resample` to balance my dataset, but I am running into an issue in which the method modifies the `element_spec` of my dataset, making it incompatible with my models.
The original element spec of my dataset is:
```
({'input_A': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
  'input_B': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
 TensorSpec(shape=(None, 1, 1), dtype=tf.int64, name=None))
```
This is the element spec after batching.
However, if I run `rejection_resample` (before batching), the element spec at the end becomes:
```
(TensorSpec(shape=(None,), dtype=tf.int64, name=None),
 ({'input_A': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
   'input_B': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
  TensorSpec(shape=(None, 1, 1), dtype=tf.int64, name=None)))
```
So `rejection_resample` is adding another `tf.int64` tensor at the beginning of my data, and I can't figure out what it is for. My problem is that this breaks compatibility between the input data and my model, since the model depends on the original input tuple.

Furthermore, it also causes an inconsistency between the training and validation data. I was expecting to apply `rejection_resample` only on the training data, but if I do that, the training dataset will have the added tensor while the validation one won't.

So my question is: what is this tensor added to the element spec, and is there any way to drop an element from the dataset after building it? Thank you.
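For reference, the resampling call I am using looks roughly like this (a minimal sketch only; `class_func`, `target_dist`, and the batch size below are placeholders rather than my exact code, and `ds` stands for the unbatched dataset described above):

```python
# Sketch only: class_func, target_dist and the batch size are placeholders.
# `ds` yields ({'input_A': ..., 'input_B': ...}, label) elements before batching.
resampled = ds.rejection_resample(
    class_func=lambda features, label: tf.cast(tf.reshape(label, []), tf.int32),
    target_dist=[0.5, 0.5],
)
batched = resampled.batch(32)
```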
# Answer 1

**Score**: 1
I can't tell you where the added tensor comes from, but here is an example of how to remove/drop it from your dataset:

```python
import tensorflow as tf
import numpy as np

# Create a sample dataset shaped like your 'wrong' output:
# an extra leading tensor followed by the original pair of tensors
ds = tf.data.Dataset.from_tensor_slices(
    (np.arange(-10, 0), (tf.constant(np.arange(10)), tf.constant(np.arange(10, 20))))
)

# Remove the new 'wrong' leading tensor
dds = ds.map(lambda x, y: y)

# Check the new dataset
for i in dds.take(2):
    print(i)
```

Keep in mind that this is a workaround and doesn't remove the source that causes the additional tensor.
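Applied to the dataset from the question, where the unwanted tensor is the first element of each pair, the same idea would look roughly like this (a sketch; `resampled_ds` is an assumed name for the dataset returned by `rejection_resample`):

```python
# Hypothetical name: resampled_ds is the dataset produced by rejection_resample.
# Keep only the second element, restoring the original (features, label) structure.
restored_ds = resampled_ds.map(lambda extra, data: data)
```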
# Answer 2

**Score**: 1
Suppose I create the same dataset as yours:

```python
import tensorflow as tf

x = tf.random.normal((7000, 900, 1))
y = tf.random.normal((7000, 900, 1))
z = tf.random.uniform((7000, 1, 1), 1, 2, dtype=tf.int32)

# Now convert it to a tf.data.Dataset object
dataset = tf.data.Dataset.from_tensor_slices(((x, y), z))
func = lambda x, y: ({'input_A': x[0], 'input_B': x[1]}, y)
dataset = dataset.map(func)
```

After mapping, my `dataset` looks exactly like yours:

```
<MapDataset element_spec=({'input_A': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None), 'input_B': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None)}, TensorSpec(shape=(1, 1), dtype=tf.int32, name=None))>
```

Now, I have to remove this last `Tensor`:

```python
disjoint_func = lambda x, y: x
dataset = dataset.map(disjoint_func)
```

Now the extra `Tensor` has been removed:

```
<MapDataset element_spec={'input_A': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None), 'input_B': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None)}>
```

Hope this helps.
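If, as in the question, `rejection_resample` is applied only to the training split, dropping the leading tensor the same way should also bring the training `element_spec` back in line with the validation data (a sketch; `train_ds` and `val_ds` are assumed names):

```python
# Assumed names: train_ds has been through rejection_resample, val_ds has not.
train_ds = train_ds.map(lambda extra, data: data)

# Both splits should now expose the same ({'input_A', 'input_B'}, label) structure.
print(train_ds.element_spec)
print(val_ds.element_spec)
```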