Tensorflow.data.Dataset.rejection_resample modifies my dataset's element_spec
I am trying to use `tf.data.Dataset.rejection_resample` to balance my dataset, but I am running into an issue in which the method modifies the `element_spec` of my dataset, making it incompatible with my models.
The original element spec of my dataset is:
```
({'input_A': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
  'input_B': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
 TensorSpec(shape=(None, 1, 1), dtype=tf.int64, name=None))
```
This is the element spec after batching.
However, if I run `rejection_resample` (before batching), the element spec at the end becomes:
```
(TensorSpec(shape=(None,), dtype=tf.int64, name=None),
 ({'input_A': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
   'input_B': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
  TensorSpec(shape=(None, 1, 1), dtype=tf.int64, name=None)))
```
So `rejection_resample` is adding another `tf.int64` tensor at the beginning of my data, and I can't figure out what it is for. My problem is that this breaks compatibility between the input data and my model, since the model depends on the original input tuple.

Furthermore, it also causes an inconsistency between the training and validation data. I was expecting to apply `rejection_resample` only on the training data, but if I do that, the training dataset will have the added tensor while the validation one won't.

So my question is: what is this tensor added to the element spec, and is there any way to drop an element from the dataset after building it? Thank you.
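For reference, the resampling call I am using looks roughly like this (a minimal sketch only; `class_func`, `target_dist`, and the batch size below are placeholders rather than my exact code, and `ds` stands for the unbatched dataset described above):

```python
# Sketch only: class_func, target_dist and the batch size are placeholders.
# `ds` yields ({'input_A': ..., 'input_B': ...}, label) elements before batching.
resampled = ds.rejection_resample(
    class_func=lambda features, label: tf.cast(tf.reshape(label, []), tf.int32),
    target_dist=[0.5, 0.5],
)
batched = resampled.batch(32)
```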
# Answer 1

**Score**: 1
I can't tell you where the added tensor comes from, but here is an example of how to remove/drop it from your dataset:

```python
import tensorflow as tf
import numpy as np

# Create a sample dataset shaped like your 'wrong' output:
# an extra leading tensor followed by the original pair of tensors
ds = tf.data.Dataset.from_tensor_slices(
    (np.arange(-10, 0), (tf.constant(np.arange(10)), tf.constant(np.arange(10, 20))))
)

# Remove the new 'wrong' leading tensor
dds = ds.map(lambda x, y: y)

# Check the new dataset
for i in dds.take(2):
    print(i)
```

Keep in mind that this is a workaround and doesn't remove the source that causes the additional tensor.
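Applied to the dataset from the question, where the unwanted tensor is the first element of each pair, the same idea would look roughly like this (a sketch; `resampled_ds` is an assumed name for the dataset returned by `rejection_resample`):

```python
# Hypothetical name: resampled_ds is the dataset produced by rejection_resample.
# Keep only the second element, restoring the original (features, label) structure.
restored_ds = resampled_ds.map(lambda extra, data: data)
```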
# Answer 2

**Score**: 1
Suppose I create the same dataset as yours:

```python
import tensorflow as tf

x = tf.random.normal((7000, 900, 1))
y = tf.random.normal((7000, 900, 1))
z = tf.random.uniform((7000, 1, 1), 1, 2, dtype=tf.int32)

# Now convert it to a tf.data.Dataset object
dataset = tf.data.Dataset.from_tensor_slices(((x, y), z))
func = lambda x, y: ({'input_A': x[0], 'input_B': x[1]}, y)
dataset = dataset.map(func)
```

After mapping, my `dataset` looks exactly like yours:

```
<MapDataset element_spec=({'input_A': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None), 'input_B': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None)}, TensorSpec(shape=(1, 1), dtype=tf.int32, name=None))>
```

Now, I have to remove this last `Tensor`:

```python
disjoint_func = lambda x, y: x
dataset = dataset.map(disjoint_func)
```

Now the extra `Tensor` has been removed:

```
<MapDataset element_spec={'input_A': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None), 'input_B': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None)}>
```

Hope this helps.
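If, as in the question, `rejection_resample` is applied only to the training split, dropping the leading tensor the same way should also bring the training `element_spec` back in line with the validation data (a sketch; `train_ds` and `val_ds` are assumed names):

```python
# Assumed names: train_ds has been through rejection_resample, val_ds has not.
train_ds = train_ds.map(lambda extra, data: data)

# Both splits should now expose the same ({'input_A', 'input_B'}, label) structure.
print(train_ds.element_spec)
print(val_ds.element_spec)
```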