Keras: time per step increases with a filter on the number of samples, epoch time continues the same

Question


I'm implementing a simple sanity-check model in Keras for some data I have. My training dataset is comprised of about 550 files, each contributing about 150 samples. Each training sample has the following signature:

({'input_a': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
  'input_b': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
   TensorSpec(shape=(None, 1), dtype=tf.int64, name=None)
)

Essentially, each training sample is made up of two inputs with shape (900, 1), and the target is a single (binary) label. The first step of my model is a concatenation of inputs into a (900, 2) Tensor.
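The concatenation step described above can be sketched as follows (the layer wiring is illustrative; only the input names and shapes come from the signature above):

```python
import tensorflow as tf

# Two (900, 1) inputs, matching the sample signature, concatenated
# along the last axis into a (900, 2) tensor.
input_a = tf.keras.Input(shape=(900, 1), name="input_a", dtype=tf.float64)
input_b = tf.keras.Input(shape=(900, 1), name="input_b", dtype=tf.float64)
concat = tf.keras.layers.Concatenate(axis=-1)([input_a, input_b])
print(concat.shape)  # (None, 900, 2)
```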

The total number of training samples is about 70000.

As input to the model, I'm creating a tf.data.Dataset, and applying a few preparation steps:

  1. tf.Dataset.filter: to filter some samples with invalid labels
  2. tf.Dataset.shuffle
  3. tf.Dataset.filter: to undersample my training dataset
  4. tf.Dataset.batch
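On a toy dataset, the four steps above can be sketched like this (shapes, the `-1` invalid-label encoding, and the thresholds are illustrative, not the author's actual values):

```python
import tensorflow as tf

features = tf.random.uniform((100, 4))
# label -1 marks an invalid sample (illustrative encoding)
labels = tf.constant([-1 if i % 10 == 0 else i % 2 for i in range(100)],
                     dtype=tf.int64)
ds = tf.data.Dataset.from_tensor_slices((features, labels))

ds = ds.filter(lambda x, y: y >= 0)                        # 1. drop invalid labels
ds = ds.shuffle(buffer_size=100)                           # 2. shuffle
ds = ds.filter(lambda x, y: tf.random.uniform(()) >= 0.5)  # 3. undersample
ds = ds.batch(20)                                          # 4. batch
```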

Step 3 is the most important in my question. To undersample my dataset I apply a simple function:

import tensorflow as tf
from typing import Iterable

def undersampling(dataset: tf.data.Dataset, drop_proba: Iterable[float]) -> tf.data.Dataset:
    # Per-label drop probabilities as a constant tensor
    drop_prob_ = tf.constant(list(drop_proba), dtype=tf.float32)

    def undersample_function(x, y):
        # The sample's label selects its drop probability
        idx = y[0]
        p = drop_prob_[idx]

        # Keep the sample only if a uniform draw is >= its drop probability
        v = tf.random.uniform(shape=(), dtype=tf.float32)
        return tf.math.greater_equal(v, p)

    return dataset.filter(undersample_function)

Essentially, the function accepts a vector of probabilities drop_prob such that drop_prob[l] is the probability of dropping a sample with label l (the function is a bit convoluted, but it's the way I found to implement it as Dataset.filter). Using equal probabilities, say drop_prob=[0.9, 0.9], I'll be dropping about 90% of my samples.
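That claim can be checked quickly on synthetic labels (a hedged sketch using scalar labels rather than the (1,)-shaped labels of the actual dataset):

```python
import tensorflow as tf

labels = tf.random.uniform((10000,), maxval=2, dtype=tf.int64)
ds = tf.data.Dataset.from_tensor_slices(labels)

drop_prob = tf.constant([0.9, 0.9])

def keep(y):
    # keep a sample with probability 1 - drop_prob[label]
    return tf.random.uniform(()) >= drop_prob[y]

kept = sum(1 for _ in ds.filter(keep))
print(kept / 10000)  # roughly 0.1
```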

Now, the thing is, I've been experimenting with different undersampling rates for my dataset, in order to find a sweet spot between performance and training time, but when I undersample, the epoch duration stays the same, with the time per step increasing instead.

Keeping my batch_size fixed at 20000, for the complete dataset I have a total of 4 batches, and the following time for an average epoch:

Epoch 4/1000
1/4 [======>.......................] - ETA: 9s
2/4 [==============>...............] - ETA: 5s
3/4 [=====================>........] - ETA: 2s
4/4 [==============================] - ETA: 0s
4/4 [==============================] - 21s 6s/step

While if I undersample my dataset with drop_prob = [0.9, 0.9] (that is, I'm getting rid of about 90% of the dataset), keeping the same batch_size of 20000, I have 1 batch, and the following time for an average epoch:

Epoch 4/1000
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 22s 22s/step 

Notice that while the number of batches is only 1, the epoch time is the same! It just takes longer to process the batch.

Now, as a sanity check, I tried a different way of undersampling: filtering the files instead. So I selected about 55 of the training files (10%), to get a similar number of samples in a single batch, and removed the undersampling from the tf.Dataset. The epoch time decreases as expected:

Epoch 4/1000
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 2s 2s/step 

Note that the original dataset has 70014 training samples, while the undersampled dataset by means of tf.Dataset.filter had 6995 samples and the undersampled dataset by means of file filtering had 7018 samples, thus the numbers are consistent.

Much faster. In fact, it takes about 10% of the time the epoch takes with the full dataset. So there is an issue with the way I'm performing undersampling (using tf.data.Dataset.filter) when creating the tf.Dataset, and I would like some help figuring out what the issue is. Thanks.

Answer 1

Score: 1

It seems that most of the time is spent on the dataset operations rather than the network itself. From examining the evidence, my theory would be that if this is executed on a GPU (dataset operations are executed on the CPU regardless), then the GPU has to wait for the dataset between batches.
And since the dataset operations always take the same total time, the progress bar makes it look as if each batch takes longer.

If executed on a GPU, the right way to check whether this theory is correct is to observe GPU utilization (you can use watch -n 0.5 nvidia-smi as it runs, or better yet nvtop or any other GPU monitoring tool). If there are times when the utilization (not memory, but utilization) is not close to 100%, that is an indicator that this is indeed the problem. Note that it should not drop below 90%, not even for half a second.

To solve this, use Dataset.prefetch as the last dataset operation in your code; this makes the CPU fetch batches ahead of time, so a batch is always ready for the network and it doesn't have to wait.
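A minimal sketch of that placement (the pipeline itself is illustrative, not the asker's):

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((1000, 8)))
ds = ds.shuffle(1000).batch(100)
# prefetch goes last: the CPU prepares the next batch(es) while the
# accelerator consumes the current one; AUTOTUNE sizes the buffer.
ds = ds.prefetch(tf.data.AUTOTUNE)

n_batches = sum(1 for _ in ds)
print(n_batches)  # 10
```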

Answer 2

Score: 0

I can suggest trying to cache the dataset after the second filtering. As the docs say, you can store it either in memory or in a file. Basically, after the first iteration tf saves the dataset, which is then reused: this should also imply that the first epoch's random filtering determines the remaining samples, which will be the same for each epoch.
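A sketch of that placement, including the caveat that the random filter is frozen after the first epoch (the pipeline is illustrative):

```python
import tensorflow as tf

labels = tf.range(1000, dtype=tf.int64) % 2
ds = tf.data.Dataset.from_tensor_slices(labels)
ds = ds.filter(lambda y: tf.random.uniform(()) >= 0.9)  # random undersampling
ds = ds.cache()  # in-memory; pass a filename to cache to disk instead

epoch1 = [int(y) for y in ds]  # first full pass fills the cache
epoch2 = [int(y) for y in ds]  # reuses the cached elements
print(epoch1 == epoch2)  # True: the same ~10% survives every epoch
```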

Otherwise, you can try the rejection_resample function: I have never tried it, but as far as I understand it implements behaviour similar to your custom resampling function (increasing or decreasing the size of the dataset), and is perhaps faster.
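An equally hedged sketch of how it might be used (rejection_resample is available as a Dataset method in recent TF versions; it yields (class, element) pairs, so a map drops the class afterwards — the distributions here are made up):

```python
import tensorflow as tf

# ~10% positives in the input stream
ds = tf.data.Dataset.range(1000).map(
    lambda i: tf.cast(i % 10 == 0, tf.int64))

resampled = ds.rejection_resample(
    class_func=lambda y: tf.cast(y, tf.int32),  # class of each element
    target_dist=[0.5, 0.5],                     # aim for a balanced stream
    initial_dist=[0.9, 0.1])                    # known input distribution
# rejection_resample yields (class, element) pairs; keep the element
resampled = resampled.map(lambda c, y: y)
```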

As a side note: consider that the first training epoch is always the slowest, because tf has to compile the model into a static computational graph (or, at least, it compiles every piece of code wrapped in a tf.function).
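That one-time tracing cost can be observed directly (a small illustrative check):

```python
import tensorflow as tf

trace_count = 0

@tf.function
def square(x):
    global trace_count
    trace_count += 1  # Python side effects run only while tracing
    return x * x

square(tf.constant(2.0))  # first call: traced and compiled
square(tf.constant(3.0))  # same input signature: reuses the graph
print(trace_count)  # 1
```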

huangapple
  • Published on 2023-03-23 12:36:41
  • Please keep this link when reposting: https://go.coder-hub.com/75819291.html