Keras: time per step increases with a filter on the number of samples, epoch time continues the same




I'm implementing a simple sanity-check model in Keras for some data I have. My training dataset consists of about 550 files, each contributing about 150 samples. Each training sample has the following signature:

({'input_a': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
  'input_b': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
   TensorSpec(shape=(None, 1), dtype=tf.int64, name=None))

Essentially, each training sample is made up of two inputs with shape (900, 1), and the target is a single (binary) label. The first step of my model is a concatenation of inputs into a (900, 2) Tensor.

The total number of training samples is about 70000.

As input to the model, I'm creating a tf.data.Dataset and applying a few preparation steps:

  1. tf.Dataset.filter: to filter some samples with invalid labels
  2. tf.Dataset.shuffle
  3. tf.Dataset.filter: to undersample my training dataset
  4. tf.Dataset.batch

Step 3 is the most important in my question. To undersample my dataset I apply a simple function:

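from typing import Iterable

import tensorflow as tf


def undersampling(dataset: tf.data.Dataset, drop_proba: Iterable[float]) -> tf.data.Dataset:
    def undersample_function(x, y):
        # drop_prob_[l] is the probability of dropping a sample with label l
        drop_prob_ = tf.constant(drop_proba)

        # y has shape (1,) before batching, so y[0] is the scalar label
        idx = y[0]

        p = drop_prob_[idx]
        v = tf.random.uniform(shape=(), dtype=tf.float32)

        # keep the sample when the uniform draw is at least the drop probability
        return tf.math.greater_equal(v, p)

    return dataset.filter(undersample_function)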

Essentially, the function accepts a vector of probabilities drop_prob such that drop_prob[l] is the probability of dropping a sample with label l (the function is a bit convoluted, but it's the way I found to implement it as a Dataset.filter). Using equal probabilities, say drop_prob=[0.9, 0.9], I'll be dropping about 90% of my samples.
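For reference, the four preparation steps and the undersampling filter can be reproduced on synthetic data roughly as follows. This is only a sketch: the zero-filled inputs, the sample count n, the validity check in step 1 and the helper name keep are assumptions for illustration, and tf.gather is used for the lookup instead of indexing.

```python
import tensorflow as tf

tf.random.set_seed(0)  # fix the global seed for this sketch

# Synthetic stand-in for the real data (assumption): 2000 samples
n = 2000
features = {
    "input_a": tf.zeros((n, 900, 1), dtype=tf.float64),
    "input_b": tf.zeros((n, 900, 1), dtype=tf.float64),
}
labels = tf.cast(tf.random.uniform((n, 1), maxval=2, dtype=tf.int32), tf.int64)


def undersampling(dataset, drop_proba):
    drop_prob_ = tf.constant(drop_proba, dtype=tf.float32)

    def keep(x, y):
        p = tf.gather(drop_prob_, y[0])  # drop probability for this label
        v = tf.random.uniform(shape=(), dtype=tf.float32)
        return v >= p  # keep the sample with probability 1 - p

    return dataset.filter(keep)


dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.filter(lambda x, y: y[0] >= 0)  # 1. hypothetical validity check
dataset = dataset.shuffle(buffer_size=n)          # 2. shuffle
dataset = undersampling(dataset, [0.9, 0.9])      # 3. undersample
dataset = dataset.batch(20000)                    # 4. batch

# Count how many samples survive the pipeline
kept = sum(int(y.shape[0]) for _, y in dataset)
print(f"kept {kept} of {n} samples ({kept / n:.1%})")
```

Note that every sample still passes through steps 1 and 2 before step 3 can drop it, so the amount of data read per epoch does not shrink with the drop probability.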

Now, the thing is, I've been experimenting with different undersamplings for my dataset, in order to find a sweet spot between performance and training time, but when I undersample, the epoch duration is the same, with time/step increasing instead.

Keeping my batch_size fixed at 20000, for the complete dataset I have a total of 4 batches, and the following time for an average epoch:

Epoch 4/1000
1/4 [======>.......................] - ETA: 9s
2/4 [==============>...............] - ETA: 5s
3/4 [=====================>........] - ETA: 2s
4/4 [==============================] - ETA: 0s
4/4 [==============================] - 21s 6s/step

If instead I undersample my dataset with drop_prob = [0.9, 0.9] (that is, I'm getting rid of about 90% of the dataset) and keep the same batch_size of 20000, I have 1 batch, and the following time for an average epoch:

Epoch 4/1000
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 22s 22s/step 

Notice that while the number of batches is only 1, the epoch time is the same! It just takes longer to process the batch.

Now, as a sanity check, I tried a different way of undersampling, by filtering the files instead. So I selected about 55 of the training files (10%), to have a similar number of samples in a single batch, and removed the undersampling from the tf.Dataset. The epoch time decreases as expected:

Epoch 4/1000
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 2s 2s/step 

Note that the original dataset has 70014 training samples, while the undersampled dataset by means of tf.Dataset.filter had 6995 samples and the undersampled dataset by means of file filtering had 7018 samples, thus the numbers are consistent.

Much faster. In fact, it takes about 10% of the time an epoch takes with the full dataset. So there is an issue with the way I'm performing undersampling (by using Dataset.filter when creating the tf.Dataset), and I would like to ask for help figuring out what the issue is. Thanks.


Score: 1






It seems that most of the time is spent on the dataset operations rather than the network itself. From examining the evidence, my theory would be that if this is executed on GPU (dataset operations are executed on the CPU regardless) then the GPU has to wait for the dataset between batches.
Since the dataset operations take the same time regardless of how many samples survive the filter, the progress bar makes it look as though each batch takes longer.
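One rough way to check this reading against the timings in the question is a back-of-envelope calculation. It assumes epoch time is roughly proportional to the number of samples involved, which is a simplification rather than a measurement, and all variable names are made up for illustration.

```python
# Numbers reported in the question
n_full, n_filter, n_files = 70014, 6995, 7018  # samples: full, Dataset.filter, file filtering
t_full, t_filter, t_files = 21.0, 22.0, 2.0    # seconds per epoch, same order

# Assumed linear model: seconds per sample when every sample read is also trained on
per_sample = t_full / n_full

# Check the model against the file-filtering run
predicted_files = n_files * per_sample

# Two hypotheses for the Dataset.filter run
if_training_dominates = n_filter * per_sample  # cost scales with samples that survive the filter
if_input_dominates = n_full * per_sample       # cost scales with samples read before filtering

print(f"file filtering: predicted {predicted_files:.1f}s, observed {t_files}s")
print(f"Dataset.filter: {if_training_dominates:.1f}s if training dominates, "
      f"{if_input_dominates:.1f}s if input dominates, observed {t_filter}s")
```

Under this assumption, the observed 22 s for the Dataset.filter run is much closer to the "input dominates" estimate, which is consistent with the input pipeline being the bottleneck.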

If executed on a GPU, the right way to check whether this theory is correct is to observe the GPU utilization (you can use watch -n 0.5 nvidia-smi as it runs, or better yet use nvtop or any other GPU monitoring tool). If there are times when the utilization (not memory, but utilization) is not close to 100%, that would indicate that this is indeed the problem. Note that it should never drop below 90%, not even for half a second.

To solve this, you should use Dataset.prefetch as the last dataset operation in your code. This causes the CPU to prepare batches ahead of time, so that a batch is already available when the network needs it and the GPU does not have to wait.
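A minimal sketch of where the prefetch call would go, assuming TensorFlow 2.4 or later (for tf.data.AUTOTUNE) and using a toy range dataset in place of the real pipeline:

```python
import tensorflow as tf

# Toy pipeline (assumption): stands in for the filter/shuffle/filter steps of the question
dataset = tf.data.Dataset.range(1000)
dataset = dataset.filter(lambda i: i % 10 == 0)  # keeps 100 elements
dataset = dataset.batch(32)
# prefetch goes last so that whole batches are prepared in the background
dataset = dataset.prefetch(tf.data.AUTOTUNE)

n_batches = sum(1 for _ in dataset)
print(n_batches)
```

Prefetching overlaps input preparation with training; it does not make the input pipeline itself faster, so the gain depends on how much of the epoch is spent in each.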


Score: 0





I suggest caching the dataset after the second filtering. As the docs say, you can store the cache either in memory or in a file. Basically, after the first iteration tf will save the dataset and reuse it afterwards. Note that this also means the first random filtering determines the remaining samples, which will then be the same for every epoch.
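A small sketch of this behaviour, using a toy range dataset and an inline random filter as stand-ins (assumptions) for the real data and the undersampling step:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1000)
# Random filter standing in for the undersampling step: drops about 90% of the elements
dataset = dataset.filter(lambda i: tf.random.uniform(()) >= 0.9)
dataset = dataset.cache()  # in memory; pass a filename to cache to a file instead

# The first full pass runs the filter and fills the cache
first_epoch = list(dataset.as_numpy_iterator())
# Later passes read from the cache, so the random filter is not re-run
second_epoch = list(dataset.as_numpy_iterator())

print(len(first_epoch), first_epoch == second_epoch)
```

If per-epoch reshuffling is still wanted, the shuffle step would have to come after cache, since anything before it is frozen by the first pass.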

Otherwise, you can try the rejection_resample function. I have never tried it, but as far as I understand it implements a behaviour similar to your custom resampling function (increasing or decreasing the size of the dataset), and is perhaps faster.
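A hedged sketch of how rejection_resample might be applied, using tf.data.experimental.rejection_resample with Dataset.apply (recent TensorFlow versions also expose it as a Dataset method). The synthetic imbalanced labels and the target distribution are assumptions for illustration; note that it resamples towards a target class distribution rather than taking per-class drop probabilities.

```python
import tensorflow as tf

# Synthetic imbalanced labels (assumption): 90% class 0, 10% class 1
n = 10000
labels = tf.cast(tf.range(n) % 10 == 0, tf.int64)[:, tf.newaxis]  # shape (n, 1)
features = tf.range(n)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

resample = tf.data.experimental.rejection_resample(
    class_func=lambda x, y: tf.cast(y[0], tf.int32),  # scalar int32 class id
    target_dist=[0.5, 0.5],   # desired class proportions
    initial_dist=[0.9, 0.1],  # class proportions of the input
    seed=0,
)
# The transformation yields (class, element) pairs; keep only the original element
balanced = dataset.apply(resample).map(lambda cls, data: data)

ys = [int(y[0]) for _, y in balanced.as_numpy_iterator()]
print(len(ys), sum(ys) / len(ys))
```

Like Dataset.filter, this still reads the input elements before rejecting them, so on its own it would not be expected to reduce the cost of loading the data.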

As a side note: keep in mind that the first training epoch is always the slowest, because tf has to compile the model to obtain a static computational graph (or, at least, it compiles every piece of code that is wrapped in a tf.function).

