TensorBoard histogram one_hot operation causing ResourceExhaustedError: OOM
# Question
I'm trying to train a VGG16 model. I'm using a sample dataset of 4,000 300x300 images in 14 classes, and running my code on a Google Cloud VM with an Nvidia L4 GPU with 20 GB of memory. I am running Python 3.7, TensorFlow 2.11, and CUDA 12.1. My data is stored in GCS.
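For reference, the training code is wired up roughly as in the sketch below. This is a simplified placeholder rather than the exact script: the classification head, the loss, and the dataset construction are stand-ins.

```python
import tensorflow as tf

IMG_SIZE = (300, 300)
NUM_CLASSES = 14

# train_ds / val_ds are batched tf.data.Dataset objects built from the images in GCS
# (construction omitted here).

# Placeholder head on top of a VGG16 convolutional base; the real head may differ.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=IMG_SIZE + (3,))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```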
When I run the model with the following TensorBoard callback:
`tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)`
I get this error at the end of the first epoch:

2023-06-14 19:51:21.248476: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (mklcpu) ran out of memory trying to allocate 22.97GiB (rounded to 24662507520) requested by op OneHot
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
ResourceExhaustedError: {{function_node __wrapped__OneHot_device_/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[102760448,30] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu [Op:OneHot]
The error traces back to the TensorBoard histogram object:
ResourceExhaustedError Traceback (most recent call last)
/var/tmp/ipykernel_5723/1753739100.py in <module>
1 # Fit model
----> 2 history = model.fit(train_ds, validation_data=val_ds, epochs=5, callbacks=[tensorboard_callback])
/opt/conda/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
--> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in histogram(name, data, step, buckets, description)
198 tensor=lazy_tensor,
199 step=step,
--> 200 metadata=summary_metadata,
201 )
202
/opt/conda/lib/python3.7/site-packages/tensorboard/util/lazy_tensor_creator.py in __call__(self)
64 elif self._tensor is None:
65 self._tensor = _CALL_IN_PROGRESS_SENTINEL
--> 66 self._tensor = self._tensor_callable()
67 return self._tensor
68
/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in lazy_tensor()
192 @lazy_tensor_creator.LazyTensorCreator
193 def lazy_tensor():
--> 194 return _buckets(data, buckets)
195
196 return tf.summary.write(
/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in _buckets(data, bucket_count)
291 )
292
--> 293 return tf.cond(is_empty, when_empty, when_nonempty)
/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in when_nonempty()
288
289 return tf.cond(
--> 290 has_single_value, when_single_value, when_multiple_values
291 )
292
/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in when_multiple_values()
257 # See https://github.com/tensorflow/tensorflow/issues/51419 for details.
258 one_hots = tf.one_hot(
--> 259 clamped_indices, depth=bucket_count, dtype=tf.float64
260 )
261 bucket_counts = tf.cast(
ResourceExhaustedError: {{function_node __wrapped__OneHot_device_/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[102760448,30] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu [Op:OneHot]
Interestingly, it seems to call tf.one_hot and blow up the GPU memory with a massive tensor regardless of whether I train the model with integer labels and sparse categorical cross-entropy or with one-hot labels and categorical cross-entropy. I don't really understand what this tensor contains, because its dimensions relate neither to the number of training examples nor to the number of classes I'm using.
Any ideas about how to fix this?
# Answer 1
**Score**: 1
The issue seems to be related to memory resources rather than a problem with TensorFlow. One-hot encoding creates a very large sparse tensor that may require considerably more memory, and since you have set histogram_freq=1, TensorBoard performs additional computations to build a weight histogram for every layer, which pushes the memory requirement up further.
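Looking at the traceback, the tf.one_hot call appears to come from TensorBoard's histogram summary rather than from your labels: it builds a [number_of_values_being_histogrammed, bucket_count] buffer in float64, and 30 matches the summary's default bucket count, which is why the dimensions relate to neither your sample count nor your class count. A rough, illustrative estimate follows; the 25088 x 4096 Dense kernel is just an example that happens to equal the 102,760,448 reported in the error.

```python
# Rough size of the temporary one-hot buffer built while histogramming one weight
# tensor: [num_weights, buckets] in float64 (8 bytes per value).
def histogram_buffer_gib(num_weights, buckets=30):
    """Size in GiB of a [num_weights, buckets] float64 tensor."""
    return num_weights * buckets * 8 / 2**30

# Example: a Dense kernel of shape (25088, 4096) has 102,760,448 weights,
# matching the first dimension in the error message above.
print(histogram_buffer_gib(25088 * 4096))  # ~22.97 GiB for a single layer's histogram
```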
You can try setting histogram_freq=0 and check whether the problem still exists. If it does, we would need to look at the part of your code that produces the large tensor computation; if it goes away, it is clearly a case of higher memory demand caused by the histogram computation.

OOM errors depend on the input sizes and the available memory resources. TensorFlow cannot control this; it has to be handled by the user.

You can also try reducing the batch_size in model.fit, which defaults to 32 (for example, batch_size=16). In addition, run the code block below before importing TensorFlow; it may help if the OOM is due to memory fragmentation.
```python
import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'
```
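Putting these suggestions together, a rough sketch (the log directory and the batch size of 16 are placeholders, not tuned values):

```python
import os
# Must be set before TensorFlow is imported if fragmentation is the suspected cause.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf

log_dir = "gs://my-bucket/logs"  # placeholder log directory

# histogram_freq=0 skips the per-layer weight histograms; scalar metrics are still logged.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=0)

# Because train_ds is a tf.data.Dataset, the effective batch size comes from the
# dataset's .batch() call rather than from model.fit; a smaller batch could be set there:
# train_ds = train_ds.unbatch().batch(16)

# history = model.fit(train_ds, validation_data=val_ds, epochs=5,
#                     callbacks=[tensorboard_callback])
```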