TensorBoard histogram one-hot operation causing ResourceExhaustedError: OOM


Question

I'm trying to train a VGG16 model. I'm using a sample dataset of 4000 300x300 images in 14 classes, and running my code on a Google Cloud VM with an NVIDIA L4 GPU with 20 GB of memory. I am running Python 3.7, TensorFlow 2.11, and CUDA 12.1. My data is stored in GCS.

When I run the model with the following TensorBoard callback:

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

I get the following error at the end of the first epoch:

2023-06-14 19:51:21.248476: W tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (mklcpu) ran out of memory trying to allocate 22.97GiB (rounded to 24662507520) requested by op OneHot
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.

ResourceExhaustedError: {{function_node __wrapped__OneHot_device_/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[102760448,30] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu [Op:OneHot]
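
As a quick sanity check (my own back-of-the-envelope arithmetic, not part of the error output), a float64 tensor of the reported shape is exactly the ~23 GiB the allocator tried to grab:

```python
# Rough size of the tensor in the OOM message: shape [102760448, 30], dtype float64 (8 bytes/element)
size_bytes = 102_760_448 * 30 * 8
print(size_bytes)           # 24662507520  -> matches the "rounded to 24662507520" in the warning
print(size_bytes / 2**30)   # ~22.97       -> matches the 22.97GiB the allocator could not get
```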

The error traces back to the TensorBoard histogram code:

ResourceExhaustedError                    Traceback (most recent call last)
/var/tmp/ipykernel_5723/1753739100.py in <module>
      1 # Fit model
----> 2 history = model.fit(train_ds, validation_data=val_ds, epochs=5, callbacks=[tensorboard_callback])

/opt/conda/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
--> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb

/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in histogram(name, data, step, buckets, description)
    198             tensor=lazy_tensor,
    199             step=step,
--> 200             metadata=summary_metadata,
    201         )
    202 

/opt/conda/lib/python3.7/site-packages/tensorboard/util/lazy_tensor_creator.py in __call__(self)
     64                 elif self._tensor is None:
     65                     self._tensor = _CALL_IN_PROGRESS_SENTINEL
--> 66                     self._tensor = self._tensor_callable()
     67         return self._tensor
     68 

/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in lazy_tensor()
    192         @lazy_tensor_creator.LazyTensorCreator
    193         def lazy_tensor():
--> 194             return _buckets(data, buckets)
    195 
    196         return tf.summary.write(

/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in _buckets(data, bucket_count)
    291             )
    292 
--> 293         return tf.cond(is_empty, when_empty, when_nonempty)

/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in when_nonempty()
    288 
    289             return tf.cond(
--> 290                 has_single_value, when_single_value, when_multiple_values
    291             )
    292 

/opt/conda/lib/python3.7/site-packages/tensorboard/plugins/histogram/summary_v2.py in when_multiple_values()
    257                 # See https://github.com/tensorflow/tensorflow/issues/51419 for details.
    258                 one_hots = tf.one_hot(
--> 259                     clamped_indices, depth=bucket_count, dtype=tf.float64
    260                 )
    261                 bucket_counts = tf.cast( 

ResourceExhaustedError: {{function_node __wrapped__OneHot_device_/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[102760448,30] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu [Op:OneHot]

Interestingly, it seems to call tf.one_hot and blow up GPU memory with a massive tensor regardless of whether I train the model with integer labels and sparse categorical cross-entropy or with one-hot labels and categorical cross-entropy. I don't really understand what this tensor contains, because its dimensions relate neither to the number of training examples nor to the number of classes I'm using.

Any ideas about how to fix this?



# Answer 1
**Score**: 1

The issue appears to be a memory-resource problem rather than a bug in TensorFlow. The tf.one_hot call in your traceback comes from TensorBoard's histogram bucketing, and it creates a very large, mostly sparse intermediate tensor that needs a lot of memory. Because you have set histogram_freq=1, the callback computes weight histograms for every layer at every epoch, which adds that memory cost on top of training itself.

You can try setting histogram_freq=0 and check whether the problem still occurs. If it does, we would need to look at the part of your code that produces the large tensor; if it does not, this is clearly a case of the extra memory required by the histogram computation.
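
For example, a minimal sketch (log_dir, model, train_ds, and val_ds are assumed to be defined exactly as in your existing notebook) with the histograms disabled but the rest of the TensorBoard logging kept:

```python
import tensorflow as tf

# Same callback as in the question, but with per-layer weight histograms turned off.
# Scalars (loss/metrics) are still written to log_dir.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,      # assumed defined elsewhere in your notebook
    histogram_freq=0,     # 0 = never compute weight histograms
)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5,
    callbacks=[tensorboard_callback],
)
```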

OOM errors depend on the input sizes and the available memory; TensorFlow cannot control this, so it has to be handled by the user.

You could also reduce the batch size, which defaults to 32 in model.fit (try batch_size=16, for example). In addition, run the code below before importing TensorFlow; it may help if the OOM is caused by memory fragmentation.

```python
import os

# Must be set before TensorFlow is imported so the allocator setting takes effect.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'
```
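
One caveat on the batch size: since the question passes an already-batched tf.data dataset to model.fit (model.fit(train_ds, ...)), Keras does not accept a batch_size argument for that input type, so the batch size would be reduced where the dataset is batched. A hypothetical sketch only, since the dataset-building code is not shown in the question (build_dataset, image_paths, labels, and load_and_preprocess are illustrative placeholders):

```python
import tensorflow as tf

# Hypothetical pipeline -- names are placeholders, not the asker's actual code.
def build_dataset(image_paths, labels, batch_size=16):
    ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    ds = ds.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # load_and_preprocess: placeholder decode/resize fn
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)                 # batch size is controlled here, not in model.fit
```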
