Why does tensorflow.function (without jit_compile) speed up forward passes of a Keras model?

XLA can be enabled using model = tf.function(model, jit_compile=True). Some model types are faster that way, some are slower. So far, so good.

But why can model = tf.function(model, jit_compile=None) speed things up significantly (without TPU) in some cases?

The jit_compile docs state:

> If None (default), compiles the function with XLA when running on TPU
> and goes through the regular function execution path when running on
> other devices.

I'm running my tests on two non-TPU (and even non-GPU) machines (with the latest TensorFlow (2.13.0) installed).

```python
import timeit

import numpy as np
import tensorflow as tf

model_plain = tf.keras.applications.efficientnet_v2.EfficientNetV2S()
model_jit_compile_true = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=True)
model_jit_compile_false = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=False)
model_jit_compile_none = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=None)


def run(model):
    model(np.random.random(size=(1, 384, 384, 3)))


# warmup
run(model_plain)
run(model_jit_compile_true)
run(model_jit_compile_false)
run(model_jit_compile_none)

runs = 10
duration_plain = timeit.timeit(lambda: run(model_plain), number=runs) / runs
duration_jit_compile_true = timeit.timeit(lambda: run(model_jit_compile_true), number=runs) / runs
duration_jit_compile_false = timeit.timeit(lambda: run(model_jit_compile_false), number=runs) / runs
duration_jit_compile_none = timeit.timeit(lambda: run(model_jit_compile_none), number=runs) / runs

print(f"{duration_plain=}")
print(f"{duration_jit_compile_true=}")
print(f"{duration_jit_compile_false=}")
print(f"{duration_jit_compile_none=}")
```

Output:

```
duration_plain=0.53095479644835
duration_jit_compile_true=1.5860380740836262
duration_jit_compile_false=0.09831228516995907
duration_jit_compile_none=0.09407951850444078
```

Answer 1 (score: 2)


> But why can model = tf.function(model, jit_compile=None) speed things up significantly (without TPU) in some cases?

The speedup comes mainly from graph mode, enabled by tf.function, which is much faster than the eager execution used for model_plain.
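The graph-mode effect can be seen in isolation on a toy function (a minimal sketch, independent of XLA): the Python body runs only while tf.function traces the graph, and later calls with the same input signature replay the traced graph without re-executing the Python.

```python
import tensorflow as tf

trace_count = 0

@tf.function  # jit_compile left at its default (None)
def squared_sum(x):
    global trace_count
    trace_count += 1  # Python side effects run only during tracing
    return tf.reduce_sum(x * x)

x = tf.constant([1.0, 2.0, 3.0])
squared_sum(x)
squared_sum(x)  # same input signature: replays the graph, no re-trace
print(trace_count)  # prints 1: the body was traced exactly once
```

This is why even jit_compile=None (or False) is much faster than the plain eager call: the per-call Python overhead disappears after the first trace.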

On top of that, there are secondary effects from XLA compilation via the jit_compile flag, but these depend heavily on the computing architecture. For instance, the numbers look quite different when the model is compiled on a GPU.

Last but not least, the benchmarking methodology should be corrected to account for variance, which is substantial for 10 runs of this use case (otherwise the findings will be misleading or even contradictory; e.g., due to high variance, jit_compile=None can look faster on average). For future reference, note that this profiling pattern from the TensorFlow docs is inaccurate:

```python
# average runtime on 10 repetitions without variance is inaccurate
print("Eager conv:", timeit.timeit(lambda: conv_layer(image), number=10))
```
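A variance-aware alternative needs only the standard library: timeit.repeat returns one total per repetition, so a spread can be reported alongside the mean (the toy workload below is just a placeholder for the model call being measured).

```python
import statistics
import timeit

def work():
    # placeholder workload; substitute the model call being measured
    sum(i * i for i in range(10_000))

# 7 repetitions of 100 loops each -> per-call times with a spread
times = [t / 100 for t in timeit.repeat(work, number=100, repeat=7)]
mean_ms = statistics.mean(times) * 1e3
std_ms = statistics.stdev(times) * 1e3
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop (mean ± std. dev. of 7 runs, 100 loops each)")
```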

The following corrected and extended snippet, executed in a Kaggle notebook with a GPU, demonstrates that the improvement comes mostly from graph mode and that XLA compilation gives some further speedup.

```python
import timeit

import numpy as np
import tensorflow as tf

model_plain = tf.keras.applications.efficientnet_v2.EfficientNetV2S()
model_tffunc = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=None)
model_jit_compile_true = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=True)
model_jit_compile_false = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=False)
model_jit_compile_none = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=None)

x = np.random.random(size=(1, 384, 384, 3))

def run(model):
    model(x)

# warmup
run(model_plain)
run(model_tffunc)
run(model_jit_compile_true)
run(model_jit_compile_false)
run(model_jit_compile_none)

# benchmarking (IPython %timeit magics; run in a notebook)
duration_plain = %timeit -o run(model_plain)
duration_tffunc = %timeit -o run(model_tffunc)
duration_jit_compile_true = %timeit -o run(model_jit_compile_true)
duration_jit_compile_false = %timeit -o run(model_jit_compile_false)
duration_jit_compile_none = %timeit -o run(model_jit_compile_none)

print(f"{str(duration_plain)=}")
print(f"{str(duration_tffunc)=}")
print(f"{str(duration_jit_compile_true)=}")
print(f"{str(duration_jit_compile_false)=}")
print(f"{str(duration_jit_compile_none)=}")
```

Statistically, we have: duration_plain > duration_jit_compile_false = duration_jit_compile_none = duration_tffunc > duration_jit_compile_true, as seen from the output:

```
369 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
16.1 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
11.6 ms ± 882 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.9 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.5 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
str(duration_plain)='369 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)'
str(duration_tffunc)='16.1 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)'
str(duration_jit_compile_true)='11.6 ms ± 882 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)'
str(duration_jit_compile_false)='15.9 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)'
str(duration_jit_compile_none)='15.5 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)'
```

For a complete example, see this public notebook.

NOTE: this way of measuring variation is useful but not fully accurate.
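One common refinement, suggested in a note in Python's timeit documentation, is to also look at the minimum of the repetitions: the lowest value bounds the intrinsic runtime of the code, while higher values mostly reflect interference from the rest of the system. A stdlib sketch (the workload is again a placeholder):

```python
import timeit

def work():
    sum(range(100_000))  # placeholder workload

# per-call times for 5 repetitions of 1000 loops each
per_call = [t / 1000 for t in timeit.repeat(work, number=1000, repeat=5)]
print(f"best of 5: {min(per_call) * 1e6:.1f} µs per loop")
```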

huangapple
  • Published 2023-08-09 12:22:30
  • Original link: https://go.coder-hub.com/76864569-2.html