Why does tensorflow.function (without jit_compile) speed up forward passes of a Keras model?

XLA can be enabled using model = tf.function(model, jit_compile=True). Some model types are faster that way, some are slower. So far, so good.

But why can model = tf.function(model, jit_compile=None) speed things up significantly (without TPU) in some cases?

The jit_compile docs state:

> If None (default), compiles the function with XLA when running on TPU
> and goes through the regular function execution path when running on
> other devices.

I'm running my tests on two non-TPU (and even non-GPU) machines (with the latest TensorFlow (2.13.0) installed).

```python
import timeit

import numpy as np
import tensorflow as tf

model_plain = tf.keras.applications.efficientnet_v2.EfficientNetV2S()
model_jit_compile_true = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=True)
model_jit_compile_false = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=False)
model_jit_compile_none = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=None)


def run(model):
    model(np.random.random(size=(1, 384, 384, 3)))


# warmup
run(model_plain)
run(model_jit_compile_true)
run(model_jit_compile_false)
run(model_jit_compile_none)

runs = 10
duration_plain = timeit.timeit(lambda: run(model_plain), number=runs) / runs
duration_jit_compile_true = timeit.timeit(lambda: run(model_jit_compile_true), number=runs) / runs
duration_jit_compile_false = timeit.timeit(lambda: run(model_jit_compile_false), number=runs) / runs
duration_jit_compile_none = timeit.timeit(lambda: run(model_jit_compile_none), number=runs) / runs

print(f"{duration_plain=}")
print(f"{duration_jit_compile_true=}")
print(f"{duration_jit_compile_false=}")
print(f"{duration_jit_compile_none=}")
```

Output:

```
duration_plain=0.53095479644835
duration_jit_compile_true=1.5860380740836262
duration_jit_compile_false=0.09831228516995907
duration_jit_compile_none=0.09407951850444078
```

Answer 1 (score: 2)


> But why can model = tf.function(model, jit_compile=None) speed things up significantly (without TPU) in some cases?

The speedup comes mainly from graph mode, enabled by tf.function, which is much faster than the eager execution used for model_plain.
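The graph-mode effect can be seen in isolation on a toy function (a minimal sketch, independent of XLA): the Python body runs only while tf.function traces the graph, and later calls with the same input signature replay the traced graph without re-executing the Python.

```python
import tensorflow as tf

trace_count = 0

@tf.function  # jit_compile left at its default (None)
def squared_sum(x):
    global trace_count
    trace_count += 1  # Python side effects run only during tracing
    return tf.reduce_sum(x * x)

x = tf.constant([1.0, 2.0, 3.0])
squared_sum(x)
squared_sum(x)  # same input signature: replays the graph, no re-trace
print(trace_count)  # prints 1: the body was traced exactly once
```

This is why even jit_compile=None (or False) is much faster than the plain eager call: the per-call Python overhead disappears after the first trace.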

On top of that, there are secondary effects from XLA compilation via the jit_compile flag, but these depend heavily on the computing architecture. For instance, the numbers look quite different when the model is compiled on a GPU.

Last but not least, the benchmarking methodology should be corrected to account for variance, which is substantial for 10 runs of this use case (otherwise the findings will be misleading or even contradictory; e.g., due to high variance, jit_compile=None can look faster on average). For future reference, note that this profiling pattern from the TensorFlow docs is inaccurate:

```python
# average runtime on 10 repetitions without variance is inaccurate
print("Eager conv:", timeit.timeit(lambda: conv_layer(image), number=10))
```
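A variance-aware alternative needs only the standard library: timeit.repeat returns one total per repetition, so a spread can be reported alongside the mean (the toy workload below is just a placeholder for the model call being measured).

```python
import statistics
import timeit

def work():
    # placeholder workload; substitute the model call being measured
    sum(i * i for i in range(10_000))

# 7 repetitions of 100 loops each -> per-call times with a spread
times = [t / 100 for t in timeit.repeat(work, number=100, repeat=7)]
mean_ms = statistics.mean(times) * 1e3
std_ms = statistics.stdev(times) * 1e3
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop (mean ± std. dev. of 7 runs, 100 loops each)")
```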

The following corrected and extended snippet, executed in a Kaggle notebook with a GPU, demonstrates that the improvement comes mostly from graph mode and that XLA compilation gives some further speedup.

```python
import timeit

import numpy as np
import tensorflow as tf

model_plain = tf.keras.applications.efficientnet_v2.EfficientNetV2S()
model_tffunc = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=None)
model_jit_compile_true = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=True)
model_jit_compile_false = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=False)
model_jit_compile_none = tf.function(tf.keras.applications.efficientnet_v2.EfficientNetV2S(), jit_compile=None)

x = np.random.random(size=(1, 384, 384, 3))

def run(model):
    model(x)

# warmup
run(model_plain)
run(model_tffunc)
run(model_jit_compile_true)
run(model_jit_compile_false)
run(model_jit_compile_none)

# benchmarking (IPython %timeit magics; run in a notebook)
duration_plain = %timeit -o run(model_plain)
duration_tffunc = %timeit -o run(model_tffunc)
duration_jit_compile_true = %timeit -o run(model_jit_compile_true)
duration_jit_compile_false = %timeit -o run(model_jit_compile_false)
duration_jit_compile_none = %timeit -o run(model_jit_compile_none)

print(f"{str(duration_plain)=}")
print(f"{str(duration_tffunc)=}")
print(f"{str(duration_jit_compile_true)=}")
print(f"{str(duration_jit_compile_false)=}")
print(f"{str(duration_jit_compile_none)=}")
```

Statistically, we have: duration_plain > duration_jit_compile_false = duration_jit_compile_none = duration_tffunc > duration_jit_compile_true, as seen from the output:

```
369 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
16.1 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
11.6 ms ± 882 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.9 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.5 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
str(duration_plain)='369 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)'
str(duration_tffunc)='16.1 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)'
str(duration_jit_compile_true)='11.6 ms ± 882 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)'
str(duration_jit_compile_false)='15.9 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)'
str(duration_jit_compile_none)='15.5 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)'
```

For a complete example, see this public notebook.

NOTE: this way of measuring variation is useful but not fully accurate.
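One common refinement, suggested in a note in Python's timeit documentation, is to also look at the minimum of the repetitions: the lowest value bounds the intrinsic runtime of the code, while higher values mostly reflect interference from the rest of the system. A stdlib sketch (the workload is again a placeholder):

```python
import timeit

def work():
    sum(range(100_000))  # placeholder workload

# per-call times for 5 repetitions of 1000 loops each
per_call = [t / 1000 for t in timeit.repeat(work, number=1000, repeat=5)]
print(f"best of 5: {min(per_call) * 1e6:.1f} µs per loop")
```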

huangapple
  • Published 2023-08-09 12:22:30
  • Original link: https://go.coder-hub.com/76864569-2.html