Nvidia 4060 Ti 8GB slower than CPU in classification

Question

I am following this tutorial: https://www.tensorflow.org/tutorials/images/classification

In my tests, a run takes about 50 seconds on the CPU but about 7-8 minutes on the GPU. I assume I am doing something wrong.

My CPU is a 10th-generation Intel i5, and the machine has 96 GB of RAM. I would expect the GPU to be at least 2x faster.

I enabled mixed precision to make sure the Tensor Cores are used:

    from tensorflow.keras import mixed_precision
    mixed_precision.set_global_policy('mixed_float16')
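
A minimal sanity check, assuming TensorFlow 2.x, to confirm that the policy is active and that TensorFlow actually sees the GPU (an empty device list would mean everything falls back to the CPU):

    import tensorflow as tf
    from tensorflow.keras import mixed_precision

    mixed_precision.set_global_policy('mixed_float16')
    # Should print the mixed_float16 policy.
    print(mixed_precision.global_policy())
    # Should list one PhysicalDevice entry for the RTX 4060 Ti;
    # an empty list means TensorFlow is running on the CPU only.
    print(tf.config.list_physical_devices('GPU'))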

What am I missing? Is an RTX 4060 Ti with 8 GB of VRAM really this slow for a classification task?

I have about 1,000 classes, but that should not matter, since the CPU is faster either way...

I am using a batch size of 512; VRAM usage is about 6 GB out of 8 GB, and CPU usage is around 50% most of the time.

I am also using BatchNormalization:

    model = Sequential([
        data_augmentation,
        layers.Rescaling(1./255),
        layers.Conv2D(16, 3, padding='same', activation='relu'),
        BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding='same', activation='relu'),
        BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Dropout(0.2),
        BatchNormalization(),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_classes, name="outputs")
    ])
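
A side note: the Keras mixed-precision guide recommends keeping the model's outputs in float32 for numerical stability even when the rest of the model computes in float16. A minimal sketch of that variant of the `outputs` layer (hypothetical, not what the post uses):

    from tensorflow.keras import layers

    # Same logits layer as above, but forced to float32 so the loss is
    # computed in full precision under the mixed_float16 policy.
    outputs_layer = layers.Dense(num_classes, dtype='float32', name="outputs")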

P.S.: I am new to AI.

I have tried different batch sizes.

I have tried disabling the GPU and running the same test on the CPU only.

I checked RAM, disk, and CPU for bottlenecks (none of them reaches 100%). When I run on the CPU, CPU usage is 100%, while GPU usage stays at 1% or less.
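
A minimal diagnostic sketch, assuming TensorFlow 2.x, that logs which device each op is placed on (it has to run before the model is built):

    import tensorflow as tf

    # Each op then logs whether it runs on /device:GPU:0 or on the CPU.
    tf.debugging.set_log_device_placement(True)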

These are the batch-size tests I ran:

    Batch size    Time (s)
    4             377
    8             304
    16            317
    32            335
    64            446
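
A minimal sketch of how per-run times like these could be measured, using a hypothetical `EpochTimer` callback and assuming `model`, `train_ds`, and `val_ds` as defined elsewhere in the post:

    import time
    import tensorflow as tf

    class EpochTimer(tf.keras.callbacks.Callback):
        """Hypothetical helper: prints wall-clock time per epoch."""
        def on_epoch_begin(self, epoch, logs=None):
            self._start = time.time()
        def on_epoch_end(self, epoch, logs=None):
            print(f"epoch {epoch}: {time.time() - self._start:.1f}s")

    # model.fit(train_ds, validation_data=val_ds, epochs=epochs, callbacks=[EpochTimer()])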

This is the model summary:

    Layer (type)                                Output Shape          Param #
    ==========================================================================
    sequential_1 (Sequential)                   (None, 256, 256, 3)   0
    rescaling_2 (Rescaling)                     (None, 256, 256, 3)   0
    conv2d_3 (Conv2D)                           (None, 256, 256, 16)  448
    batch_normalization (BatchNormalization)    (None, 256, 256, 16)  64
    max_pooling2d_3 (MaxPooling2D)              (None, 128, 128, 16)  0
    conv2d_4 (Conv2D)                           (None, 128, 128, 32)  4640
    batch_normalization_1 (BatchNormalization)  (None, 128, 128, 32)  128
    max_pooling2d_4 (MaxPooling2D)              (None, 64, 64, 32)    0
    conv2d_5 (Conv2D)                           (None, 64, 64, 64)    18496
    batch_normalization_2 (BatchNormalization)  (None, 64, 64, 64)    256
    max_pooling2d_5 (MaxPooling2D)              (None, 32, 32, 64)    0
    dropout (Dropout)                           (None, 32, 32, 64)    0
    batch_normalization_3 (BatchNormalization)  (None, 32, 32, 64)    256
    flatten_1 (Flatten)                         (None, 65536)         0
    dense_2 (Dense)                             (None, 128)           8388736
    outputs (Dense)                             (None, 863)           111327
    ==========================================================================
    Total params: 8,524,351
    Trainable params: 8,523,999
    Non-trainable params: 352
    __________________________________________________________________________

This is how I load the data:

    folder = "some-folder"
    train_ds = tf.keras.utils.image_dataset_from_directory(
        folder,
        validation_split=0.2,
        subset="training",
        seed=1,
        image_size=image_size,
        batch_size=batch_size
    )
    val_ds = tf.keras.utils.image_dataset_from_directory(
        folder,
        validation_split=0.2,
        subset="validation",
        seed=1,
        image_size=image_size,
        batch_size=batch_size
    )

And this is the autotune part:

    AUTOTUNE = tf.data.AUTOTUNE
    train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

    normalization_layer = layers.Rescaling(1. / 255)
    normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
    niter = iter(normalized_ds)
    image_batch, labels_batch = next(niter)
    first_image = image_batch[0]
    # Notice the pixel values are now in [0, 1].
    print(np.min(first_image), np.max(first_image))
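
Given that the GPU sits almost idle while the CPU stays busy, the input pipeline is a plausible bottleneck. A minimal sketch of two things worth trying, assuming the dataset objects above (the cache file name and log directory are placeholders): caching decoded images to a file instead of RAM, and profiling a few batches with the TensorBoard callback:

    import tensorflow as tf

    AUTOTUNE = tf.data.AUTOTUNE

    # Variant of the caching line above: cache decoded/resized images to a
    # local file so later epochs skip image decoding entirely.
    train_ds = train_ds.cache("train_cache.tfdata").shuffle(1000).prefetch(buffer_size=AUTOTUNE)

    # Profile batches 10-20; the TensorBoard profiler then shows whether
    # time is spent in the tf.data pipeline or in GPU kernels.
    tb_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(10, 20))
    # model.fit(train_ds, validation_data=val_ds, epochs=epochs, callbacks=[tb_cb])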

Answer 1

Score: 0

The RTX 4060 series is not CUDA toolkit compatible.


Posted by huangapple on 2023-07-27 14:55:18. Original link: https://go.coder-hub.com/76777171.html