Nvidia 4060 Ti 8GB slower than CPU in classification

Question
I'm using this tutorial: https://www.tensorflow.org/tutorials/images/classification

In my tests, the CPU takes about 50 seconds to run and the GPU about 7-8 minutes, so I'm guessing I'm doing something wrong.

My CPU is a 10th-generation Intel i5, and the machine has 96 GB of RAM. I would expect the GPU to be at least 2x faster.

I enabled mixed precision to make sure the Tensor Cores are being used:
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
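One detail from the TensorFlow mixed-precision guide that may matter: with the mixed_float16 policy, the guide recommends that the final layer produce float32 outputs for numeric stability, and my outputs layer below does not set this. A sketch of what the guide suggests (not currently in my code):

from tensorflow.keras import layers

num_classes = 863  # per the model summary below

# The mixed-precision guide recommends a float32 output layer for numeric
# stability; the layers before it still compute in float16.
outputs_fp32 = layers.Dense(num_classes, dtype='float32', name="outputs")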
What am I missing? Is an RTX 4060 Ti with 8 GB of VRAM really this slow for a classification model?

I have about 1,000 classes, but that shouldn't matter, since the CPU is a lot faster either way.

I'm using a batch size of 512; VRAM usage sits at 6/8 GB, and the CPU stays around 50% most of the time.
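Since GPU utilization is the puzzling part, here is the basic sanity check that TensorFlow sees the card and places ops on it (standard tf.config / tf.debugging calls):

import tensorflow as tf

# List the devices TensorFlow can see; the 4060 Ti should appear here.
print(tf.config.list_physical_devices('GPU'))

# Log where each op runs; lines ending in GPU:0 confirm GPU placement.
tf.debugging.set_log_device_placement(True)

a = tf.random.normal((1024, 1024))
b = tf.matmul(a, a)  # should land on /device:GPU:0 if the GPU is usable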
I'm also using BatchNormalization:
import tensorflow as tf
from tensorflow.keras import Sequential, layers
from tensorflow.keras.layers import BatchNormalization

# `data_augmentation` is the augmentation pipeline from the tutorial
# (RandomFlip / RandomRotation / RandomZoom); `num_classes` is 863 here.
model = Sequential([
    data_augmentation,
    layers.Rescaling(1./255),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),
    BatchNormalization(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, name="outputs")  # logits (no activation)
])
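For completeness, the compile/fit step follows the tutorial (a sketch; the epoch count here is illustrative, not my exact value):

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=10)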
PS: I'm new to AI.

I tried different batch sizes.

I tried disabling the GPU and running the same test on the CPU only (a sketch of how is below).

I checked RAM, disk, and CPU for bottlenecks; none hit 100%. When I run on the CPU, CPU usage is 100% while the GPU sits at 1% or less.
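For the CPU-only runs, hiding the GPU from TensorFlow looks something like this (a sketch of one standard approach, included for completeness):

import os

# Must be set before TensorFlow is imported; -1 hides all CUDA devices.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf

# Alternative after import: make the GPU invisible to TensorFlow.
# tf.config.set_visible_devices([], 'GPU')

print(tf.config.list_physical_devices('GPU'))  # [] when disabled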
These are the batch tests I've done (a timing sketch follows the table):

Batch size    Time
4             377 s
8             304 s
16            317 s
32            335 s
64            446 s
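A minimal way to reproduce that kind of measurement (a sketch, not my exact harness; make_datasets is a hypothetical helper that rebuilds the datasets for a given batch size):

import time

for batch_size in (4, 8, 16, 32, 64):
    train_ds, val_ds = make_datasets(batch_size)  # hypothetical helper
    start = time.perf_counter()
    model.fit(train_ds, validation_data=val_ds, epochs=1, verbose=0)
    print(batch_size, f"{time.perf_counter() - start:.0f}s")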
And this is the model:
Layer (type)                                Output Shape          Param #
=========================================================================
sequential_1 (Sequential)                   (None, 256, 256, 3)   0
rescaling_2 (Rescaling)                     (None, 256, 256, 3)   0
conv2d_3 (Conv2D)                           (None, 256, 256, 16)  448
batch_normalization (BatchNormalization)    (None, 256, 256, 16)  64
max_pooling2d_3 (MaxPooling2D)              (None, 128, 128, 16)  0
conv2d_4 (Conv2D)                           (None, 128, 128, 32)  4640
batch_normalization_1 (BatchNormalization)  (None, 128, 128, 32)  128
max_pooling2d_4 (MaxPooling2D)              (None, 64, 64, 32)    0
conv2d_5 (Conv2D)                           (None, 64, 64, 64)    18496
batch_normalization_2 (BatchNormalization)  (None, 64, 64, 64)    256
max_pooling2d_5 (MaxPooling2D)              (None, 32, 32, 64)    0
dropout (Dropout)                           (None, 32, 32, 64)    0
batch_normalization_3 (BatchNormalization)  (None, 32, 32, 64)    256
flatten_1 (Flatten)                         (None, 65536)         0
dense_2 (Dense)                             (None, 128)           8388736
outputs (Dense)                             (None, 863)           111327
=========================================================================
Total params: 8,524,351
Trainable params: 8,523,999
Non-trainable params: 352
_________________________________________________________________
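For context on where the parameters live: Flatten outputs 32 * 32 * 64 = 65,536 features, so the first Dense layer alone accounts for 65,536 * 128 + 128 = 8,388,736 parameters, i.e. almost the entire model.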
And this is how I load the data:
import tensorflow as tf

folder = "some-folder"
image_size = (256, 256)  # matches the model summary above
batch_size = 512         # 4-64 in the batch tests

train_ds = tf.keras.utils.image_dataset_from_directory(
    folder,
    validation_split=0.2,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    folder,
    validation_split=0.2,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size)
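A quick check that the directory scan found what I expect (this has to run on the freshly loaded datasets, before cache/shuffle/prefetch, since those wrappers drop the attribute):

# image_dataset_from_directory exposes the inferred labels; this should
# report 863 classes, matching the `outputs` layer in the summary above.
print(len(train_ds.class_names))
print(train_ds.class_names[:5])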
And this is the AUTOTUNE part:

import numpy as np

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

# Sanity check from the tutorial: rescale one batch and inspect its range.
# (The model also applies its own Rescaling layer during training.)
normalization_layer = layers.Rescaling(1. / 255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
niter = iter(normalized_ds)
image_batch, labels_batch = next(niter)
first_image = image_batch[0]
# Notice the pixel values are now in [0, 1].
print(np.min(first_image), np.max(first_image))
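Given that the GPU sits near 1% utilization, one thing worth measuring is whether the input pipeline alone can keep up. A sketch that times pure dataset iteration, with no model in the loop:

import time

# Time one full pass over the training data without the model. If this
# takes a large share of an epoch's wall time, the bottleneck is the
# CPU-side input pipeline (JPEG decode/resize), not the GPU itself.
start = time.perf_counter()
num_batches = 0
for images, labels in train_ds:
    num_batches += 1
print(f"{num_batches} batches in {time.perf_counter() - start:.1f}s")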
Answer 1

Score: 0
The RTX 4060 series is not CUDA toolkit compatible.