Nvidia 4060 Ti 8GB slower than CPU in classification

Question
I'm using this tutorial: https://www.tensorflow.org/tutorials/images/classification

In my tests, the CPU takes about 50 seconds to run and the GPU about 7-8 minutes, so I'm guessing I'm doing something wrong.

My CPU is a 10th-generation Intel i5, and the machine has 96 GB of RAM. I would expect the GPU to be at least 2x faster.

I enabled mixed precision to make sure the Tensor Cores are being used:
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
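One detail from the TensorFlow mixed-precision guide that may matter: with the mixed_float16 policy, the guide recommends that the final layer produce float32 outputs for numeric stability, and my outputs layer below does not set this. A sketch of what the guide suggests (not currently in my code):

from tensorflow.keras import layers

num_classes = 863  # per the model summary below

# The mixed-precision guide recommends a float32 output layer for numeric
# stability; the layers before it still compute in float16.
outputs_fp32 = layers.Dense(num_classes, dtype='float32', name="outputs")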
What am I missing? Is an RTX 4060 Ti with 8 GB of VRAM really this slow for a classification model?

I have about 1,000 classes, but that shouldn't matter, since the CPU is a lot faster either way.

I'm using a batch size of 512; VRAM usage sits at 6/8 GB, and the CPU stays around 50% most of the time.
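Since GPU utilization is the puzzling part, here is the basic sanity check that TensorFlow sees the card and places ops on it (standard tf.config / tf.debugging calls):

import tensorflow as tf

# List the devices TensorFlow can see; the 4060 Ti should appear here.
print(tf.config.list_physical_devices('GPU'))

# Log where each op runs; lines ending in GPU:0 confirm GPU placement.
tf.debugging.set_log_device_placement(True)

a = tf.random.normal((1024, 1024))
b = tf.matmul(a, a)  # should land on /device:GPU:0 if the GPU is usable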
I'm also using BatchNormalization:
import tensorflow as tf
from tensorflow.keras import Sequential, layers
from tensorflow.keras.layers import BatchNormalization

# `data_augmentation` is the augmentation pipeline from the tutorial
# (RandomFlip / RandomRotation / RandomZoom); `num_classes` is 863 here.
model = Sequential([
    data_augmentation,
    layers.Rescaling(1./255),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),
    BatchNormalization(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, name="outputs")  # logits (no activation)
])
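For completeness, the compile/fit step follows the tutorial (a sketch; the epoch count here is illustrative, not my exact value):

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=10)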
PS: I'm new to AI.

I tried different batch sizes.

I tried disabling the GPU and running the same test on the CPU only (a sketch of how is below).

I checked RAM, disk, and CPU for bottlenecks; none hit 100%. When I run on the CPU, CPU usage is 100% while the GPU sits at 1% or less.
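For the CPU-only runs, hiding the GPU from TensorFlow looks something like this (a sketch of one standard approach, included for completeness):

import os

# Must be set before TensorFlow is imported; -1 hides all CUDA devices.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf

# Alternative after import: make the GPU invisible to TensorFlow.
# tf.config.set_visible_devices([], 'GPU')

print(tf.config.list_physical_devices('GPU'))  # [] when disabled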
These are the batch tests I've done (a timing sketch follows the table):

Batch size    Time
4             377 s
8             304 s
16            317 s
32            335 s
64            446 s
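A minimal way to reproduce that kind of measurement (a sketch, not my exact harness; make_datasets is a hypothetical helper that rebuilds the datasets for a given batch size):

import time

for batch_size in (4, 8, 16, 32, 64):
    train_ds, val_ds = make_datasets(batch_size)  # hypothetical helper
    start = time.perf_counter()
    model.fit(train_ds, validation_data=val_ds, epochs=1, verbose=0)
    print(batch_size, f"{time.perf_counter() - start:.0f}s")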
And this is the model:
Layer (type)                                Output Shape          Param #
=========================================================================
sequential_1 (Sequential)                   (None, 256, 256, 3)   0
rescaling_2 (Rescaling)                     (None, 256, 256, 3)   0
conv2d_3 (Conv2D)                           (None, 256, 256, 16)  448
batch_normalization (BatchNormalization)    (None, 256, 256, 16)  64
max_pooling2d_3 (MaxPooling2D)              (None, 128, 128, 16)  0
conv2d_4 (Conv2D)                           (None, 128, 128, 32)  4640
batch_normalization_1 (BatchNormalization)  (None, 128, 128, 32)  128
max_pooling2d_4 (MaxPooling2D)              (None, 64, 64, 32)    0
conv2d_5 (Conv2D)                           (None, 64, 64, 64)    18496
batch_normalization_2 (BatchNormalization)  (None, 64, 64, 64)    256
max_pooling2d_5 (MaxPooling2D)              (None, 32, 32, 64)    0
dropout (Dropout)                           (None, 32, 32, 64)    0
batch_normalization_3 (BatchNormalization)  (None, 32, 32, 64)    256
flatten_1 (Flatten)                         (None, 65536)         0
dense_2 (Dense)                             (None, 128)           8388736
outputs (Dense)                             (None, 863)           111327
=========================================================================
Total params: 8,524,351
Trainable params: 8,523,999
Non-trainable params: 352
_________________________________________________________________
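For context on where the parameters live: Flatten outputs 32 * 32 * 64 = 65,536 features, so the first Dense layer alone accounts for 65,536 * 128 + 128 = 8,388,736 parameters, i.e. almost the entire model.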
And this is how I load the data:
import tensorflow as tf

folder = "some-folder"
image_size = (256, 256)  # matches the model summary above
batch_size = 512         # 4-64 in the batch tests

train_ds = tf.keras.utils.image_dataset_from_directory(
    folder,
    validation_split=0.2,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    folder,
    validation_split=0.2,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size)
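A quick check that the directory scan found what I expect (this has to run on the freshly loaded datasets, before cache/shuffle/prefetch, since those wrappers drop the attribute):

# image_dataset_from_directory exposes the inferred labels; this should
# report 863 classes, matching the `outputs` layer in the summary above.
print(len(train_ds.class_names))
print(train_ds.class_names[:5])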
And this is the AUTOTUNE part:

import numpy as np

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

# Sanity check from the tutorial: rescale one batch and inspect its range.
# (The model also applies its own Rescaling layer during training.)
normalization_layer = layers.Rescaling(1. / 255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
niter = iter(normalized_ds)
image_batch, labels_batch = next(niter)
first_image = image_batch[0]
# Notice the pixel values are now in [0, 1].
print(np.min(first_image), np.max(first_image))
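Given that the GPU sits near 1% utilization, one thing worth measuring is whether the input pipeline alone can keep up. A sketch that times pure dataset iteration, with no model in the loop:

import time

# Time one full pass over the training data without the model. If this
# takes a large share of an epoch's wall time, the bottleneck is the
# CPU-side input pipeline (JPEG decode/resize), not the GPU itself.
start = time.perf_counter()
num_batches = 0
for images, labels in train_ds:
    num_batches += 1
print(f"{num_batches} batches in {time.perf_counter() - start:.1f}s")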
Answer 1

Score: 0
The RTX 4060 series is not CUDA toolkit compatible.