英文:
Tensorflow - Text Classification - Shapes (None,) and (None, 250, 100) are incompatible error
问题
我想对具有多个标签的文本进行分类。我使用TextVectorization层和CategoricalCrossEntropy函数。以下是我的模型代码:
Text Vectorizer:
def custom_standardization(input_data):
print(input_data[:5])
lowercase = tf.strings.lower(input_data)
stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
return tf.strings.regex_replace(stripped_html,
'[%s]' % re.escape(string.punctuation),
'')
max_features = 10000
sequence_length = 250
vectorize_layer = layers.TextVectorization(
standardize=custom_standardization,
max_tokens=max_features,
output_mode='int',
output_sequence_length=sequence_length)
模型生成:
MAX_TOKENS_NUM = 5000 # 最大词汇量大小。
MAX_SEQUENCE_LEN = 40 # 用于填充输出的序列长度。
EMBEDDING_DIMS = 100
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.summary()
model.compile(loss=losses.CategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=tf.metrics.CategoricalAccuracy())
FIT:
epochs = 10
history = model.fit(
x_train,
y=y_train,
epochs=epochs)
x_train
是一个文本列表,类似于['This is a text about science.', 'This is a text about art', ...]
y_train
也是一个文本列表,类似于['Science','Art', ...]
当我尝试运行拟合代码时,它会产生以下错误:
ValueError: Shapes (None,) and (None, 250, 100) are incompatible
我做错了什么?此外,我想了解这是否是用于多标签分类的好方法/模型?
编辑:
根据Frightera的回答,我编辑了我的代码。以下是我的模型:
MAX_TOKENS_NUM = 5000 # 最大词汇量大小。
MAX_SEQUENCE_LEN = 40 # 用于填充输出的序列长度。
EMBEDDING_DIMS = 100
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.add(layers.Dropout(0.2))
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dropout(0.2))
model.add(layers.Dense(len(labels)))
model.summary()
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=tf.metrics.SparseCategoricalAccuracy())
我通过使用y_train_int = [get_label_index(label) for label in y_train]
将类别转换为索引,而不是使用y_train
,传递了y_train_int
。
epochs = 10
history = model.fit(
x_train,
y=y_train_int,
epochs=epochs)
现在模型适应了,但是当我使用plt.plot(history.history['loss'])
检查损失函数时,它是一条全为零的线,如下所示:
这个模型是否适合分类?我是否需要在输入层和最终Dense层之间添加这些层(例如Embedding)?我做错了什么?
编辑2:
我现在有上面的模型。我使用SparseCategoricalEntropy并将长度为78的标签传递给最后的Dense层,现在模型适应了。
现在,当我使用model.predict(x_test)
时,它给出以下结果:
array([[ 1.3232083 , 3.4263668 , 0.3206688 , ..., -1.9279423 ,
-0.83103067, -5.3442082 ],
[ 0.11507592, -2.0753977 , -0.07149621, ..., -0.27729607,
-1.132122 , -2.4074485 ],
[ 0.87828857, -0.5063573 , 1.5770453 , ..., 0.72519284,
0.50958884, 3.7006462 ],
...,
[ 0.35316354, -3.1919005 , -0.25520897, ..., -1.648859 ,
-2.2707412 , -4.321298 ],
[ 0.89357865, 1.3001428 , 0.17324057, ..., -0.8185719 ,
-1.4108973 , -3.674326 ],
[ 1.6258209 , -0.59622926, 0.7382731 , ..., -0.8473997 ,
-0.90670204, -4.043623 ]], dtype=float32)
如何将这些转换为标签?
英文:
I want to classify text with multiple labels. I use TextVectorization layer and CategoricalCrossEntropy function. Here is my model code:
Text Vectorizer:
def custom_standardization(input_data):
print(input_data[:5])
lowercase = tf.strings.lower(input_data)
stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
return tf.strings.regex_replace(stripped_html,
'[%s]' % re.escape(string.punctuation),
'')
max_features = 10000
sequence_length = 250
vectorize_layer = layers.TextVectorization(
standardize=custom_standardization,
max_tokens=max_features,
output_mode='int',
output_sequence_length=sequence_length)
Model generation:
MAX_TOKENS_NUM = 5000 # Maximum vocab size.
MAX_SEQUENCE_LEN = 40 # Sequence length to pad the outputs to.
EMBEDDING_DIMS = 100
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.summary()
model.compile(loss=losses.CategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=tf.metrics.CategoricalAccuracy())
FIT :
epochs = 10
history = model.fit(
x_train,
y=y_train,
epochs=epochs)
x_train
is a list of texts like ['This is a text about science.', 'This is a text about art',...]
y_train
also is a list of texts like ['Science','Art',...]
When I try to run fitting code it gives the following error:
ValueError: Shapes (None,) and (None, 250, 100) are incompatible
What am i doing wrong? And also I'd like to learn if it's a good approach/model for classifying test with multiple labels?
EDIT:
I edited my code according to Frightera's answer. Here is my model:
MAX_TOKENS_NUM = 5000 # Maximum vocab size.
MAX_SEQUENCE_LEN = 40 # Sequence length to pad the outputs to.
EMBEDDING_DIMS = 100
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.add(layers.Dropout(0.2))
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dropout(0.2))
model.add(layers.Dense(len(labels)))
model.summary()
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=tf.metrics.SparseCategoricalAccuracy())
And I pass y_train_int
instead of y_train
by converting categories to indexes with y_train_int = [get_label_index(label) for label in y_train]
epochs = 10
history = model.fit(
x_train,
y=y_train_int,
epochs=epochs)
Now the model fits, but when I check loss function with plt.plot(history.history['loss'])
it's an all zero line like below:
Is this model good for classification. Do I need those layers between input layer and final Dense Layer(Embedding etc.)? What am I doing wrong?
EDIT 2:
I have the above model now. I am using SparseCategoricalEntropy and passing to the last Dense layer length of labels which is 78 and now it fits the model.
Now when I use model.predict(x_test)
, it gives following results:
array([[ 1.3232083 , 3.4263668 , 0.3206688 , ..., -1.9279423 ,
-0.83103067, -5.3442082 ],
[ 0.11507592, -2.0753977 , -0.07149621, ..., -0.27729607,
-1.132122 , -2.4074485 ],
[ 0.87828857, -0.5063573 , 1.5770453 , ..., 0.72519284,
0.50958884, 3.7006462 ],
...,
[ 0.35316354, -3.1919005 , -0.25520897, ..., -1.648859 ,
-2.2707412 , -4.321298 ],
[ 0.89357865, 1.3001428 , 0.17324057, ..., -0.8185719 ,
-1.4108973 , -3.674326 ],
[ 1.6258209 , -0.59622926, 0.7382731 , ..., -0.8473997 ,
-0.90670204, -4.043623 ]], dtype=float32)
How can I convert these to labels?
答案1
得分: 0
我根据评论如下解决了文本分类问题:
- 在模型末尾使用具有唯一标签数量的Dense层。
- 将字符串类别标签转换为索引,并在模型中使用SparseCategoricalCrossEntropy和SparseCategoricalAccuracy。
- 在将结果转换为字符串标签时,获取最大值的输出并在标签列表中获取其索引。
英文:
I resolved this according to the comments as follows for text classification:
- Use Dense layer with number of unique labels in the end of the model.
- Convert string category labels to indexes and use SparseCategoricalCrossEntropy and SparseCategoricalAccuracy in the model.
- When converting results to string labels, get the max valued output and get index of it in the labels list.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论