Tensorflow - Text Classification - Shapes (None,) and (None, 250, 100) are incompatible error

Question


I want to classify text into multiple labels. I use a TextVectorization layer and the CategoricalCrossentropy loss function. Here is my model code:

Text Vectorizer:

import re
import string

import tensorflow as tf
from tensorflow.keras import layers, losses

def custom_standardization(input_data):
  print(input_data[:5])
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

Model generation:

MAX_TOKENS_NUM = 5000  # Maximum vocab size.
MAX_SEQUENCE_LEN = 40  # Sequence length to pad the outputs to.
EMBEDDING_DIMS = 100

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.summary()
model.compile(loss=losses.CategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.CategoricalAccuracy())

Fit:

epochs = 10
history = model.fit(
    x_train,
    y=y_train,
    epochs=epochs)

x_train is a list of texts like ['This is a text about science.', 'This is a text about art',...]

y_train is also a list of texts, like ['Science', 'Art', ...]

When I run the fitting code, it raises the following error:

ValueError: Shapes (None,) and (None, 250, 100) are incompatible

What am I doing wrong? I'd also like to know whether this is a good approach/model for classifying text with multiple labels.

EDIT:

I edited my code according to Frightera's answer. Here is my model:

MAX_TOKENS_NUM = 5000  # Maximum vocab size.
MAX_SEQUENCE_LEN = 40  # Sequence length to pad the outputs to.
EMBEDDING_DIMS = 100

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.add(layers.Dropout(0.2))
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dropout(0.2))
model.add(layers.Dense(len(labels)))
model.summary()
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.SparseCategoricalAccuracy())

I convert the categories to indexes with y_train_int = [get_label_index(label) for label in y_train] and pass y_train_int instead of y_train:
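The helper get_label_index is not shown in the post. A minimal sketch of what it might look like, assuming labels is the sorted list of unique category names (the sample y_train values and the label_to_index dictionary are made up for illustration):

```python
# Hypothetical training labels, in the same form as y_train in the post.
y_train = ["Science", "Art", "Science", "History"]

# Sorted list of unique label names; a label's position is its class index.
labels = sorted(set(y_train))

# A dictionary lookup avoids scanning the list on every call,
# which labels.index(label) would do.
label_to_index = {label: i for i, label in enumerate(labels)}


def get_label_index(label):
    """Return the integer class index for a string label."""
    return label_to_index[label]


y_train_int = [get_label_index(label) for label in y_train]
print(labels)       # ['Art', 'History', 'Science']
print(y_train_int)  # [2, 0, 2, 1]
```

Keeping labels around matters later: the same list is what maps a predicted index back to its string label.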

epochs = 10
history = model.fit(
    x_train,
    y=y_train_int,
    epochs=epochs)

Now the model fits, but when I plot the loss with plt.plot(history.history['loss']), it is a flat line at zero.

Is this model good for classification? Do I need those layers (Embedding etc.) between the input layer and the final Dense layer? What am I doing wrong?

EDIT 2:
I now have the model above. I use SparseCategoricalCrossentropy and pass the number of labels, which is 78, to the last Dense layer, and the model now fits.

Now when I call model.predict(x_test), it returns the following results:

array([[ 1.3232083 ,  3.4263668 ,  0.3206688 , ..., -1.9279423 ,
        -0.83103067, -5.3442082 ],
       [ 0.11507592, -2.0753977 , -0.07149621, ..., -0.27729607,
        -1.132122  , -2.4074485 ],
       [ 0.87828857, -0.5063573 ,  1.5770453 , ...,  0.72519284,
         0.50958884,  3.7006462 ],
       ...,
       [ 0.35316354, -3.1919005 , -0.25520897, ..., -1.648859  ,
        -2.2707412 , -4.321298  ],
       [ 0.89357865,  1.3001428 ,  0.17324057, ..., -0.8185719 ,
        -1.4108973 , -3.674326  ],
       [ 1.6258209 , -0.59622926,  0.7382731 , ..., -0.8473997 ,
        -0.90670204, -4.043623  ]], dtype=float32)

How can I convert these to labels?

Answer 1

Score: 0


Following the comments, I resolved this text classification problem as follows:

  1. Add a Dense layer at the end of the model with one unit per unique label.
  2. Convert the string category labels to integer indexes, and use SparseCategoricalCrossentropy and SparseCategoricalAccuracy in the model.
  3. To convert the results back to string labels, take the index of the largest output value in each row and look up that index in the labels list.
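Step 3 can be sketched in plain Python as follows. This is a minimal illustration, not the code from the original post: the labels list and the logits values are made up, and softmax and predict_label are hypothetical helpers. Because the model is compiled with from_logits=True, its outputs are raw logits; softmax is included only to show how to turn them into probabilities, and it does not change which index is largest.

```python
import math

# Hypothetical label list; each name's position must match the class
# index that was used for it during training.
labels = ["Art", "History", "Science"]

# Hypothetical logits in the layout model.predict returns:
# one row per input text, one column per label.
logits = [
    [1.3, 3.4, 0.3],
    [0.1, -2.0, 2.5],
]


def softmax(row):
    """Convert one row of logits to probabilities that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(row):
    """Return the label whose logit is the largest in this row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return labels[best_index]


predicted = [predict_label(row) for row in logits]
print(predicted)  # ['History', 'Science']
```

On the NumPy array returned by model.predict, np.argmax(predictions, axis=1) gives the same per-row indexes in a single call.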

huangapple
  • Posted on January 9, 2023, 01:14:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75049828.html