January 9, 2023, 01:14:23

Tensorflow - Text Classification - Shapes (None,) and (None, 250, 100) are incompatible error

Question

I want to classify text with multiple labels. I use a TextVectorization layer and the CategoricalCrossentropy loss. Here is my model code:

Text Vectorizer:

import re
import string

import tensorflow as tf
from tensorflow.keras import layers, losses

def custom_standardization(input_data):
  print(input_data[:5])  # debug: show the first few raw inputs
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

Model generation:

MAX_TOKENS_NUM = 5000  # Maximum vocab size.
MAX_SEQUENCE_LEN = 40  # Sequence length to pad the outputs to.
EMBEDDING_DIMS = 100

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.summary()
model.compile(loss=losses.CategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.metrics.CategoricalAccuracy()])

Fit:

epochs = 10
history = model.fit(
    x_train,
    y=y_train,
    epochs=epochs)

x_train is a list of texts, such as ['This is a text about science.', 'This is a text about art', ...]

y_train is also a list of strings, such as ['Science', 'Art', ...]

When I run the fitting code, it raises the following error:

ValueError: Shapes (None,) and (None, 250, 100) are incompatible

What am I doing wrong? I would also like to know whether this is a good approach/model for classifying text with multiple labels.
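
For reference, here is a rough NumPy sketch of where the two shapes in the error come from. It does not use TensorFlow; the arrays are stand-ins for what the layers produce, and the batch size and class count are made-up values for illustration. A model that ends at the Embedding layer returns one 100-dimensional vector per token, so its output is (batch, 250, 100), while the labels have shape (batch,). Pooling over the sequence axis and projecting to one score per class gives (batch, num_classes), which is the shape a classification loss can compare with the labels.

```python
import numpy as np

# Sizes taken from the question, except batch_size and num_classes (illustrative).
batch_size, sequence_length, embedding_dims, num_classes = 4, 250, 100, 3
rng = np.random.default_rng(0)

# Stand-in for the vectorizer output: one row of token ids per text.
token_ids = rng.integers(0, 5001, size=(batch_size, sequence_length))

# Stand-in for the Embedding layer: one vector per token id.
embedding_table = rng.normal(size=(5001, embedding_dims))
embedded = embedding_table[token_ids]
print(embedded.shape)  # (4, 250, 100): the output shape named in the error

# Stand-in for GlobalAveragePooling1D: average over the sequence axis.
pooled = embedded.mean(axis=1)
print(pooled.shape)  # (4, 100)

# Stand-in for Dense(num_classes): one score (logit) per class.
dense_weights = rng.normal(size=(embedding_dims, num_classes))
logits = pooled @ dense_weights
print(logits.shape)  # (4, 3)
```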

EDIT:

I edited my code according to Frightera's answer. Here is my model:

MAX_TOKENS_NUM = 5000  # Maximum vocab size.
MAX_SEQUENCE_LEN = 40  # Sequence length to pad the outputs to.
EMBEDDING_DIMS = 100

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(tf.keras.layers.Embedding(MAX_TOKENS_NUM + 1, EMBEDDING_DIMS))
model.add(layers.Dropout(0.2))
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dropout(0.2))
model.add(layers.Dense(len(labels)))
model.summary()
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.metrics.SparseCategoricalAccuracy()])
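
One detail in the code above may be worth checking: vectorize_layer is created with max_tokens=10000, while the Embedding layer has only MAX_TOKENS_NUM + 1 = 5001 rows. If the vocabulary grows past 5001 entries, some token ids would point outside the embedding table. The sketch below illustrates that with a plain NumPy table; the constants come from the question, and the lookup is only an analogy for what the Embedding layer does, not TensorFlow's actual behavior.

```python
import numpy as np

max_features = 10000       # max_tokens passed to TextVectorization
MAX_TOKENS_NUM = 5000      # used to size the Embedding layer
embedding_rows = MAX_TOKENS_NUM + 1

# With max_tokens=10000 the vectorizer can emit ids from 0 to 9999.
largest_possible_id = max_features - 1
print(largest_possible_id < embedding_rows)  # False

# NumPy stand-in for the embedding table: looking up id 9999 fails.
embedding_table = np.zeros((embedding_rows, 100), dtype=np.float32)
lookup_failed = False
try:
    embedding_table[largest_possible_id]
except IndexError as error:
    lookup_failed = True
    print('lookup failed:', error)
```

Sizing the Embedding layer from the same constant that is passed as max_tokens would keep the two consistent.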

I convert the categories to indexes with y_train_int = [get_label_index(label) for label in y_train] and pass y_train_int instead of y_train:
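
The get_label_index helper is not shown in the question. A minimal sketch of one way it could work is given below, assuming that labels is the sorted list of unique category names; the sample y_train values are made up. Converting the result to a NumPy integer array is also an assumption here, chosen so that the targets do not depend on how fit handles plain Python lists.

```python
import numpy as np

y_train = ['Science', 'Art', 'Science', 'History']  # made-up sample labels

# Fixed, sorted list of unique labels; a label's position is its class index.
labels = sorted(set(y_train))
label_to_index = {label: index for index, label in enumerate(labels)}

def get_label_index(label):
    # Hypothetical helper: look up the integer class index of a label string.
    return label_to_index[label]

y_train_int = np.array([get_label_index(label) for label in y_train],
                       dtype=np.int32)
print(labels)       # ['Art', 'History', 'Science']
print(y_train_int)  # [2 0 2 1]
```

The same labels list, in the same order, is needed later to map predicted indexes back to names.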

epochs = 10
history = model.fit(
    x_train,
    y=y_train_int,
    epochs=epochs)

Now the model fits, but when I plot the loss with plt.plot(history.history['loss']), it is a flat line at zero.

Is this model suitable for classification? Do I need the layers between the input layer and the final Dense layer (Embedding, etc.)? What am I doing wrong?

EDIT 2:
I have the above model now. I am using SparseCategoricalCrossentropy and passing the number of labels, which is 78, to the last Dense layer, and the model now fits.

Now when I call model.predict(x_test), it gives the following results:

array([[ 1.3232083 ,  3.4263668 ,  0.3206688 , ..., -1.9279423 ,
        -0.83103067, -5.3442082 ],
       [ 0.11507592, -2.0753977 , -0.07149621, ..., -0.27729607,
        -1.132122  , -2.4074485 ],
       [ 0.87828857, -0.5063573 ,  1.5770453 , ...,  0.72519284,
         0.50958884,  3.7006462 ],
       ...,
       [ 0.35316354, -3.1919005 , -0.25520897, ..., -1.648859  ,
        -2.2707412 , -4.321298  ],
       [ 0.89357865,  1.3001428 ,  0.17324057, ..., -0.8185719 ,
        -1.4108973 , -3.674326  ],
       [ 1.6258209 , -0.59622926,  0.7382731 , ..., -0.8473997 ,
        -0.90670204, -4.043623  ]], dtype=float32)

How can I convert these to labels?

Answer 1

Score: 0

Following the comments, I resolved the text classification problem as follows:

Put a Dense layer with one unit per unique label at the end of the model.
Convert the string category labels to integer indexes, and use SparseCategoricalCrossentropy and SparseCategoricalAccuracy in the model.
To convert the results to string labels, take the index of the highest-valued output in each row and look it up in the labels list.
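
The last step can be sketched roughly as follows. The logits array is a small stand-in for the output of model.predict(x_test), and the three-entry labels list is hypothetical (the real model has 78 labels); it has to be in the same order that was used to build the integer indexes. Because the model is compiled with from_logits=True, the outputs are raw scores, so the optional softmax at the end is only needed if probabilities are wanted.

```python
import numpy as np

# Hypothetical label list, in the same order used to create the class indexes.
labels = ['Art', 'History', 'Science']

# Stand-in for model.predict(x_test): one row of logits per input text.
logits = np.array([[1.3, 3.4, 0.3],
                   [0.1, -2.1, -0.07],
                   [0.9, -0.5, 1.6]], dtype=np.float32)

# Index of the largest logit in each row is the predicted class index.
predicted_indices = np.argmax(logits, axis=1)
predicted_labels = [labels[index] for index in predicted_indices]
print(predicted_labels)  # ['History', 'Art', 'Science']

# Optional: softmax turns each row of logits into probabilities.
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
```

Softmax does not change which class has the highest score, so argmax on the logits and on the probabilities gives the same labels.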
