Constructing a TensorFlow Dataset and applying a TextVectorization layer using the map method


# Question

I'm attempting to construct the input to an embedding layer for an NLP model. However, I'm having problems converting the raw text data into the numerical input required by the embedding layer.

Here is some example data to illustrate what I wish to feed to the NLP model:

```python
# 0 = negative
# 1 = positive
documents = [['topology freaking sucks man, what a waste of time!', 0],
             ['wow bro you a NLP fan? Tell me more I want to know', 1],
             ['you know, I will eventually die', 0],
             ['the secret to happiness is to only be depresssed', 0],
             ['what is the floor without feet', 1],
             ['regicide is permissable only in historical situations', 1],
             ['I do not like delivering wehat based products for I am allergic to wheat', 0],
             ['Why does he ring the large bell every hour?', 0],
             ['Wisdom comes not from experience but from knowing', 1],
             ['Little is known of the inner workings of the feline mind', 1]]
```

Each document contains one sentence and one label. This data format was inspired by the prompt of the tutorial I am working through:

> Your Task
> Your task in this lesson is to design a small document classification problem with 10 documents of one sentence each and associated labels of positive and negative outcomes, and to train a network with word embeddings on these data.

I use the `TextVectorization` layer from the Keras library:

```python
# create the preprocessing layer
VOCAB_SIZE = 500  # maximum vocabulary size across all documents
MAX_SEQUENCE_LENGTH = 50  # maximum number of words/tokens considered per document
# output mode 'int' assigns a unique integer to each token, so in our example
# 'topology' is assigned the value 19. Note that these integers are arbitrarily
# assigned and essentially act as a hash map
int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)
```
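For reference, once the layer has been adapted (see below), the token-to-integer mapping can be inspected with `get_vocabulary()`; a token's index in the returned list is its assigned integer. A quick sanity check (output illustrative):

```python
# inspect the learned vocabulary after int_vectorize_layer.adapt(...) has run;
# index 0 is reserved for padding ('') and index 1 for out-of-vocabulary ('[UNK]')
vocab = int_vectorize_layer.get_vocabulary()
print(vocab[:4])                # e.g. ['', '[UNK]', 'to', 'the']
print(vocab.index('topology'))  # the integer assigned to 'topology'
```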

The issue now becomes applying this vectorization layer to the raw data `documents`. Here is the code I use to convert the raw data into a TensorFlow `Dataset` object:

```python
# applies the adapted layer to a tensorflow dataset
def int_vectorize_text(sentence, label):
  sentence = tf.expand_dims(sentence, -1)
  sentence = tf.squeeze(sentence, axis=-1)
  return int_vectorize_layer(sentence), label

# passes the raw data as a generator to the Dataset.from_generator constructor
def generate_data(sentences, labels):
  for s, l in zip(sentences, labels):
    yield s, l

# split the raw data into training and validation sets
train_docs = documents[:8]
val_docs = documents[8:]

# separate sentences and labels
train_sentences = [d[0] for d in train_docs]
train_labels = [d[1] for d in train_docs]

val_sentences = [d[0] for d in val_docs]
val_labels = [d[1] for d in val_docs]

# convert to tensors
train_sentences_tensor = tf.convert_to_tensor(train_sentences)
train_labels_tensor = tf.convert_to_tensor(train_labels)

val_sentences_tensor = tf.convert_to_tensor(val_sentences)
val_labels_tensor = tf.convert_to_tensor(val_labels)

# build tensorflow Datasets using the generator function above on the newly constructed tensors
train_dataset = tf.data.Dataset.from_generator(
    generate_data, (tf.string, tf.int32), args=(train_sentences_tensor, train_labels_tensor))
val_dataset = tf.data.Dataset.from_generator(
    generate_data, (tf.string, tf.int32), args=(val_sentences_tensor, val_labels_tensor))

# adapt the layer using the training sentences
int_vectorize_layer.adapt(train_sentences)

# this is where the error occurs
int_train_df = train_dataset.map(int_vectorize_text)  # ERROR
int_val_df = val_dataset.map(int_vectorize_text)
```

As you can see, an error occurs when we attempt to map `int_vectorize_text` over the TensorFlow dataset. Specifically, I get the following error (traceback abridged):

```
TypeError: '>' not supported between instances of 'NoneType' and 'int'

Call arguments received by layer 'text_vectorization' (type TextVectorization):
  • inputs=tf.Tensor(shape=<unknown>, dtype=string)
```

This seems to imply that a `NoneType` is being passed somewhere. However, I checked the construction of `train_dataset` and it appears to be correct. Here is what it looks like:
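The listing below can be reproduced by iterating the dataset directly (a minimal sketch, assuming eager execution):

```python
# print each (sentence, label) pair held by the dataset
for element in train_dataset:
  print(element)
```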

```
(<tf.Tensor: shape=(), dtype=string, numpy=b'topology freaking sucks man, what a waste of time!'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'wow bro you a NLP fan? Tell me more I want to know'>, <tf.Tensor: shape=(), dtype=int32, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'you know, I will eventually die'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'the secret to happiness is to only be depresssed'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'what is the floor without feet'>, <tf.Tensor: shape=(), dtype=int32, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'regicide is permissable only in historical situations'>, <tf.Tensor: shape=(), dtype=int32, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'I do not like delivering wehat based products for I am allergic to wheat'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Why does he ring the large bell every hour?'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
```

Furthermore, if I apply `int_vectorize_text` manually in a loop like so:

```python
for x in train_dataset:
  print(int_vectorize_text(x[0], x[1]))
```

no error occurs and I get the desired output. What is going on here?



# Answer 1
**Score**: 2

After reviewing @AloneTogether's clean and more appropriate solution, it appears your issue stems from the `train_dataset` and `val_dataset` definitions. The documentation for `tf.data.Dataset.from_generator` recommends that one

> ... use the `output_signature` argument. In this case the output will be assumed to consist of objects with the classes, shapes and types defined by `tf.TypeSpec` objects from the `output_signature` argument.

Since you didn't use the `output_signature` argument, the call fell back to the deprecated style, which uses either the `output_types` argument alone or together with `output_shapes`. In your case, `output_types` was set to `(tf.string, tf.int32)`, but because you left the `output_shapes` argument empty, the shape defaulted to "unknown".
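You can confirm this by inspecting the dataset's `element_spec` (the exact repr varies slightly across TensorFlow versions):

```python
# with only output_types supplied, both components have an unknown shape
print(train_dataset.element_spec)
# (TensorSpec(shape=<unknown>, dtype=tf.string, name=None),
#  TensorSpec(shape=<unknown>, dtype=tf.int32, name=None))
```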

Later, when `int_vectorize_text` is mapped over the dataset, the `TextVectorization` layer attempts to check whether the rank of its input shape is greater than 1; it receives `shape=<unknown>`, whose rank is `None`, and the `TypeError` manifests when that `None` is compared with an `int`.
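The failing comparison can be reproduced in isolation; a minimal illustration:

```python
import tensorflow as tf

rank = tf.TensorShape(None).rank  # an unknown shape has rank None
print(rank)  # None
rank > 1     # TypeError: '>' not supported between instances of 'NoneType' and 'int'
```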

Knowing all this, you can simply pass `((), ())` as the output shapes in your `from_generator` calls, after the output types `(tf.string, tf.int32)`. Hence, replace the dataset definitions with:

```python
train_dataset = tf.data.Dataset.from_generator(
    generate_data, output_types=(tf.string, tf.int32), output_shapes=((), ()), args=(train_sentences_tensor, train_labels_tensor))

val_dataset = tf.data.Dataset.from_generator(
    generate_data, output_types=(tf.string, tf.int32), output_shapes=((), ()), args=(val_sentences_tensor, val_labels_tensor))
```

Or, use the TensorFlow-recommended way, as @AloneTogether demonstrated:

```python
train_dataset = tf.data.Dataset.from_generator(
    generate_data, output_signature=(
         tf.TensorSpec(shape=(), dtype=tf.string),
         tf.TensorSpec(shape=(), dtype=tf.int32)), args=(train_sentences_tensor, train_labels_tensor))

val_dataset = tf.data.Dataset.from_generator(
    generate_data, output_signature=(
         tf.TensorSpec(shape=(), dtype=tf.string),
         tf.TensorSpec(shape=(), dtype=tf.int32)), args=(val_sentences_tensor, val_labels_tensor))
```
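With either form, the dataset's elements now carry scalar shapes, so the rank check inside `TextVectorization` has a concrete rank to compare; a quick verification:

```python
print(train_dataset.element_spec)
# (TensorSpec(shape=(), dtype=tf.string, name=None),
#  TensorSpec(shape=(), dtype=tf.int32, name=None))

int_train_df = train_dataset.map(int_vectorize_text)  # no TypeError
int_val_df = val_dataset.map(int_vectorize_text)
```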

I've removed my original solution because I don't believe in propagating suboptimal code. Full credit to @AloneTogether for showing how it's supposed to be done. My intent with this edit is to explain the error and why it occurred, so that you and future readers have a better understanding.


# Answer 2
**Score**: 2

Here is an example without `tf.py_function`, as requested by @KyleFHartzenberg:

```python
import tensorflow as tf

# 0 = negative
# 1 = positive
documents = [['topology freaking sucks man, what a waste of time!', 0],
             ['wow bro you a NLP fan? Tell me more I want to know', 1],
             ['you know, I will eventually die', 0],
             ['the secret to happiness is to only be depresssed', 0],
             ['what is the floor without feet', 1],
             ['regicide is permissable only in historical situations', 1],
             ['I do not like delivering wehat based products for I am allergic to wheat', 0],
             ['Why does he ring the large bell every hour?', 0],
             ['Wisdom comes not from experience but from knowing', 1],
             ['Little is known of the inner workings of the feline mind', 1]]

VOCAB_SIZE = 500  # maximum vocabulary size across all documents
MAX_SEQUENCE_LENGTH = 50  # maximum number of words/tokens considered per document
# output mode 'int' assigns a unique integer to each token, so in this example
# 'topology' is assigned the value 19. Note that these integers are arbitrarily
# assigned and essentially act as a hash map
int_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH,
)

def int_vectorize_text(sentence, label):
  return int_vectorize_layer(sentence), label


def generate_data(sentences, labels):
  for s, l in zip(sentences, labels):
    yield s, l

train_docs = documents[:8]
val_docs = documents[8:]

train_sentences = [d[0] for d in train_docs]
train_labels = [d[1] for d in train_docs]

val_sentences = [d[0] for d in val_docs]
val_labels = [d[1] for d in val_docs]

train_sentences_tensor = tf.convert_to_tensor(train_sentences)
train_labels_tensor = tf.convert_to_tensor(train_labels)

train_dataset = tf.data.Dataset.from_generator(
    generate_data, output_signature=(
         tf.TensorSpec(shape=(), dtype=tf.string),
         tf.TensorSpec(shape=(), dtype=tf.int32)), args=(train_sentences_tensor, train_labels_tensor))

# adapt the layer using the training sentences
int_vectorize_layer.adapt(train_dataset.map(lambda x, y: x))
int_train_df = train_dataset.map(int_vectorize_text)

for x, y in int_train_df:
  print(x, y)
  break
```

Output:

```
tf.Tensor(
[19 42 22 34  7 10 17 29 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0], shape=(50,), dtype=int64) tf.Tensor(0, shape=(), dtype=int32)
```
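From here, the vectorized dataset can feed an embedding layer, which is what the question was ultimately building toward. A minimal sketch (the model layout, `EMBEDDING_DIM`, and the batch/epoch settings are illustrative assumptions, not part of the original answer):

```python
EMBEDDING_DIM = 8  # hypothetical embedding width for this toy corpus

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),  # maps token ids to dense vectors
    tf.keras.layers.GlobalAveragePooling1D(),              # averages over the 50-token sequence
    tf.keras.layers.Dense(1, activation='sigmoid'),        # binary positive/negative output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# batch the mapped dataset before training
model.fit(int_train_df.batch(2), epochs=10)
```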
