How do I allow a text input to a TensorFlow model?

Question


I'm working on a custom text classification model in TensorFlow and would now like to set it up with TensorFlow Serving for production deployment. The model predicts on the basis of a text embedding that's computed via a separate model, and that model requires the raw text to be encoded as a vector.

I have this working in a somewhat disjointed way right now: one service does all the text preprocessing and computes the embedding, and the embedded text vector is then sent to the text classifier. It would be nice if we could bundle all of this into one TensorFlow Serving model, especially the initial text preprocessing step.

And that's where I'm stuck. How do you construct a Tensor (or other TensorFlow primitive) that is a raw text input? And do you need to do anything special to earmark the lookup table for the token-vector component mapping so that it gets saved out as part of the model bundle?

For reference, here's a rough approximation of what I have now:

input = tf.placeholder(tf.float32, [None, 510], name='input')

# lots of steps omitted for brevity/clarity

outputs = tf.linalg.matmul(outputs, terminal_layer, transpose_b=True, name='output')

sess = tf.Session()
tf.saved_model.simple_save(sess,
                           'model.pb',
                           inputs={'input': input}, outputs={'output': outputs})

Answer 1

Score: 0


This turns out to be relatively straightforward thanks to tf.lookup.StaticVocabularyTable, which ships with TensorFlow itself.

My model uses a bag-of-words approach rather than preserving token order, though switching to an order-preserving encoding would be a fairly simple change to the code.

Assuming you have a list object that encodes your vocabulary (which I've called vocab) and a matrix of corresponding term/token embeddings you want to use (which I've called raw_term_embeddings, since I'm coercing that into a Tensor), the code will look something like this:

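import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-mode API (tf.placeholder, tf.Session)

sess = tf.Session()

# Map each vocabulary term to its row index; the table expects int64 values
initializer = tf.lookup.KeyValueTensorInitializer(vocab, np.arange(len(vocab), dtype=np.int64))
lut = tf.lookup.StaticVocabularyTable(initializer, 1) # the 1 here is the number of out-of-vocabulary buckets
lut.initializer.run(session=sess) # initializes the LUT in the session

# Raw text input: a [batch, tokens] tensor of already-tokenized strings
input = tf.placeholder(tf.string, [None, None], name='input')

ones_at = lut.lookup(input)
# Bag of words: one-hot every token id, then sum over the token axis -> [batch, vocab + 1]
encoded_text = tf.math.reduce_sum(tf.one_hot(ones_at, tf.dtypes.cast(lut.size(), tf.int32)), axis=1)

# I didn't build an embedding for the out-of-vocabulary token, so its row is all zeros
oov_row = np.zeros((1, np.shape(raw_term_embeddings)[1]))
term_embeddings = tf.convert_to_tensor(np.vstack([raw_term_embeddings, oov_row]), dtype=tf.float32)
embedded_text = tf.linalg.matmul(encoded_text, term_embeddings)

# then use embedded_text for the remainder of the model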
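
To make the arithmetic concrete, here is a plain-Python sketch of what the lookup, one-hot sum, and matrix multiply compute for a single document. It is only an illustration: the helper names, the toy vocabulary, and the embedding values are made up, and it assumes a single out-of-vocabulary bucket (at index len(vocab)) whose embedding row is all zeros.

```python
def encode_bag_of_words(tokens, vocab):
    """Count occurrences of each vocabulary term; the last slot counts unknown tokens."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = [0] * (len(vocab) + 1)  # +1 for the out-of-vocabulary bucket
    for token in tokens:
        counts[index.get(token, len(vocab))] += 1
    return counts


def embed(counts, term_embeddings):
    """Multiply the count vector by the embedding matrix (one row per term)."""
    dim = len(term_embeddings[0])
    return [sum(c * row[d] for c, row in zip(counts, term_embeddings))
            for d in range(dim)]


# Hypothetical toy data: three known terms, two embedding dimensions.
vocab = ["good", "bad", "movie"]
term_embeddings = [
    [1.0, 0.0],  # good
    [0.0, 1.0],  # bad
    [0.5, 0.5],  # movie
    [0.0, 0.0],  # out-of-vocabulary bucket (assumed to be all zeros)
]

counts = encode_bag_of_words(["good", "movie", "movie", "zzz"], vocab)
embedded = embed(counts, term_embeddings)
```

Because the counts are summed before the multiply, token order is discarded, which is exactly the bag-of-words behaviour described above.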

One small trick remains: pass legacy_init_op=tf.tables_initializer() to the save function (tf.saved_model.simple_save in the question) so that TensorFlow Serving knows to initialize the text-encoding lookup table when the model is loaded.
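
As a rough end-to-end sketch of that save step, the following exports a tiny bag-of-words graph with a string input and then reloads it to check that the table was initialized. Several things here are assumptions rather than part of the answer: the helper names export_text_model and predict are hypothetical, the graph is reduced to the embedding step only, the out-of-vocabulary row is set to zeros, and the code goes through tf.compat.v1 so that it should also run under TensorFlow 2.x.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x-style graph API, also reachable from TF 2.x


def export_text_model(export_dir, vocab, raw_term_embeddings):
    """Build a minimal string-input graph and export it with its lookup table."""
    with tf.Graph().as_default(), tf1.Session() as sess:
        initializer = tf.lookup.KeyValueTensorInitializer(
            vocab, np.arange(len(vocab), dtype=np.int64))
        lut = tf1.lookup.StaticVocabularyTable(initializer, 1)

        text = tf1.placeholder(tf.string, [None, None], name='input')
        one_hot = tf.one_hot(lut.lookup(text), len(vocab) + 1)
        counts = tf.reduce_sum(one_hot, axis=1)  # [batch, vocab + 1]

        # Zero row for the single out-of-vocabulary bucket (an assumption).
        oov_row = np.zeros((1, raw_term_embeddings.shape[1]))
        table = np.vstack([raw_term_embeddings, oov_row]).astype(np.float32)
        output = tf.linalg.matmul(counts, tf.constant(table), name='output')

        # legacy_init_op makes the loader run the table initializers.
        tf1.saved_model.simple_save(
            sess, export_dir,
            inputs={'input': text}, outputs={'output': output},
            legacy_init_op=tf1.tables_initializer())


def predict(export_dir, batch):
    """Reload the export in a fresh graph and run it on tokenized text."""
    with tf.Graph().as_default(), tf1.Session() as sess:
        tf1.saved_model.loader.load(sess, ['serve'], export_dir)
        return sess.run('output:0', feed_dict={'input:0': batch})
```

Note that the second argument to simple_save is a directory that should not already contain an export, not a single file, so a name like 'model.pb' would end up as a folder holding saved_model.pb.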

Posted by huangapple on January 4, 2020 at 01:05:01
Original link: https://go.coder-hub.com/59582516.html