How to use output from T5 model to replace masked tokens in input sequence

Question

I'm working with the T5 model from the Hugging Face Transformers library and I have an input sequence with masked tokens that I want to replace with the output generated by the model. Here's the code.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 marks masked spans with sentinel tokens <extra_id_0>, <extra_id_1>, ...
input_data = "The <extra_id_0> walks in <extra_id_1> park"
input_ids = tokenizer(input_data, return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
output_sequences = tokenizer.batch_decode(sequence_ids)
output_sequences

This code produces the following output:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

What I want to do is replace the masked tokens <extra_id_0> and <extra_id_1> in the input sequence with the corresponding output tokens from the model, so that the final output is:

The park offers walks in the park.

I'm hoping someone can help me with the code to achieve this.

Note the correspondence between masks and answers:

mask in input_data -> answer in output_sequences
<extra_id_0> -> <extra_id_0> park offers (so we extract 'park offers' only)
<extra_id_1> -> <extra_id_1> the (so we extract 'the' only)

Answer 1

Score: 3

"t5模型将以<extra_id开头的标记视为潜在的掩码标记。正如在文档中所写:"
"每个sentinel标记代表这个句子的一个唯一掩码标记,并应以<extra_id_0>、<extra_id_1>等方式开始,一直到<extra_id_99>为止。"
"在输出中,你可以将位于<extra_id_0>和<extra_id_1>之间的文本视为mask_0的输出,将位于<extra_id_1>和<extra_id_2>之间的文本视为mask_1的输出,依此类推。"
"要从生成的输出中提取这些内容,你可以使用以下代码片段。它将以掩码数作为输入,并返回一个字符串列表作为输出,其中每个元素表示相应掩码的预测文本。"

def extract_text(text, num_masks=1):
    """Return the text predicted for each mask, in order."""
    list_of_text = []
    for i in range(num_masks):
        prev_id = f'<extra_id_{i}>'
        curr_id = f'<extra_id_{i + 1}>'
        st_token_index = text.index(prev_id)
        end_token_index = text.index(curr_id)
        # Using len(prev_id) instead of a hard-coded 12 keeps this correct
        # for two-digit sentinels such as <extra_id_10>; strip() drops the
        # leading space T5 emits before each predicted span.
        list_of_text.append(text[st_token_index + len(prev_id):end_token_index].strip())
    return list_of_text
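
To answer the original question directly, you can then substitute each extracted span back into the input. A minimal sketch of that reassembly (this loop is illustrative, not part of the original answer; it assumes the model emitted a sentinel after the last predicted span, as in the example output):

decoded = output_sequences[0]  # '<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>'
fills = extract_text(decoded, num_masks=2)  # ['park offers', 'the']

# Replace each sentinel in the original input with its predicted text
filled = input_data
for i, fill in enumerate(fills):
    filled = filled.replace(f'<extra_id_{i}>', fill)

print(filled)  # The park offers walks in the park.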

"此外,你应该注意,t5实际上并不是用于掩码语言建模任务的最佳选择,如在此处讨论的那样。像BERT这样的模型是专门为这类任务进行训练的,可以直接与huggingface的填充掩码管道一起使用。"

from transformers import pipeline
nlp_fill = pipeline('fill-mask')
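
For example, a short usage sketch (the checkpoint and sentence here are assumptions for illustration; bert-base-uncased is pinned explicitly because the pipeline's default model uses a different mask token):

from transformers import pipeline

# bert-base-uncased marks the masked position with [MASK]
nlp_fill = pipeline('fill-mask', model='bert-base-uncased')
for pred in nlp_fill("The dog [MASK] in the park."):
    # Each prediction carries the candidate token and its probability
    print(pred['token_str'], round(pred['score'], 3))

Unlike T5's multi-sentinel format, this simple form of the pipeline predicts one [MASK] at a time, so an input with several masks needs one call per mask.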
