How to use output from T5 model to replace masked tokens in input sequence

Question

I'm working with the T5 model from the Hugging Face Transformers library and I have an input sequence with masked tokens that I want to replace with the output generated by the model. Here's the code.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 marks masked spans with sentinel tokens <extra_id_0>, <extra_id_1>, ...
input_data = "The <extra_id_0> walks in <extra_id_1> park"
input_ids = tokenizer(input_data, return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
output_sequences = tokenizer.batch_decode(sequence_ids)
output_sequences

This code produces the following output:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

What I want to do is replace the masked tokens <extra_id_0> and <extra_id_1> in the input sequence with the corresponding output tokens from the model, so that the final output is:

The park offers walks in the park.

I'm hoping someone can help me with the code to achieve this.

Note the correspondence between masks and answers:

mask in input_data -> answer in output_sequences
<extra_id_0> -> <extra_id_0> park offers (so we extract 'park offers' only)
<extra_id_1> -> <extra_id_1> the (so we extract 'the' only)

Answer 1

Score: 3

"t5模型将以<extra_id开头的标记视为潜在的掩码标记。正如在文档中所写:"
"每个sentinel标记代表这个句子的一个唯一掩码标记,并应以<extra_id_0>、<extra_id_1>等方式开始,一直到<extra_id_99>为止。"
"在输出中,你可以将位于<extra_id_0>和<extra_id_1>之间的文本视为mask_0的输出,将位于<extra_id_1>和<extra_id_2>之间的文本视为mask_1的输出,依此类推。"
"要从生成的输出中提取这些内容,你可以使用以下代码片段。它将以掩码数作为输入,并返回一个字符串列表作为输出,其中每个元素表示相应掩码的预测文本。"

def extract_text(text, num_masks=1):
    """Return the text predicted for each mask, in order."""
    list_of_text = []
    for i in range(num_masks):
        prev_id = f'<extra_id_{i}>'
        curr_id = f'<extra_id_{i + 1}>'
        st_token_index = text.index(prev_id)
        end_token_index = text.index(curr_id)
        # Using len(prev_id) instead of a hard-coded 12 keeps this correct
        # for two-digit sentinels such as <extra_id_10>; strip() drops the
        # leading space T5 emits before each predicted span.
        list_of_text.append(text[st_token_index + len(prev_id):end_token_index].strip())
    return list_of_text
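
To answer the original question directly, you can then substitute each extracted span back into the input. A minimal sketch of that reassembly (this loop is illustrative, not part of the original answer; it assumes the model emitted a sentinel after the last predicted span, as in the example output):

decoded = output_sequences[0]  # '<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>'
fills = extract_text(decoded, num_masks=2)  # ['park offers', 'the']

# Replace each sentinel in the original input with its predicted text
filled = input_data
for i, fill in enumerate(fills):
    filled = filled.replace(f'<extra_id_{i}>', fill)

print(filled)  # The park offers walks in the park.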

"此外,你应该注意,t5实际上并不是用于掩码语言建模任务的最佳选择,如在此处讨论的那样。像BERT这样的模型是专门为这类任务进行训练的,可以直接与huggingface的填充掩码管道一起使用。"

from transformers import pipeline
nlp_fill = pipeline('fill-mask')
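
For example, a short usage sketch (the checkpoint and sentence here are assumptions for illustration; bert-base-uncased is pinned explicitly because the pipeline's default model uses a different mask token):

from transformers import pipeline

# bert-base-uncased marks the masked position with [MASK]
nlp_fill = pipeline('fill-mask', model='bert-base-uncased')
for pred in nlp_fill("The dog [MASK] in the park."):
    # Each prediction carries the candidate token and its probability
    print(pred['token_str'], round(pred['score'], 3))

Unlike T5's multi-sentinel format, this simple form of the pipeline predicts one [MASK] at a time, so an input with several masks needs one call per mask.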
