Transformers tokenizer attention mask for PyTorch


Question

In my code I have:

output = self.decoder(output, embedded, tgt_mask=attention_mask)

where

decoder_layer = TransformerDecoderLayer(embedding_size, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = TransformerDecoder(decoder_layer, 1)

I generate the attention mask using a Hugging Face tokenizer:

batch = tokenizer(example['text'], return_tensors="pt", truncation=True, max_length=1024, padding='max_length')
inputs = batch['input_ids']
attention_mask = batch['attention_mask']

Running it through the model fails with:

AssertionError: only bool and floating types of attn_mask are supported

Changing the attention mask to

attention_mask = batch['attention_mask'].bool()

causes:

RuntimeError: The shape of the 2D attn_mask is torch.Size([4, 1024]), but should be (1024, 1024)

Any idea how I can use a Hugging Face tokenizer with my own PyTorch module?


Answer 1

Score: 2


PyTorch's tgt_mask is not the same as Hugging Face's attention_mask. The latter indicates which tokens are padding:

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained("bert-base-cased")

# Pad "this is a test" to 10 tokens; attention_mask marks real tokens with 1 and padding with 0.
encoded = t("this is a test", max_length=10, padding="max_length")
print(t.pad_token_id)
print(encoded.input_ids)
print(encoded.attention_mask)

Output:

0
[101, 1142, 1110, 170, 2774, 102, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

PyTorch's equivalent to that is tgt_key_padding_mask.
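For example, a minimal sketch using the variable names from the question (note the inverted convention: PyTorch's key-padding mask is True where a position should be ignored, while Hugging Face uses 1 for real tokens and 0 for padding):

tgt_key_padding_mask = batch['attention_mask'] == 0  # (batch_size, seq_len), bool; True = padding

output = self.decoder(output, embedded, tgt_key_padding_mask=tgt_key_padding_mask)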

The tgt_mask, on the other hand, serves a different purpose: it defines which tokens are allowed to attend to which other tokens. For an NLP transformer decoder, it is usually used to prevent tokens from attending to future tokens (a causal mask). If that is your use case, you can also simply pass tgt_is_causal=True and PyTorch will create the tgt_mask for you.
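Putting both masks together, here is a minimal, self-contained sketch (the dimensions are invented for illustration; a reasonably recent PyTorch version is assumed) that passes an explicit boolean causal mask as tgt_mask alongside the padding mask derived from the tokenizer output:

import torch
from torch.nn import TransformerDecoder, TransformerDecoderLayer

# Hypothetical dimensions for illustration only.
embedding_size, num_heads, hidden_size, dropout = 64, 4, 128, 0.1
batch_size, seq_len = 2, 16

decoder_layer = TransformerDecoderLayer(embedding_size, num_heads, hidden_size, dropout, batch_first=True)
decoder = TransformerDecoder(decoder_layer, 1)

tgt = torch.randn(batch_size, seq_len, embedding_size)     # stand-in for `output` in the question
memory = torch.randn(batch_size, seq_len, embedding_size)  # stand-in for `embedded` in the question

# Causal mask, shape (seq_len, seq_len): True above the diagonal means
# "this position may not attend to that (future) position".
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Padding mask, shape (batch_size, seq_len): pretend the last 4 positions are padding,
# as the tokenizer's attention_mask would indicate with zeros.
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
attention_mask[:, -4:] = 0
tgt_key_padding_mask = attention_mask == 0

out = decoder(tgt, memory, tgt_mask=causal_mask, tgt_key_padding_mask=tgt_key_padding_mask)
print(out.shape)  # torch.Size([2, 16, 64])

Using the same dtype (here bool) for both masks also sidesteps the warning some PyTorch versions emit about mixed mask types; torch.nn.Transformer.generate_square_subsequent_mask(seq_len) is another way to build the causal part, as a float mask.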
