How to load a WordLevel Tokenizer trained with tokenizers in transformers


Question


I would like to use the WordLevel encoding method to build my own vocabulary and save the model as a vocab.json file under the my_word2_token folder. The code below works:

```python
import pandas as pd
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from transformers import BertTokenizerFast
from tokenizers.pre_tokenizers import Whitespace
import os
tokenizer = Tokenizer(models.WordLevel())
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)

tokenizer.train(files=["./data/material.txt"], trainer=trainer)
# The tokenizer for this corpus is now trained; check the vocabulary size
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
# Save the trained tokenizer
tokenizer.model.save('./my_word2_token/')
```

However, when I try to use BartTokenizer or BertTokenizer to load my `vocab.json`, it does not work. In particular, with BertTokenizer the tokenized results are all [UNK], as shown below.
![UNK pic](https://i.stack.imgur.com/LdD8F.png)

As for BartTokenizer, it fails with:

> ValueError: Calling BartTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
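
The loading attempts that produced these results presumably looked roughly like the sketch below; the exact calls and paths are my assumption, reconstructed from the error messages rather than copied from the original code.

```python
from transformers import BertTokenizer, BartTokenizer

# Assumed reconstruction: pointing from_pretrained() at the single vocab.json file.
# BertTokenizer expects a vocab.txt with one token per line, so feeding it a JSON
# file yields only [UNK]s, while BartTokenizer rejects single-file paths outright.
bert_tok = BertTokenizer.from_pretrained("./my_word2_token/vocab.json")
bart_tok = BartTokenizer.from_pretrained("./my_word2_token/vocab.json")
```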

Could anyone help me out?

I would like to use the WordLevel encoding method to build my own vocabulary and tokenize text with WordLevel encoding rather than BPE encoding.




# Answer 1
**Score**: 2

BartTokenizer and BertTokenizer are classes of the transformers library, and you cannot directly load the tokenizer you generated with them. The transformers library provides a wrapper called [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) to load it:

```python
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)

tokenizer.train(files=["material.txt"], trainer=trainer)

from transformers import PreTrainedTokenizerFast

transformer_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(transformer_tokenizer("马云 祖籍浙江嵊县,生于浙江杭州,中国大陆企业家,中国共产党党员。").input_ids)
```

Output:

[0, 0, 0, 0, 0, 261, 0, 0, 0, 56, 0, 0, 261, 0, 221, 0, 345, 133, 28, 0, 357, 0, 448, 0, 345, 133, 127, 0, 377, 377, 0, 5]
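
Most of those ids are 0, which here should be the id of [UNK] (it is the first entry in special_tokens), so only a few characters of the example sentence were actually found in the trained vocabulary. A quick way to see which positions matched is to map the ids back to tokens, for example:

```python
# Map the ids back to tokens; positions that fell back to [UNK] are the unmatched ones.
ids = transformer_tokenizer("马云 祖籍浙江嵊县,生于浙江杭州,中国大陆企业家,中国共产党党员。").input_ids
print(transformer_tokenizer.convert_ids_to_tokens(ids))
```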

P.S.: Please note that I added the unk parameter to:

tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
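
To make the tokenizer loadable later with from_pretrained, as the question asks, one option is to save the wrapped tokenizer to a directory and reload it from there. A minimal sketch (the directory name and the explicit special-token arguments are my additions):

```python
from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizers.Tokenizer and register the special tokens on the
# wrapper, so that e.g. padding and masking work downstream.
transformer_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Save everything (tokenizer.json plus config files) into a directory ...
transformer_tokenizer.save_pretrained("./my_word2_token/")

# ... and reload it from that directory path, which is exactly what the
# BartTokenizer error message asked for instead of a single-file path.
reloaded = PreTrainedTokenizerFast.from_pretrained("./my_word2_token/")
```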
