How to load a WordLevel Tokenizer trained with tokenizers in transformers
# Question
I would like to use the WordLevel encoding method to build my own vocabulary, saving the model as a vocab.json file under the my_word2_token folder. The code below works.
```python
import pandas as pd
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from transformers import BertTokenizerFast
from tokenizers.pre_tokenizers import Whitespace
import os
tokenizer = Tokenizer(models.WordLevel())
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)
tokenizer.train(files=["./data/material.txt"], trainer=trainer)
# We now have a tokenizer for this corpus; check the vocabulary size
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
# Save the trained tokenizer
tokenizer.model.save('./my_word2_token/')
```
But when I try to use BartTokenizer or BertTokenizer to load my `vocab.json`, it does not work. In particular, with BertTokenizer the tokenized results are all [UNK], as shown below.
![UNK pic](https://i.stack.imgur.com/LdD8F.png)
As for BartTokenizer, it raises:
> ValueError: Calling BartTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.

Could anyone help me out?
I would like to use the WordLevel encoding method to build my own vocabulary and tokenize it with WordLevel encoding rather than BPE encoding.
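As a side note on the code above: `tokenizer.model.save('./my_word2_token/')` writes only the model's `vocab.json` (the token-to-id map), not the normalizer or pre-tokenizer. If the goal is only to reload it with the `tokenizers` library itself, a minimal sketch along these lines should work (it assumes `WordLevel.from_file` and re-attaches the pipeline pieces by hand):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Rebuild the WordLevel model from the saved vocab.json
# (path taken from the training code above; unk_token assumed to be "[UNK]")
model = models.WordLevel.from_file("./my_word2_token/vocab.json", unk_token="[UNK]")
reloaded = Tokenizer(model)

# vocab.json does not store the normalizer/pre-tokenizer, so re-attach them
reloaded.normalizer = normalizers.BertNormalizer(lowercase=True)
reloaded.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

print(reloaded.encode("some text").tokens)
```

Alternatively, `tokenizer.save("./my_word2_token/tokenizer.json")` serializes the whole pipeline to a single JSON file that `Tokenizer.from_file()` can restore.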
<details>
<summary>English:</summary>
I would like to use WordLevel encoding method to establish my own wordlists, and it saves the model with a vocab.json under the my_word2_token folder. The code is below and it works.
```python
import pandas as pd
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from transformers import BertTokenizerFast
from tokenizers.pre_tokenizers import Whitespace
import os
tokenizer = Tokenizer(models.WordLevel())
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)
tokenizer.train(files=["./data/material.txt"], trainer=trainer)
# We now have a tokenizer for this corpus; check the vocabulary size
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
# Save the trained tokenizer
tokenizer.model.save('./my_word2_token/')
```
But when I try to use BartTokenizer or BertTokenizer to load my `vocab.json`, it does not work. In particular, with BertTokenizer the tokenized results are all [UNK], as below.
![UNK pic](https://i.stack.imgur.com/LdD8F.png)
As for BartTokenizer, it errors as
> ValueError: Calling BartTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
Could anyone help me out?
I would like to use the WordLevel encoding method to establish my own wordlists and tokenize them using WordLevel encoding, not BPE encoding.
</details>
# Answer 1
**Score**: 2
BartTokenizer and BertTokenizer are classes of the transformers library, and you cannot directly load the tokenizer you generated with them. The transformers library offers a wrapper called [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) to load it:
```python
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)
tokenizer.train(files=["material.txt"], trainer=trainer)
from transformers import PreTrainedTokenizerFast
transformer_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(transformer_tokenizer("马云 祖籍浙江嵊县,生于浙江杭州,中国大陆企业家,中国共产党党员。").input_ids)
```
Output:
[0, 0, 0, 0, 0, 261, 0, 0, 0, 56, 0, 0, 261, 0, 221, 0, 345, 133, 28, 0, 357, 0, 448, 0, 345, 133, 127, 0, 377, 377, 0, 5]
P.S.: Please note that I added the unk parameter in the code below:
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
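Most of the ids in the output above are 0, which here should be the id of [UNK] (the first special token in the trainer's list), presumably because most words in the test sentence are not in the 1400-word vocabulary trained on material.txt. Building on the answer, a minimal sketch of registering the special tokens on the wrapper and saving it to a directory, which also sidesteps the BartTokenizer single-file error (directory name and token strings taken from the code above):

```python
from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizer and declare its special tokens explicitly,
# so that padding/masking and save/reload behave as expected
transformer_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Saving to a directory lets it be reloaded with from_pretrained(),
# which only accepts a directory or model identifier, not a bare vocab.json
transformer_tokenizer.save_pretrained("./my_word2_token/")
reloaded = PreTrainedTokenizerFast.from_pretrained("./my_word2_token/")
```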
English:
BartTokenizer and BertTokenizer are classes of the transformers library and you can't directly load the tokenizer you generated with them. The transformers library offers you a wrapper called PreTrainedTokenizerFast to load it:
```python
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ['[UNK]', "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)
tokenizer.train(files=["material.txt"], trainer=trainer)
from transformers import PreTrainedTokenizerFast
transformer_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(transformer_tokenizer("马云 祖籍浙江嵊县,生于浙江杭州,中国大陆企业家,中国共产党党员。").input_ids)
```
Output:
[0, 0, 0, 0, 0, 261, 0, 0, 0, 56, 0, 0, 261, 0, 221, 0, 345, 133, 28, 0, 357, 0, 448, 0, 345, 133, 127, 0, 377, 377, 0, 5]
P.S.: Please note that I added the unk parameter to:
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))