How to tell spaCy not to split words with apostrophes using the retokenizer?
Question
I'm stuck on a problem here. I want to use spaCy's word tokenizer, but with one constraint: the tokenizer must not split words that contain an apostrophe (').
Example:
Input string:    "I can't do this"
Current output:  ["I", "ca", "n't", "do", "this"]
Expected output: ["I", "can't", "do", "this"]
My attempt:
import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I can't do this"
doc = nlp(sent)

# indices of tokens that contain an apostrophe (skipping the first token)
positions = [token.i for token in doc if token.i != 0 and "'" in token.text]

with doc.retokenize() as retokenizer:
    for pos in positions:
        # merge each apostrophe token with the token before it
        retokenizer.merge(doc[pos - 1:pos + 1])

for token in doc:
    print(token.text)
This gives me the expected output, but I don't know whether this approach is right. Is there a better way to do the retokenization?
Answer 1
Score: 6
The retokenizer approach works, but the simpler way is to modify the tokenizer so it doesn't split these words in the first place. The contractions with apostrophes that are split like this (don't, can't, I'm, you'll, etc.) are handled by tokenizer exceptions.
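You can check that such a contraction really is handled by an exception entry (a quick sketch; the lookup assumes the default English exception table and a spaCy version new enough to expose the rules property):

import spacy

nlp = spacy.load("en_core_web_sm")
# "can't" is a key in the exception table, which is why the default
# tokenizer splits it into "ca" + "n't"
print("can't" in nlp.tokenizer.rules)  # True with the default exceptions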
With spaCy v2.2.3, you can inspect and set tokenizer exceptions with the property nlp.tokenizer.rules. To remove the exceptions with any kind of apostrophe:
import spacy

nlp = spacy.load('en_core_web_sm')
# drop every exception whose key contains a straight or curly apostrophe
nlp.tokenizer.rules = {key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key and "’" not in key and "‘" not in key}
assert [t.text for t in nlp("can't")] == ["can't"]
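With those exceptions removed, the full sentence from the question should come out as hoped (a quick check using the same pipeline as above):

print([t.text for t in nlp("I can't do this")])
# ['I', "can't", 'do', 'this']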
Be aware that the default models (tagger, parser, NER) provided by spaCy for English won't work as well on texts with this tokenization because they're trained on data with the contractions split.
With older versions of spaCy, you'll have to create a custom tokenizer and pass in a modified rules= after modifying nlp.Defaults.tokenizer_exceptions. Use all the other existing settings (nlp.tokenizer.prefix_search / suffix_search / infix_finditer / token_match) to keep the existing tokenization in all other cases, as in the sketch below.
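A minimal sketch of such a custom tokenizer, assuming a spaCy v2.x release older than v2.2.3 (the filtering mirrors the dict comprehension above):

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')

# filter the default exceptions the same way as above
rules = {
    key: value
    for key, value in nlp.Defaults.tokenizer_exceptions.items()
    if "'" not in key and "’" not in key and "‘" not in key
}

# rebuild the tokenizer with the filtered rules, reusing all the other
# existing settings so every other case tokenizes exactly as before
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)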