How to tell spaCy not to split any words with apostrophes using the retokenizer?

Question

I'm stuck on a problem here. I'm going to use spaCy's word tokenizer, but I have some constraints, e.g. the tokenizer should not split words that contain apostrophes (').

Example:

Input string: "I can't do this"
Current output: ["I","ca","n't","do","this"]
Expected output: ["I","can't","do","this"]

My attempt:

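import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I can't do this"
doc = nlp(sent)
# Indices of tokens that contain an apostrophe (the first token has no predecessor to merge with)
position = [token.i for token in doc if token.i != 0 and "'" in token.text]
with doc.retokenize() as retokenizer:
    for pos in position:
        # Merge each apostrophe token with the token before it
        retokenizer.merge(doc[pos - 1:pos + 1])
for token in doc:
    print(token.text)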

This gives me the expected output, but I don't know whether this approach is right. Is there a better way to do the retokenization?


Answer 1

Score: 6

The retokenizer approach works, but the simpler way is to modify the tokenizer so it doesn't split these words in the first place. The contractions with apostrophes that are split like this (don't, can't, I'm, you'll, etc.) are handled by tokenizer exceptions.
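On the retokenizer approach from the question: one thing to watch is a run of adjacent tokens that each contain an apostrophe (for example, a double contraction split into three pieces). Merging each one with its predecessor separately would hand the retokenizer overlapping spans, which, as far as I know, spaCy rejects. A minimal sketch of one way around this is to group such runs into a single span first. The helper names `apostrophe_spans` and `merge_apostrophe_tokens` and the `marks` parameter are made up for this illustration and are not part of spaCy.

```python
def apostrophe_spans(texts, marks="'’"):
    """Return (start, end) index pairs, end exclusive.

    Each pair covers a run of tokens that contain one of `marks`,
    plus the token immediately before the run.
    """
    spans = []
    for i, text in enumerate(texts):
        # The first token has nothing before it to merge with.
        if i == 0 or not any(mark in text for mark in marks):
            continue
        if spans and spans[-1][1] == i:
            # The previous token already ends a span: extend that span
            # instead of creating a second, overlapping one.
            spans[-1] = (spans[-1][0], i + 1)
        else:
            spans.append((i - 1, i + 1))
    return spans


def merge_apostrophe_tokens(doc):
    """Merge the computed spans in place on a spaCy Doc and return it."""
    spans = apostrophe_spans([token.text for token in doc])
    with doc.retokenize() as retokenizer:
        for start, end in spans:
            retokenizer.merge(doc[start:end])
    return doc
```

With a loaded pipeline this would be called as `merge_apostrophe_tokens(nlp("I can't do this"))`. The span computation is kept in a plain function so it can be checked on lists of strings without loading a model.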

With spaCy v2.2.3, you can inspect and set tokenizer exceptions with the property nlp.tokenizer.rules. To remove the exceptions with any kind of apostrophe:

import spacy

nlp = spacy.load('en_core_web_sm')
# Keep only the tokenizer exceptions whose key contains no straight or curly apostrophe
nlp.tokenizer.rules = {key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key and "’" not in key and "‘" not in key}
assert [t.text for t in nlp("can't")] == ["can't"]

Be aware that the default models (tagger, parser, NER) provided by spaCy for English won't work as well on texts with this tokenization because they're trained on data with the contractions split.

With older versions of spaCy, you'll have to create a custom tokenizer and pass in a modified rules= after modifying nlp.Defaults.tokenizer_exceptions. Use all the other existing settings (nlp.tokenizer.prefix_search / suffix_search / infix_finditer / token_match) to keep the existing tokenization in all other cases.
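For the older-version route, a rough sketch of what that could look like is below. It builds the filtered rules from `nlp.Defaults.tokenizer_exceptions` and passes the existing settings through unchanged. Two assumptions of mine, not from the answer: it uses `spacy.blank("en")` so that no model download is needed (to my knowledge it carries the same English tokenizer defaults), and the `APOSTROPHES` constant is just a name chosen for this example.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline instead of en_core_web_sm,
# so the sketch runs without a downloaded model.
nlp = spacy.blank("en")

# Straight and curly apostrophes, as in the filter shown earlier.
APOSTROPHES = ("'", "’", "‘")

# Start from the language defaults and drop every exception whose key
# contains an apostrophe.
rules = {
    key: value
    for key, value in nlp.Defaults.tokenizer_exceptions.items()
    if not any(mark in key for mark in APOSTROPHES)
}

# Rebuild the tokenizer, reusing the existing settings so tokenization
# stays the same in all other cases.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)

tokens = [token.text for token in nlp("I can't do this")]
print(tokens)
```

Newer spaCy releases accept further tokenizer settings (for example `url_match`), so on those versions it is worth checking which additional arguments need to be passed through to keep the default behavior.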


Posted by huangapple on January 3, 2020 at 20:57:21
Link to this article: https://go.coder-hub.com/59579049.html