How to tell spaCy not to split words with apostrophes using the retokenizer?
Question
I'm stuck on a problem here. I want to use spaCy's word tokenizer, but with one constraint: the tokenizer must not split words that contain an apostrophe (').
Example:
Input string:    "I can't do this"
Current output:  ["I", "ca", "n't", "do", "this"]
Expected output: ["I", "can't", "do", "this"]
My attempt:
import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I can't do this"
doc = nlp(sent)

# indices of tokens that contain an apostrophe (skipping the first token)
positions = [token.i for token in doc if token.i != 0 and "'" in token.text]

with doc.retokenize() as retokenizer:
    for pos in positions:
        # merge each apostrophe token with the token before it
        retokenizer.merge(doc[pos - 1:pos + 1])

for token in doc:
    print(token.text)
This gives me the expected output, but I don't know whether this approach is right. Is there a better way to do the retokenization?
Answer 1
Score: 6
The retokenizer approach works, but the simpler way is to modify the tokenizer so it doesn't split these words in the first place. The contractions with apostrophes that are split like this (don't, can't, I'm, you'll, etc.) are handled by tokenizer exceptions.
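You can check that such a contraction really is handled by an exception entry (a quick sketch; the lookup assumes the default English exception table and a spaCy version new enough to expose the rules property):

import spacy

nlp = spacy.load("en_core_web_sm")
# "can't" is a key in the exception table, which is why the default
# tokenizer splits it into "ca" + "n't"
print("can't" in nlp.tokenizer.rules)  # True with the default exceptions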
With spaCy v2.2.3, you can inspect and set tokenizer exceptions with the property nlp.tokenizer.rules. To remove the exceptions with any kind of apostrophe:
import spacy

nlp = spacy.load('en_core_web_sm')
# drop every exception whose key contains a straight or curly apostrophe
nlp.tokenizer.rules = {key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key and "’" not in key and "‘" not in key}
assert [t.text for t in nlp("can't")] == ["can't"]
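With those exceptions removed, the full sentence from the question should come out as hoped (a quick check using the same pipeline as above):

print([t.text for t in nlp("I can't do this")])
# ['I', "can't", 'do', 'this']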
Be aware that the default models (tagger, parser, NER) provided by spaCy for English won't work as well on texts with this tokenization because they're trained on data with the contractions split.
With older versions of spaCy, you'll have to create a custom tokenizer and pass in a modified rules= after modifying nlp.Defaults.tokenizer_exceptions. Use all the other existing settings (nlp.tokenizer.prefix_search / suffix_search / infix_finditer / token_match) to keep the existing tokenization in all other cases, as in the sketch below.
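A minimal sketch of such a custom tokenizer, assuming a spaCy v2.x release older than v2.2.3 (the filtering mirrors the dict comprehension above):

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')

# filter the default exceptions the same way as above
rules = {
    key: value
    for key, value in nlp.Defaults.tokenizer_exceptions.items()
    if "'" not in key and "’" not in key and "‘" not in key
}

# rebuild the tokenizer with the filtered rules, reusing all the other
# existing settings so every other case tokenizes exactly as before
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)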