Adding multiple special cases for Spacy tokenizer
Question
I am trying to segment text in a txt file (utf-8) into sentences using spaCy. It segments sentences containing abbreviations (e.g., Mr., Dr., etc.) into separate sentences when they are meant to be read as a single sentence. For example, 'Dr. Jane Doe says' becomes
Sentence 0: Dr.
Sentence 1: Jane Doe says
I tried to use nlp.tokenizer.add_special_case to recognize Dr. as a special case, and it works for one case (code below). But because there are many abbreviations in the rest of the dataset, I would like to have a list of abbreviations (preferably read from a text file, but a plain list is fine too!) so that everything on the list is added as a special case.
This is my code:
import spacy
import pathlib
from spacy.attrs import ORTH, NORM
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer.add_special_case('Dr.', [{ORTH: 'Dr.', NORM: 'Doctor'}])
file_name = r"text_test_sentence.txt" #filename of textfile to split
doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
sentences = list(doc.sents)
Thank you in advance!!!
Answer 1
Score: 0
If you would like to add multiple rules to your tokenizer, then I would suggest writing a for loop over a list that stores all the abbreviations you would like to add as special cases.
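Here is a minimal sketch of that approach. It assumes a hypothetical abbreviations.txt file with one abbreviation per line (e.g. Dr., Mr., Prof.); only ORTH is set for each special case, since the ORTH values of a special case must concatenate back to the original string.

import pathlib

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')

# Hypothetical file with one abbreviation per line, e.g. "Dr.", "Mr.", "Prof."
abbreviations = [
    line.strip()
    for line in pathlib.Path('abbreviations.txt').read_text(encoding='utf-8').splitlines()
    if line.strip()
]

for abbr in abbreviations:
    # Keep each abbreviation as a single token; the ORTH value must match
    # the original string exactly, so the abbreviation is not split off.
    nlp.tokenizer.add_special_case(abbr, [{ORTH: abbr}])

doc = nlp(pathlib.Path('text_test_sentence.txt').read_text(encoding='utf-8'))
sentences = list(doc.sents)

If you also want a normalized form (as with Doctor in the question's code), you could store a pair per line and pass NORM alongside ORTH for each entry.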
Comments