How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?
Question
When using corpus_reshape, some periods are mistakenly treated as sentence breaks. I have a corpus from the pharmaceutical industry, and in many cases "Dr." is mistakenly treated as a sentence break.
This post (https://stackoverflow.com/questions/62691994/quantedas-corpus-reshape-function-how-not-to-break-sentences-after-abbreviatio) is similar but unfortunately does not solve the problem. Here is an example:
library("quanteda")
txt <- c(
d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
d2 = "The U.S. is south of Canada."
)
# nested call, so the example does not depend on the magrittr pipe being attached
corpus_reshape(corpus(txt), to = "sentences")
> Corpus consisting of 4 documents.
> d1.1 :
> "With us we have Dr."
>
> d1.2 :
> "Smith."
>
> d1.3 :
> "We are not sure... where we stand."
>
> d2.1 :
> "The U.S. is south of Canada."
It handles "Dr." correctly in only a few cases. I was wondering whether specific words can be passed to the function as exceptions, because I would like to avoid switching to a different function for splitting the text into sentences. Thanks!
Answer 1
Score: 0
Use corpus_segment with pattern and valuetype = "regex".
You can find examples here:
https://quanteda.io/reference/corpus_segment.html
You may also use the use_docvars option.
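A minimal sketch of how that suggestion could look for the example in the question. The regular expression is an assumption of mine, not something from the quanteda documentation: it treats ".", "!" or "?" followed by whitespace and an uppercase letter as a sentence break, unless the punctuation directly follows "Dr". Further abbreviations would need their own lookbehind, and the rule will still mis-split an abbreviation that is followed by a capitalized word but is not listed.

```r
library("quanteda")

txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Assumed break rule: sentence-final punctuation, then whitespace, then an
# uppercase letter, but not when the punctuation comes right after "Dr".
# Add more lookbehinds, e.g. (?<!Mr)(?<!Prof), for other abbreviations.
break_pattern <- "(?<!Dr)[.!?]\\s+(?=[A-Z])"

sents <- corpus_segment(
  corpus(txt),
  pattern = break_pattern,
  valuetype = "regex",
  case_insensitive = FALSE,   # the default TRUE would let [A-Z] match lowercase letters
  extract_pattern = FALSE,    # keep the punctuation inside the sentence text
  pattern_position = "after", # the match ends a segment rather than starting one
  use_docvars = TRUE          # repeat document variables for every segment
)

print(sents)
ndoc(sents)
```

Setting case_insensitive = FALSE matters here, because otherwise "U.S. is" would also be split. With use_docvars = TRUE, any document variables on the original corpus are carried over to each resulting segment.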