How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?

Question

When using corpus_reshape, some periods are mistakenly treated as sentence breaks. I have a corpus from the pharmaceutical industry, and in many cases the period in "Dr." is mistakenly treated as the end of a sentence.
This post (https://stackoverflow.com/questions/62691994/quantedas-corpus-reshape-function-how-not-to-break-sentences-after-abbreviatio) is similar but unfortunately does not solve the problem. Here is an example:


    library("quanteda")
    
    txt <- c(
      d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
      d2 = "The U.S. is south of Canada."
    )
    corpus(txt) %>%
      corpus_reshape(to = "sentences")

> Corpus consisting of 4 documents.
> d1.1 :
> "With us we have Dr."
>
> d1.2 :
> "Smith."
>
> d1.3 :
> "We are not sure... where we stand."
>
> d2.1 :
> "The U.S. is south of Canada."

The splitting handles "Dr." correctly in only a few cases. Is there a way to give the function a list of words whose trailing period should not be treated as a sentence break? I would like to avoid switching to a different function for splitting the text into sentences. Thanks!

Answer 1

Score: 0

Use corpus_segment with the pattern argument and valuetype = "regex".

You can find examples here:

https://quanteda.io/reference/corpus_segment.html

You can also use the use_docvars option.
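
As a rough sketch of what this suggestion could look like for the example in the question: the regular expression below is an assumption, not something taken from the quanteda documentation. It treats a run of `.`, `!` or `?` followed by whitespace and an uppercase letter as a sentence boundary, unless the punctuation directly follows one of a few listed titles. The title list (Dr, Mr, Mrs, Prof) is purely illustrative and would need to be extended for a real corpus.

```r
library(quanteda)

txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Assumed boundary rule: sentence-final punctuation, then whitespace, then an
# uppercase letter, but not when the punctuation directly follows a listed
# title. Each (?<!...) is a negative lookbehind for one illustrative title.
boundary <- "(?<!Dr)(?<!Mr)(?<!Mrs)(?<!Prof)[.!?]+\\s+(?=\\p{Lu})"

sents <- corpus_segment(
  corpus(txt),
  pattern = boundary,
  valuetype = "regex",
  case_insensitive = FALSE,   # otherwise the uppercase test loses its meaning
  extract_pattern = FALSE,    # leave the punctuation in the segment text
  pattern_position = "after"  # cut after the matched punctuation
)

print(sents)
```

Because the rule relies on an uppercase letter after the whitespace, a phrase such as "U.S. Army" would still be split unless "S" is handled by an additional lookbehind, so this is a heuristic rather than a full replacement for the sentence segmenter behind corpus_reshape. use_docvars defaults to TRUE, so document variables of the original corpus should carry over to the segments.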

Posted by huangapple on February 16, 2023 at 18:26:48
Source: https://go.coder-hub.com/75470895.html