February 16, 2023, 18:26:48

How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?

Question

In some cases, corpus_reshape mistakenly treats certain periods as sentence breaks. I have a corpus from the pharmaceutical industry, and in many cases the period in "Dr." is mistakenly treated as a sentence break.
This post (https://stackoverflow.com/questions/62691994/quantedas-corpus-reshape-function-how-not-to-break-sentences-after-abbreviatio) is similar but unfortunately does not solve the problem. Here is an example:


    library("quanteda")

    txt <- c(
      d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
      d2 = "The U.S. is south of Canada."
    )
    corpus(txt) %>%
      corpus_reshape(to = "sentences")

> Corpus consisting of 4 documents.
> d1.1 :
> "With us we have Dr."
>
> d1.2 :
> "Smith."
>
> d1.3 :
> "We are not sure... where we stand."
>
> d2.1 :
> "The U.S. is south of Canada."

The split is handled correctly in only a few cases with "Dr.". Can specific words be passed to the function as exceptions? I would like to avoid switching to a different function for splitting the text into sentences. Thanks!

Answer 1

Score: 0

Use corpus_segment with a pattern and valuetype = "regex".

You can find examples here:

https://quanteda.io/reference/corpus_segment.html

You may also find the use_docvars option useful.
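As a rough sketch of what this could look like for the example in the question: the regex below is my own assumption, not something taken from the quanteda documentation. It splits after `.`, `!` or `?` followed by whitespace, but only when the punctuation does not directly follow a listed abbreviation and the next character is an uppercase letter. The `abbrevs` vector is a hypothetical list that you would adapt to your corpus.

```r
library("quanteda")

txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Hypothetical list of abbreviations that should never end a sentence.
abbrevs <- c("Dr", "Prof", "Mr", "Mrs")

# One negative lookbehind per abbreviation: (?<!\bDr)(?<!\bProf)...
not_after_abbrev <- paste0("(?<!\\b", abbrevs, ")", collapse = "")

# Sentence boundary: end punctuation plus whitespace, not preceded by a
# listed abbreviation and followed by an uppercase letter.
boundary <- paste0(not_after_abbrev, "[.!?]\\s+(?=\\p{Lu})")

sentences <- corpus_segment(
  corpus(txt),
  pattern = boundary,
  valuetype = "regex",
  case_insensitive = FALSE,   # keep \p{Lu} and the abbreviations case-sensitive
  extract_pattern = FALSE,    # leave the punctuation in the text
  pattern_position = "after"  # the matched delimiter closes a segment
)

# Trim any whitespace left over from the delimiter before inspecting.
print(trimws(as.character(sentences)))
```

With this pattern, "Dr. Smith." should stay within a single segment. The uppercase lookahead is a heuristic: a sentence that starts with a lowercase letter or a quotation mark will not be split off, and an abbreviation that is missing from `abbrevs` and is followed by a capitalized word will still cause a break.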
