February 23, 2023, 21:52:44

Keeping the apostrophe using the textcnt function from the tau package in R

Question

The textcnt function in R's tau package has a split argument whose default value is split = "[[:space:][:punct:][:digit:]]+". This argument also splits words on the apostrophe ', which I don't want. How can I modify the argument so that it doesn't split words on the apostrophe?

This code:

library(tau)
text <- "I don't want the function to use the ' to split"

# split is set to its default value here; [:punct:] includes the apostrophe
textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)

produces this output:

 don function        i    split        t      the       to      use     want 
   1        1        1        1        1        2        2        1        1

Instead of having don 1 and t 1, I would like to keep don't as one word.

I have tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the punct part of the split argument in textcnt, but then it no longer splits on symbols such as &, > or ". I have also tried modifying the split argument, but then it either doesn't split the sentence at all or it keeps the symbols.

Thank you

Answer 1

Score: 0

With PCRE-based functions you need to use

split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

Here,

(?: - start of a non-capturing group:
(?!') - fail the match if the next char is a ' char
[[:space:][:punct:][:digit:]] - match a whitespace, punctuation or digit char
)+ - end of the group, matched one or more times (consecutively)
| - or
'\B - a ' char that is followed by either the end of the string or a non-word char
| - or
\B' - a ' char that is preceded by either the start of the string or a non-word char.
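
To check the pattern without tau, here is a minimal sketch that uses base R's strsplit() with perl = TRUE as a stand-in for textcnt. It assumes textcnt hands split to a PCRE-based splitter in much the same way; the names tokens and split_pcre are illustrative only.

```r
# Base R only: strsplit() with perl = TRUE uses the PCRE engine.
text <- "I don't want the function to use the ' to split"

# Same pattern as above: split on whitespace/punctuation/digits except ',
# plus any ' that is not between two word characters.
split_pcre <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

tokens <- strsplit(tolower(text), split_pcre, perl = TRUE)[[1]]

# Adjacent separator matches (space, lone apostrophe, space) leave empty
# strings behind, so drop them before counting.
tokens <- tokens[nzchar(tokens)]

table(tokens)
```

If this behaves as expected, the same string can be tried as the split argument of textcnt(text, split = split_pcre, method = "string", n = 1L); how textcnt treats empty tokens is not verified here.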

With stringr functions, you can use

split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"

Here, [[:space:][:punct:][:digit:]--[']] matches all characters matched by [[:space:][:punct:][:digit:]] except the ' chars.

The ICU regex flavor used by stringr supports character class subtraction using this notation.
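
As a rough illustration of the ICU variant, the sketch below uses stringr::str_split and then counts tokens with table(). It assumes stringr is installed; tokens and split_icu are illustrative names. ICU's [:punct:] may not cover exactly the same characters as the PCRE class, so it is worth checking symbols such as & or > on your own data.

```r
library(stringr)

text <- "I don't want the function to use the ' to split"

# ICU pattern with character class subtraction: everything in the three
# POSIX classes except ', plus any ' not between two word characters.
split_icu <- "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"

tokens <- str_split(str_to_lower(text), split_icu)[[1]]

# Drop the empty strings produced by adjacent separator matches.
tokens <- tokens[tokens != ""]

table(tokens)
```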
