Keeping the apostrophe when using the textcnt function from the tau package in R

Question

The textcnt function in R's tau package has a split argument whose default value is
split = "[[:space:][:punct:][:digit:]]+". This pattern also splits words at the apostrophe ', which I don't want. How can I modify the argument so that it doesn't split words at the apostrophe?

This code:

library(tau)
text <- "I don't want the function to use the ' to split"

textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)

produces this output:

 don function        i    split        t      the       to      use     want 
   1        1        1        1        1        2        2        1        1 

Instead of getting don 1 and t 1, I would like to keep don't as one word.

I tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the [:punct:] part of the split argument in textcnt, but then symbols such as &, > or " are no longer used to split. I also tried modifying the split argument, but then it either doesn't split the sentence at all or keeps the symbols.

Thank you

Answer 1

Score: 0

With PCRE-based functions, you need to use

split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

Here,

  • (?: - start of a non-capturing group:
    • (?!') - fail the match if the next char is a ' char
    • [[:space:][:punct:][:digit:]] - match a whitespace, punctuation or digit char
  • )+ - end of the group, which is matched one or more times (consecutively)
  • | - or
  • '\B - a ' char that is followed by either the end of the string or a non-word char
  • | - or
  • \B' - a ' char that is preceded by either the start of the string or a non-word char.
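
To see what this pattern treats as a separator without involving tau, here is a minimal sketch in base R. It assumes that strsplit with perl = TRUE is a fair stand-in for a PCRE-based splitter; the variable names pattern, tokens and counts are only illustrative, and this is not how textcnt tokenizes internally.

```r
# Separator pattern from the answer: split on whitespace, punctuation and
# digits, but treat an apostrophe as a separator only when it is not
# sitting between two word characters.
pattern <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

text <- "I don't want the function to use the ' to split"

# perl = TRUE selects the PCRE engine, which the lookahead (?!') requires.
tokens <- strsplit(text, pattern, perl = TRUE)[[1]]

# Adjacent separators (space, lone apostrophe, space) leave empty strings
# in strsplit() output, so drop them before counting.
tokens <- tokens[nzchar(tokens)]

# Lower-case the tokens and count them.
counts <- table(tolower(tokens))
print(counts)
```

In this sketch don't survives as a single token, while the standalone ' is consumed as a separator. If that matches what you need, the same pattern string can be passed as the split argument of textcnt.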

With stringr functions, you can use

split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"

Here, [[:space:][:punct:][:digit:]--[']] matches all characters matched by [[:space:][:punct:][:digit:]] except the ' char.

The ICU regex flavor used by stringr supports character class subtraction using this notation.

huangapple
  • Posted on February 23, 2023, 21:52:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75545725.html