keeping the apostrophe using the textcnt function from the tau package in R
Question
The textcnt function in R's tau package has a split argument whose default value is
split = "[[:space:][:punct:][:digit:]]+"
This default also splits words at the apostrophe ('), which I don't want. How can I modify the argument so that it doesn't split words at the apostrophe?
This code:
```r
library(tau)
text <- "I don't want the function to use the ' to split"
textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)
```
produces this output:
don function i split t the to use want
1 1 1 1 1 2 2 1 1
Instead of getting don 1 and t 1, I would like to keep don't as one word.
I have tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the punct part of the split argument in textcnt, but then symbols such as &, > or " are no longer used to split. I have also tried modifying the split argument, but then it either doesn't split the sentence at all or it keeps the symbols.
Thank you
Answer 1
Score: 0
With PCRE-based functions, you need to use
split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"
Here,
(?: - start of a non-capturing group
(?!') - fail the match if the next char is a ' char
[[:space:][:punct:][:digit:]] - matches a whitespace, punctuation or digit char
)+ - end of the group; match it one or more times (consecutively)
'\B (written '\\B in the R string) - a ' char that is followed by either the end of string or a non-word char
| - or
\B' (written \\B' in the R string) - a ' char that is preceded by either the start of string or a non-word char.
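To see how this pattern tokenizes the example sentence, here is a minimal sketch that uses base R's strsplit with perl = TRUE rather than textcnt itself. It assumes textcnt applies split as a PCRE pattern in a comparable way. The variable name pcre_split, the tolower call, and the removal of empty strings are illustrative additions, not part of tau.

```r
# Sketch: check the PCRE split pattern with base R only (tau is not needed here).
text <- "I don't want the function to use the ' to split"
pcre_split <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

# perl = TRUE is needed for the (?!') lookahead.
# tolower() is an assumption that mimics the lowercase tokens in the question's output.
tokens <- strsplit(tolower(text), pcre_split, perl = TRUE)[[1]]

# Adjacent separators (space, lone apostrophe, space) leave empty strings behind.
tokens <- tokens[nzchar(tokens)]

# Count each token; "don't" stays together as a single word.
print(table(tokens))
```

With tau installed, the same pattern can be passed to the original call as textcnt(text, split = pcre_split, method = "string", n = 1L).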
With stringr functions, you can use
split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"
Here, [[:space:][:punct:][:digit:]--[']] matches all characters matched by [[:space:][:punct:][:digit:]] except the ' char.
The ICU regex flavor used by stringr supports character class subtraction using this notation.
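The ICU variant can be tried in the same way with stringr::str_split. This is only a sketch: it assumes the stringr package is installed (the code is skipped otherwise), and the variable name icu_split, the tolower call, and the empty-string filter are illustrative additions.

```r
# Sketch: try the ICU pattern with stringr, if the package is available.
text <- "I don't want the function to use the ' to split"
icu_split <- "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"

if (requireNamespace("stringr", quietly = TRUE)) {
  # str_split() uses the ICU regex engine, which understands the -- set difference.
  tokens <- stringr::str_split(tolower(text), icu_split)[[1]]

  # Drop the empty strings produced by adjacent separators.
  tokens <- tokens[nzchar(tokens)]

  print(table(tokens))
}
```

One caveat worth checking on your own data: in ICU, [:punct:] follows the Unicode punctuation category, so symbol characters such as > may not be treated as separators the way they are with the PCRE pattern.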