R - Number of occurrences of a string in a column of Excel

Question

I'm using the stringr library to count the number of occurrences of an array of strings in a column in Excel.

string.arr <- c(
    "I can't handle this.",
    "I shouldn't be this stressed out."
    # ... more possible strings ...
)

Sample data:

1 col_name

2 “I’m never going to succeed.”,“The professor will be disappointed in me.”,“Other students won’t want to work with me.”,“I shouldn't be this stressed out.",“Other people can handle this situation - what's wrong with me?"
3 “Everyone will think I am dumb.”,“People will make jokes about me if I get the wrong answer.”,“I shouldn't be this stressed out.",“Other people can handle this situation - what's wrong with me?"
4 ... more such rows ...

As you can see from the sample data, two kinds of apostrophes are used: ' and ’. However, in R, I'm only able to use ' while creating string.arr. Consequently, the code (below) does not count the strings that have ’ in them.

library(stringr)

# A for loop does not auto-print, so print each count explicitly
for (string in string.arr) {
  print(sum(str_count(deidentified_data_text_df$col_name, string), na.rm = TRUE))
}

It's not feasible to modify the data. Can I solve this in the code so that both ' and ’ in the data are detected by ' in the code?

I'm open to using any other package in R.

Answer 1

Score: 1

EDIT:

If string.arr is essentially a list of keywords (or sentences) that you want to match in larger text, and the problem is that the larger text may contain two kinds of apostrophes, then you might simply replace every apostrophe in string.arr with a regex alternation group:

string.arr <- gsub("’|'", "(’|')", string.arr)

Result:

string.arr
[1] "I can(’|')t handle this."
[2] "They won(’|')t handle this"
[3] "I shouldn(’|')t be this stressed out."
[4] "no apostrophe"

Data:

string.arr <- c(
  "I can’t handle this.",                  # bent apostrophe
  "They won't handle this",                # straight apostrophe
  "I shouldn't be this stressed out.",     # straight apostrophe
  "no apostrophe"                          # no apostrophe
)
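
To show how the rewritten patterns might then be used for counting, here is a minimal base-R sketch. It is not part of the original answer: the sample column `col_name`, the helper names `escape_regex` and `count_matches`, and the extra step of escaping regex metacharacters (so that "." or "?" in a sentence match literally) are all assumptions made for illustration. With stringr loaded, `str_count` should be usable on the same patterns in place of `count_matches`.

```r
# Hypothetical stand-in for deidentified_data_text_df$col_name
# ("\u2019" is the curly apostrophe)
col_name <- c(
  "I shouldn\u2019t be this stressed out., I can't handle this.",
  "I shouldn't be this stressed out.",
  NA
)

string.arr <- c(
  "I can't handle this.",
  "I shouldn't be this stressed out."
)

# Assumed helper: backslash-escape regex metacharacters so the
# sentences are matched literally
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Escape first, then swap each apostrophe for an alternation group
patterns <- gsub("\u2019|'", "(\u2019|')", escape_regex(string.arr))

# Assumed helper: total number of matches of one pattern across a
# character vector, treating NA entries as zero matches
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text)
  sum(vapply(hits, function(h) sum(h > 0, na.rm = TRUE), integer(1)))
}

counts <- vapply(patterns, count_matches, integer(1), text = col_name)
names(counts) <- string.arr
print(counts)
# In this toy data: 1 for the first string, 2 for the second
```

Escaping has to happen before the apostrophes are replaced, otherwise the parentheses and "|" of the alternation group would be escaped as well.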

huangapple
  • Published on February 7, 2023, 01:39:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75364738.html