Count subset of words occurrences in R?

Question

In R, suppose I have a list of strings like the following:

str_list <- list("corn is food", "corn is good")

If I want to count how many times the words in some subset, such as "corn" and "food", occur in each element, is there a way to do this? For example, based on str_list, I would want a vector [2, 1] that counts food (1x) and corn (1x) in the first element, and corn (1x) in the second element. I do NOT want to count just a single word like "corn", which can be done with the stringr::str_count() function.

Answer 1

Score: 4

You could solve your problem as follows:

colSums(sapply(words, stringi::stri_count_fixed, str=str_list))
# corn food 
#    2    1

# or
stringi::stri_count_fixed(paste0(str_list, collapse=" "), words)
# [1] 2 1

Data
str_list <- list("corn is food", "corn is good")
words <- c("corn", "food")
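
One caveat with fixed substring matching is that "corn" would also be counted inside a longer word such as "popcorn". The sketch below is not part of the original answer: it uses base R only, the input string "popcorn is food" and the helper count_matches() are hypothetical, and it assumes the search words contain no regex metacharacters (they would need escaping before being wrapped in \\b).

```r
# Hypothetical input: "popcorn" contains "corn" as a substring.
strs  <- c("popcorn is food", "corn is good")
words <- c("corn", "food")

# Hypothetical helper: total number of matches of `pattern` across all of `x`.
# Extra arguments (fixed = TRUE, perl = TRUE, ...) are passed to gregexpr().
count_matches <- function(pattern, x, ...) {
  sum(lengths(regmatches(x, gregexpr(pattern, x, ...))))
}

# Fixed substring matching: "corn" is also found inside "popcorn".
substring_counts <- sapply(words, count_matches, x = strs, fixed = TRUE)

# Whole-word matching: wrap each word in \\b word boundaries.
whole_word_counts <- sapply(words, function(w) {
  count_matches(paste0("\\b", w, "\\b"), strs, perl = TRUE)
})

print(substring_counts)   # corn = 2, food = 1
print(whole_word_counts)  # corn = 1, food = 1
```

Which behaviour is wanted depends on the data; for the question's example the two agree.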

Answer 2

Score: 4

With base R, you can use sapply + grep + lengths:

lengths(sapply(words, grep, str_list))

# corn food 
#    2    1

Update

As @Onyambu points out, if a word is repeated in a sentence, grep will not capture the repeat. A revision is made by replacing grep() with gregexpr().

sapply(words, \(x) sum(gregexpr(x, toString(str_list))[[1]] > 0))

An equivalent solution with stringr::str_count():

colSums(sapply(words, stringr::str_count, string = str_list))

Data
str_list <- list("corn is food", "corn is good")
words <- c("corn", "food")
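
To make the point of the update concrete, here is a small base R sketch. It is not part of the original answer, and the input with "corn" repeated in the first string is hypothetical. grep() reports which elements match, so it counts each element at most once, while gregexpr() reports every match position. The sketch passes simplify = FALSE so that sapply() always returns a list for lengths() to work on, and fixed = TRUE so the words are treated as literal text; both are choices made here, not in the answer.

```r
# Hypothetical input: "corn" occurs twice in the first string.
strs  <- c("corn corn is food", "corn is good")
words <- c("corn", "food")

# grep() returns the indices of matching elements, so a repeated word
# inside one element is counted only once.
by_grep <- lengths(sapply(words, grep, strs, simplify = FALSE))

# gregexpr() returns all match positions (or -1 if there is none),
# so repeats within an element are counted.
by_gregexpr <- sapply(words, function(w) {
  sum(gregexpr(w, toString(strs), fixed = TRUE)[[1]] > 0)
})

print(by_grep)      # corn = 2, food = 1
print(by_gregexpr)  # corn = 3, food = 1
```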

Answer 3

Score: 3

If I understood properly what you need, the code below should solve it, although it does still use str_count:

library(stringr)

str_list <- list("corn is food", "corn is good")
word_list <- c("corn", "food")

count_words <- function(string, words) {
  sum(sapply(words, function(word) str_count(string, word)))
}

result <- sapply(str_list, count_words, word_list)

which outputs the required vector:

> print(result)
[1] 2 1
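
Note that this result is one total per element of str_list, whereas a column sum over words gives one total per word. For the question's data both happen to be 2 and 1, so the difference is easy to miss. The base R sketch below is an illustration rather than part of the original answer: the third string "food food food" and the helper count_one() are hypothetical, added only so that the two kinds of totals differ.

```r
# The third string is a hypothetical addition to the question's data.
str_list <- list("corn is food", "corn is good", "food food food")
words    <- c("corn", "food")

# Hypothetical helper: occurrences of one literal word in each string.
count_one <- function(word, string) {
  lengths(regmatches(string, gregexpr(word, string, fixed = TRUE)))
}

# Matrix with one row per string and one column per word.
counts <- sapply(words, count_one, string = unlist(str_list))

per_element <- rowSums(counts)  # one total per string (what the question asks for)
per_word    <- colSums(counts)  # one total per word

print(per_element)  # 2 1 3
print(per_word)     # corn = 2, food = 4
```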

Answer 4

Score: 2

You can try strsplit + table, as shown below:

> table(unlist(strsplit(unlist(str_list), "\\W+")))[word_list]

corn food
   2    1
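
A word that never occurs in the text is not among the table's names, so indexing the table by that word does not give a count of 0. One possible workaround, sketched here as an assumption and not taken from the original answer, is to convert the tokens to a factor whose levels are the words of interest. The word "rice" is hypothetical and is included only to show the zero count.

```r
str_list  <- list("corn is food", "corn is good")
word_list <- c("corn", "food", "rice")  # "rice" is hypothetical and never occurs

# Split every string into word tokens on runs of non-word characters.
tokens <- unlist(strsplit(unlist(str_list), "\\W+"))

# Fixing the factor levels makes table() report a count for every word of
# interest, including 0 for absent ones; tokens outside the levels are dropped.
counts <- table(factor(tokens, levels = word_list))

print(counts)  # corn = 2, food = 1, rice = 0
```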

huangapple
  • Published on July 10, 2023 at 12:23:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76650675.html