2023-07-10 12:23:37

Count subset of words occurrences in R?

Question

In R, suppose I have a list of strings like the following:

str_list <- list("corn is food", "corn is good")

If I want to count the number of times each word in some subset of words, such as "corn" and "food", occurs in each element, is there any way to do this? For example, based on str_list, I would want a vector [2, 1] that counts food (1x) and corn (1x) in the first element, and corn (1x) in the second element. I do NOT want to count just a single word like "corn", which can be done with the stringr::str_count() function.

Answer 1

Score: 4

You could solve your problem as follows:

colSums(sapply(words, stringi::stri_count_fixed, str=str_list))
# corn food 
#    2    1

# or 
stringi::stri_count_fixed(paste0(str_list, collapse=" "), words)
# [1] 2 1

Data

str_list <- list("corn is food", "corn is good")
words <- c("corn", "food")
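
Note that for this data the per-word totals (corn = 2, food = 1) happen to coincide with the per-element totals the question asks for (first string = 2, second string = 1). The two can be told apart by keeping the full count matrix. Here is a minimal base R sketch, assuming fixed (non-regex) substring matching; the helper name `count_fixed` is hypothetical and not part of any package:

```r
str_list <- list("corn is food", "corn is good")
words <- c("corn", "food")

# Hypothetical helper: number of fixed-string matches of `pattern`
# in each element of `x`
count_fixed <- function(pattern, x) {
  matches <- gregexpr(pattern, unlist(x), fixed = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# One row per string, one column per word
counts <- vapply(words, count_fixed, integer(length(str_list)), x = str_list)

per_word    <- colSums(counts)  # occurrences of each word across all strings
per_element <- rowSums(counts)  # occurrences of any listed word within each string
```

With the data above, `per_word` is 2 for corn and 1 for food, while `per_element` is 2 for the first string and 1 for the second. On other inputs the two vectors will generally differ, so it is worth deciding which one is actually needed.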

Answer 2

Score: 4

With base R, you can use sapply + grep + lengths:

lengths(sapply(words, grep, str_list))

# corn food 
#    2    1

Update

As @Onyambu points out, if a word is repeated within a sentence, grep will not capture the repeat. The revised version replaces grep() with gregexpr():

sapply(words, \(x) sum(gregexpr(x, toString(str_list))[[1]] > 0))

An equivalent solution with stringr::str_count():

colSums(sapply(words, stringr::str_count, string = str_list))

Data

str_list <- list("corn is food", "corn is good")
words <- c("corn", "food")
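
To make the difference between the grep() and gregexpr() versions concrete, here is a small sketch on assumed example data (the list `repeated` is made up for illustration and is not from the question). It uses vapply() instead of sapply() so the result type does not depend on how sapply() simplifies its output:

```r
words <- c("corn", "food")
# Assumed example data: "corn" appears twice in the first string
repeated <- c("corn corn is food", "corn is good")

# grep() reports which elements match, so each string counts at most once per word
by_element <- vapply(
  words,
  function(w) length(grep(w, repeated, fixed = TRUE)),
  integer(1)
)

# gregexpr() reports every match position, so repeats within a string are counted
by_match <- vapply(
  words,
  function(w) sum(unlist(gregexpr(w, repeated, fixed = TRUE)) > 0),
  integer(1)
)
```

Here `by_element` gives 2 for corn, while `by_match` gives 3, because the repeated "corn" in the first string is counted twice. Both give 1 for food.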

Answer 3

Score: 3

If I understood correctly what you need, the code below should solve it, although it does use str_count:

library(stringr)

str_list <- list("corn is food", "corn is good")
word_list <- c("corn", "food")

count_words <- function(string, words) {
  sum(sapply(words, function(word) str_count(string, word)))
}

result <- sapply(str_list, count_words, word_list)

This outputs the required vector:

> print(result)
[1] 2 1

Answer 4

Score: 2

You can try strsplit + table, as shown below:

> table(unlist(strsplit(unlist(str_list), "\\W+")))[word_list]

corn food
   2    1
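
Because this approach splits the strings into tokens, it counts whole words only, so "popcorn" would not count as "corn". One thing to watch for is a word in word_list that never occurs: indexing the table by name then yields NA rather than 0. A possible variation, sketched here with "rice" added to the word list purely as an assumed example, is to tabulate a factor restricted to the words of interest:

```r
str_list <- list("corn is food", "corn is good")
# Assumed word list: "rice" is added to show a word that never occurs
word_list <- c("corn", "food", "rice")

# Split on runs of non-word characters to get whole-word tokens
tokens <- unlist(strsplit(unlist(str_list), "\\W+"))

# Indexing the table by name gives NA for a word that never occurs
by_name <- table(tokens)[word_list]

# Tabulating a factor limited to the words of interest gives 0 instead
by_level <- table(factor(tokens, levels = word_list))
```

With this data, `by_level` holds 2 for corn, 1 for food, and 0 for rice, while `by_name` has NA in the third position.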
