2023-03-09 23:39:47

Adding dataframe column with frequency counts for several pre-specified words in R

Question

I have a dataframe of thousands of news articles that looks like this:

id	text	date
1	newyorktimes leaders gather for the un summit in next week to discuss	1980-1-18
2	newyorktimes opinion section what the washingtonpost got wrong about	1980-1-22
3	a journalist for the washingtonpost went missing while on assignment	1980-1-22
4	washingtonpost president carter responds to criticisms on economic decline	1980-1-28
5	newyorktimes opinion section what needs to be down with about the rats	1980-1-29

I want to produce an additional column that has the combined counts for several specific words in the articles themselves. Let's say I want to know how many times "newyorktimes", "washingtonpost", and "the" appear in each article. I would want a separate column added to the dataframe adding the counts for that row. Like this:

id	text	date	wordlistcount
1	newyorktimes leaders gather for the un summit in next week to discuss	1980-1-18	2
2	newyorktimes opinion section what the washingtonpost and newyorktimes got wrong	1980-1-22	4
3	a journalist for the washingtonpost went missing while on assignment	1980-1-22	2
4	washingtonpost president carter responds to criticisms on economic decline	1980-1-28	1
5	newyorktimes opinion section what needs to be done with about the rats	1980-1-29	2

How can I accomplish this? Any help would be greatly appreciated.

Answer 1

Score: 2

In stringr, with str_count:

library(stringr)
library(dplyr)
words = c("newyorktimes", "washingtonpost", "the")
df %>%
  mutate(wordlistcount = str_count(text, str_c("\\b", words, "\\b", collapse = "|")))

#   id                                                                       text      date wordlistcount
# 1  1      newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18             2
# 2  2       newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22             3
# 3  3       a journalist for the washingtonpost went missing while on assignment 1980-1-22             2
# 4  4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28             1
# 5  5     newyorktimes opinion section what needs to be down with about the rats 1980-1-29             2
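
To see what the `\\b` word boundaries contribute, here is a minimal, self-contained sketch. The data frame `df` and the column names `loose_count` and `strict_count` are made up for illustration (the answer above does not define `df`); only `str_c` and `str_count` come from stringr.

```r
library(stringr)

# Toy data mirroring the question's input (assumed to be a plain data.frame).
df <- data.frame(
  id = 1:3,
  text = c(
    "newyorktimes leaders gather for the un summit in next week to discuss",
    "newyorktimes opinion section what the washingtonpost got wrong about",
    "washingtonpost president carter responds to criticisms on economic decline"
  ),
  stringsAsFactors = FALSE
)

words <- c("newyorktimes", "washingtonpost", "the")

# Without boundaries, "the" also matches inside "gather".
loose <- str_c(words, collapse = "|")
# With \\b on both sides of each word, only whole words match.
strict <- str_c("\\b", words, "\\b", collapse = "|")

df$loose_count  <- str_count(df$text, loose)
df$strict_count <- str_count(df$text, strict)

print(df[, c("id", "loose_count", "strict_count")])
```

For the first row the unbounded pattern counts 3 matches (it picks up the "the" inside "gather"), while the bounded pattern counts 2, which is the number the question asks for.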

Answer 2

Score: 1

Regex searches can be a bit tricky. In your case "the" is a word, but it can also be part of other words (like "gather" in the first line of your dummy data). To make sure you count only the standalone word, you can search for "the" while requiring that the characters immediately before and after it are anything but a letter.

library(dplyr)


mydf <- data.table::fread("id 	text 	date
    1 	newyorktimes leaders gather for the un summit in next week to discuss 	1980-1-18
    2 	newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 	1980-1-22
    3 	a journalist for the washingtonpost went missing while on assignment 	1980-1-22
    4 	washingtonpost president carter responds to criticisms on economic decline 	1980-1-28
    5 	newyorktimes opinion section what needs to be down with about the rats 	1980-1-29")

# vector of search words where [^\\p{L}] is anything but a letter from any alphabet
search_vec <- c("newyorktimes", "washingtonpost", "[^\\p{L}]the[^\\p{L}]")

mydf %>%
    dplyr::mutate(wordlistcount = stringr::str_count(text, pattern = paste(search_vec, collapse = "|")))

   id                                                                            text       date wordlistcount
1:  1           newyorktimes leaders gather for the un summit in next week to discuss 1980-01-18             2
2:  2 newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 1980-01-22             4
3:  3            a journalist for the washingtonpost went missing while on assignment 1980-01-22             2
4:  4      washingtonpost president carter responds to criticisms on economic decline 1980-01-28             1
5:  5          newyorktimes opinion section what needs to be down with about the rats 1980-01-29             2
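
One caveat with `[^\\p{L}]the[^\\p{L}]`: it needs an actual non-letter character on each side and consumes it, so a "the" at the very start or end of a text, or directly after another matched "the", is not counted. The sketch below compares it with a lookaround variant. This variant is a suggestion, not part of the original answer, and the vector `x` is invented test input.

```r
library(stringr)

# Edge cases: "the" at the very start, at the very end, and repeated back to back.
x <- c("the summit", "we saw the", "the the the")

# Pattern from the answer: requires a non-letter character on both sides of "the".
consuming <- "[^\\p{L}]the[^\\p{L}]"
# Lookaround variant: asserts there is no letter on either side without consuming anything.
lookaround <- "(?<!\\p{L})the(?!\\p{L})"

consuming_counts  <- str_count(x, consuming)
lookaround_counts <- str_count(x, lookaround)

print(consuming_counts)   # 0 0 1
print(lookaround_counts)  # 1 1 3
```

On the dummy data both patterns give the same counts, because every "the" there sits between spaces. The difference only shows up on inputs like the ones in `x`.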

Your data looks OK, but I will point out anyway that, depending on your use case, you might want to convert all text to lower case, either before calling str_count or inside the call. This ensures that differences in upper and lower case do not interfere with the string matching (i.e. "the" != "The"). Converting all text to upper case and writing the search words in upper case is equivalent.
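
A short sketch of the two usual ways to handle case, assuming stringr is available. The vector `x` is invented test input. `str_to_lower` and `regex(..., ignore_case = TRUE)` are stringr functions, and either approach should give the same counts for plain ASCII text like this.

```r
library(stringr)

x <- c("The summit and the press", "THE END")
pattern <- "\\bthe\\b"

# Case-sensitive: "The" and "THE" are missed.
sensitive <- str_count(x, pattern)

# Option 1: lower-case the text before counting.
lowered <- str_count(str_to_lower(x), pattern)

# Option 2: keep the text as-is and match case-insensitively.
insensitive <- str_count(x, regex(pattern, ignore_case = TRUE))

print(sensitive)    # 1 0
print(lowered)      # 2 1
print(insensitive)  # 2 1
```

Option 2 leaves the text column untouched, which may be preferable if the original casing is needed later.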
