Adding dataframe column with frequency counts for several pre-specified words in R

Question

I have a dataframe of thousands of news articles that looks like this:

id  text                                                                        date
1   newyorktimes leaders gather for the un summit in next week to discuss       1980-1-18
2   newyorktimes opinion section what the washingtonpost got wrong about        1980-1-22
3   a journalist for the washingtonpost went missing while on assignment        1980-1-22
4   washingtonpost president carter responds to criticisms on economic decline  1980-1-28
5   newyorktimes opinion section what needs to be down with about the rats      1980-1-29

I want to produce an additional column that has the combined counts for several specific words in the articles themselves. Let's say I want to know how many times "newyorktimes", "washingtonpost", and "the" appear in each article. I would want a separate column added to the dataframe adding the counts for that row. Like this:

id  text                                                                             date       wordlistcount
1   newyorktimes leaders gather for the un summit in next week to discuss            1980-1-18  2
2   newyorktimes opinion section what the washingtonpost and newyorktimes got wrong  1980-1-22  4
3   a journalist for the washingtonpost went missing while on assignment             1980-1-22  2
4   washingtonpost president carter responds to criticisms on economic decline       1980-1-28  1
5   newyorktimes opinion section what needs to be done with about the rats           1980-1-29  2

How can I accomplish this? Any help would be greatly appreciated.

Answer 1

Score: 2

In stringr, with str_count:

library(stringr)
library(dplyr)
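# Words to count; \\b anchors each one at word boundaries,
# so "the" is not counted inside "gather"
words <- c("newyorktimes", "washingtonpost", "the")
df %>%
  mutate(wordlistcount = str_count(text, str_c("\\b", words, "\\b", collapse = "|")))
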
#   id                                                                       text      date wordlistcount
# 1  1      newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18             2
# 2  2       newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22             3
# 3  3       a journalist for the washingtonpost went missing while on assignment 1980-1-22             2
# 4  4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28             1
# 5  5     newyorktimes opinion section what needs to be down with about the rats 1980-1-29             2
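
For anyone who would rather not load extra packages, the same count can be sketched in base R. This is only a minimal sketch and not part of the answer above: the three-row df is a stand-in built from rows 1, 2 and 4 of the sample data, and the pattern variable name is my own.

```r
# Stand-in data: rows 1, 2 and 4 of the sample from the question
df <- data.frame(
  id = c(1, 2, 4),
  text = c(
    "newyorktimes leaders gather for the un summit in next week to discuss",
    "newyorktimes opinion section what the washingtonpost got wrong about",
    "washingtonpost president carter responds to criticisms on economic decline"
  ),
  stringsAsFactors = FALSE
)

words <- c("newyorktimes", "washingtonpost", "the")

# One alternation wrapped in word boundaries: \b(word1|word2|word3)\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# gregexpr() locates every match per row, regmatches() extracts them,
# and lengths() counts the matches found in each row (0 when there are none)
df$wordlistcount <- lengths(regmatches(df$text, gregexpr(pattern, df$text, perl = TRUE)))

print(df$wordlistcount)
```

With this stand-in data the counts come out as 2, 3 and 1, matching the stringr output for the same rows.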

Answer 2

Score: 1

Searching with regular expressions can be a bit tricky. In your case "the" is a word, but it can also be part of other words (like "gather" in the first line of your dummy data). To be sure you count only the standalone word, you can search for "the" while requiring that the characters immediately before and after it are anything but a letter.

library(dplyr)


mydf <- data.table::fread("id 	text 	date
    1 	newyorktimes leaders gather for the un summit in next week to discuss 	1980-1-18
    2 	newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 	1980-1-22
    3 	a journalist for the washingtonpost went missing while on assignment 	1980-1-22
    4 	washingtonpost president carter responds to criticisms on economic decline 	1980-1-28
    5 	newyorktimes opinion section what needs to be down with about the rats 	1980-1-29")

# vector of search words where [^\\p{L}] is anything but a letter from any alphabet
search_vec <- c("newyorktimes", "washingtonpost", "[^\\p{L}]the[^\\p{L}]")

mydf %>%
    dplyr::mutate(wordlistcount = stringr::str_count(text, pattern = paste(search_vec, collapse = "|")))

   id                                                                            text       date wordlistcount
1:  1           newyorktimes leaders gather for the un summit in next week to discuss 1980-01-18             2
2:  2 newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 1980-01-22             4
3:  3            a journalist for the washingtonpost went missing while on assignment 1980-01-22             2
4:  4      washingtonpost president carter responds to criticisms on economic decline 1980-01-28             1
5:  5          newyorktimes opinion section what needs to be down with about the rats 1980-01-29             2
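
One edge case may be worth keeping in mind: [^\\p{L}] has to consume an actual character on each side, so a "the" at the very start or end of a text, or two in a row separated by a single space, can be missed. None of the sample rows hit this, but real articles might. A hedged variant using lookarounds, which check the neighbouring characters without consuming them, is sketched below; the test string x and the names consuming and lookaround are my own, not part of the answer.

```r
library(stringr)

# Hypothetical test string: "the" appears four times as a standalone word,
# including at the start, at the end, and twice in a row
x <- "the cat saw the the end of the"

# Pattern from the answer: needs a non-letter character on both sides
consuming <- "[^\\p{L}]the[^\\p{L}]"

# Variant: assert "no letter before" and "no letter after" without consuming anything
lookaround <- "(?<!\\p{L})the(?!\\p{L})"

n_consuming <- str_count(x, consuming)
n_lookaround <- str_count(x, lookaround)

print(c(consuming = n_consuming, lookaround = n_lookaround))

# Words such as "gather" are still excluded by the lookaround variant
n_gather <- str_count("leaders gather for the un summit", lookaround)
print(n_gather)
```

The \\b version in the first answer handles the same cases for ordinary text; the lookaround form is mainly useful if you want to keep the "not a letter" definition of a boundary.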

Your data looks OK, but note that, depending on your use case, you may want to convert all text to lower case, either beforehand or inside the str_count call. This ensures that differences between upper and lower case do not interfere with the string matching (i.e. "the" != "The"). Converting all text to upper case and writing the search words in upper case is equivalent.
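
As a rough illustration of that point, here are two ways the case handling might look with stringr. The headlines vector is made-up mixed-case data for demonstration only, and the pattern reuses the word-boundary form from the first answer rather than the exact pattern above.

```r
library(stringr)

words <- c("newyorktimes", "washingtonpost", "the")
pattern <- str_c("\\b", words, "\\b", collapse = "|")

# Made-up mixed-case examples (not from the question's data)
headlines <- c(
  "The NewYorkTimes reports on the summit",
  "WashingtonPost: the economy"
)

# Case-sensitive matching only sees the lower-case "the"
n_sensitive <- str_count(headlines, pattern)

# Option 1: lower-case the text before counting
n_lowered <- str_count(str_to_lower(headlines), pattern)

# Option 2: leave the text alone and make the regex case-insensitive
n_ignore_case <- str_count(headlines, regex(pattern, ignore_case = TRUE))

print(n_sensitive)
print(n_lowered)
print(n_ignore_case)
```

Both options give the same counts here; the second keeps the original text column untouched, which can be convenient if the column is reused elsewhere.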

huangapple
  • Published on March 9, 2023, 23:39:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686859.html