Dropping posts by % or proportion of words recognized by sentiment model

Question
I have a dataset of 40K reddit posts and I am trying to estimate sentiment per post using a dictionary-based machine learning model. I am using a dictionary that contains 8K unique words and phrases to predict the sentiment.
One challenge I am facing is that for some posts the dictionary recognizes only one positive and/or one negative word, so I decided to exclude such posts from my dataset, as I think it would be misleading to predict sentiment when the dictionary captures only one or two words at most. I coded this as follows:
# Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(quanteda)
library(stm)
library(stmCorrViz)
library(stringi)

# Filtering out posts where ONLY 1 positive and 1 negative word are recognized
one_pn <- valences_by_post %>%                  # posts where only one word of each valence was recognized
  filter(positive == 1 & negative == 1)

valences_by_post_oneword <- valences_by_post %>%
  filter(!(positive == 1 & negative == 1))

# 2011 & 2012
valence_oneword <- valences_by_post_oneword %>%
  filter(year == 2011 | year == 2012) %>%
  group_by(month_year) %>%
  summarize(mean_valence = mean(valence), n = n())
However, it makes more sense to exclude posts by percentage rather than by a fixed word count, since the number of words varies across posts. I would therefore like to exclude posts where the dictionary recognizes 5% or less of the total words in a given post, but I am not sure how to code this.
Here is what the data currently looks like:
post  negative_words  positive_words  total_words  valence_score
xyz   2               1               10           -0.66
The positive and negative word columns give the number of words recognized by the dictionary per reddit post, "total_words" counts all words in a given post regardless of whether they were recognized, and "valence_score" is the estimated sentiment score per reddit post, which is measured as follows:

(negative words + positive words) / total words in the given post
Answer 1
Score: 1
Since you already have a column for the total words in the post, you can simply keep using filter(). For a dummy dataset like this:
library(tidyverse)

df <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100
)
You can do:
df %>% filter((negative_words + positive_words) > 0.05 * total_words)

# Output:
# A tibble: 2 × 4
  post  negative_words positive_words total_words
  <chr>          <dbl>          <dbl>       <dbl>
1 xyz                2              1          10
2 abc                5              3          12
which keeps the "xyz" and "abc" posts but filters out the "def" post, since its recognized words amount to less than 5% of its total words.
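If it helps to keep the recognized-word proportion as its own column (e.g. to inspect the distribution before committing to a cutoff), the same filter can be written with mutate(). This is just a variant of the one-liner above, using the same dummy data; the 0.05 threshold is the 5% from the question:

```r
library(tibble)
library(dplyr)

df <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100
)

# Compute the share of words the dictionary recognized,
# then keep only posts above the 5% threshold
df_filtered <- df %>%
  mutate(recognized_prop = (negative_words + positive_words) / total_words) %>%
  filter(recognized_prop > 0.05)

df_filtered
# "xyz" (0.30) and "abc" (~0.67) are kept; "def" (0.03) is dropped
```

Keeping recognized_prop around also makes it easy to report how many posts a given cutoff discards, which can help justify the 5% choice.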