Dropping posts by % or proportion of words recognized by sentiment model

Question

I have a dataset of 40K reddit posts and I am trying to estimate sentiment per post using a dictionary-based machine learning model. I am using a dictionary that contains 8K unique words and phrases to predict the sentiment.

One challenge I am facing is that for some posts, the dictionary only recognizes 1 positive and/or negative word, so I decided to exclude such posts from my dataset as I think it would be misleading to predict sentiment if the dictionary only captures one or two words at most. I coded this as follows:

#Loading packages
library(tidyverse)
require(readxl)
require(writexl)
library(quanteda)
library(stm)
library(stmCorrViz)
library(stringi)

#Filtering out posts where ONLY 1 positive and 1 negative words are recognized ##

valences_by_post_oneword <- valences_by_post  # subset data for posts where only one word was recognized 

one_pn <- valences_by_post_oneword %>%
  filter(positive==1 & negative==1)

valences_by_post_oneword <- valences_by_post_oneword %>%
  filter(!(positive==1 & negative==1)) 

#2011 & 2012
valence_oneowrd <- valences_by_post_oneword %>%
  filter(year == 2011 | year == 2012) %>%
  group_by(month_year) %>%
  summarize(mean_valence = mean(valence), n = n())

However, it makes more sense to use a % to exclude posts, rather than a specific number of words as the number of words varies across posts, so I would like to exclude posts where the dictionary only recognizes 5% or less of the total words in a given post, but I am not sure how to code this.

Here is what the data currently looks like:

post   negative_words   positive_words   total_words   valence_score
xyz    2                1                10            -0.66

The positive and negative word columns refer to the number of recognized words by the dictionary per reddit post, while "total_words" refers to all words in a given post regardless of whether they were recognized, and "valence_score" is the estimated sentiment score by reddit post, which is measured as follows:

negative words + positive words/ total words in the given post

Answer 1

Score: 1

Since you already have a column for total words in the post, you can simply keep using filter(). For a dummy dataset like this:

library(tidyverse)

df <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100,
)

You can do:

df %>% filter((negative_words + positive_words) > 0.05 * total_words)

# OUTPUT
# A tibble: 2 × 4
  post  negative_words positive_words total_words
  <chr>          <dbl>          <dbl>       <dbl>
1 xyz                2              1          10
2 abc                5              3          12

This keeps the "xyz" and "abc" posts but filters out the "def" post, because its recognized words (3 of 100, i.e. 3%) make up no more than 5% of its total words.
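If it helps to inspect the proportion before dropping anything, the same idea can be sketched with an explicit share column. This is only a sketch under assumptions: the data frame `posts`, the column `recognized_share`, the cutoff variable `min_share`, and the empty "ghi" row are all made up for illustration, and the column names follow the table in the question rather than the `positive` / `negative` names used in the question's code.

```r
library(dplyr)

# Hypothetical data, using the column names from the question's table
posts <- tibble::tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100,
  "ghi", 0, 0, 0    # made-up empty post to show the zero-length case
)

# Assumed cutoff: posts at or below 5% recognized words are dropped
min_share <- 0.05

posts_kept <- posts %>%
  # share of the post's words that the dictionary recognized
  mutate(recognized_share = (negative_words + positive_words) / total_words) %>%
  # 0 / 0 gives NaN, the comparison then yields NA, and filter() drops NA rows
  filter(recognized_share > min_share)

posts_kept
```

Keeping `recognized_share` as a column makes it easy to check how many posts a given cutoff removes before committing to it. The same `mutate()` and `filter()` steps could sit ahead of the `group_by(month_year)` call in the question's pipeline, assuming the real data carries these column names.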

Posted by huangapple on April 13, 2023 at 20:01:08
When reposting, please retain the link to this article: https://go.coder-hub.com/76005166.html