
Dropping posts by % or proportion of words recognized by sentiment model

Question

I have a dataset of 40K reddit posts and I am trying to estimate sentiment per post using a dictionary-based machine learning model. I am using a dictionary that contains 8K unique words and phrases to predict the sentiment.

One challenge I am facing is that for some posts, the dictionary only recognizes 1 positive and/or negative word, so I decided to exclude such posts from my dataset as I think it would be misleading to predict sentiment if the dictionary only captures one or two words at most. I coded this as follows:

    # Loading packages
    library(tidyverse)
    library(readxl)
    library(writexl)
    library(quanteda)
    library(stm)
    library(stmCorrViz)
    library(stringi)

# Filtering out posts where ONLY 1 positive and 1 negative word are recognized

    valences_by_post_oneword <- valences_by_post # subset data for posts where only one word was recognized
    one_pn <- valences_by_post_oneword %>%
      filter(positive == 1 & negative == 1)
    valences_by_post_oneword <- valences_by_post_oneword %>%
      filter(!(positive == 1 & negative == 1))
    # 2011 & 2012
    valence_oneword <- valences_by_post_oneword %>%
      filter(year == 2011 | year == 2012) %>%
      group_by(month_year) %>%
      summarize(mean_valence = mean(valence), n = n())

However, it makes more sense to exclude posts by percentage rather than by a fixed word count, since the number of words varies across posts. I would therefore like to exclude posts where the dictionary recognizes 5% or less of the total words in a given post, but I am not sure how to code this.
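A minimal sketch of such a percentage-based filter, assuming the data frame is `valences_by_post` with the column names shown in the data preview below (`negative_words`, `positive_words`, `total_words`; adjust to match the actual columns):

```r
library(dplyr)
library(tibble)

# Toy data standing in for valences_by_post; column names follow the preview
valences_by_post <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "def", 2, 1, 100
)

# Compute the share of words the dictionary recognized,
# then drop posts where that share is 5% or less
valences_filtered <- valences_by_post %>%
  mutate(recognized_share = (negative_words + positive_words) / total_words) %>%
  filter(recognized_share > 0.05)
```

Here "xyz" is kept (3/10 = 30% recognized) while "def" is dropped (3/100 = 3%).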

Here is what the data currently looks like:

    post  negative_words  positive_words  total_words  valence_score
    xyz.  2               1               10           -0.66

The positive and negative word columns give the number of words the dictionary recognized per reddit post, "total_words" counts all words in a given post regardless of whether they were recognized, and "valence_score" is the estimated sentiment score per post, measured as follows:

    (negative words + positive words) / total words in the given post

Answer 1

Score: 1

Since you already have a column for total words in the post, you can simply keep using filter(). For a dummy dataset like this:

    library(tidyverse)
    df <- tribble(
      ~post, ~negative_words, ~positive_words, ~total_words,
      "xyz", 2, 1, 10,
      "abc", 5, 3, 12,
      "def", 2, 1, 100,
    )

You can do:

    df %>% filter((negative_words + positive_words) > 0.05 * total_words)
    # OUTPUT
    # A tibble: 2 × 4
    #   post  negative_words positive_words total_words
    #   <chr>          <dbl>          <dbl>       <dbl>
    # 1 xyz                2              1          10
    # 2 abc                5              3          12

which keeps the "xyz" and "abc" posts but filters out the "def" post, since its recognized words amount to less than 5% of its total words.
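If you want to see the threshold's effect before dropping rows, a variant (still a sketch on the same dummy data) keeps the recognized-word proportion as a column, so you can count how many posts a given cutoff would remove:

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100
)

# Store the proportion explicitly instead of computing it inside filter()
with_share <- df %>%
  mutate(recognized_share = (negative_words + positive_words) / total_words)

# How many posts fall at or below the 5% cutoff
n_dropped <- sum(with_share$recognized_share <= 0.05)

kept <- with_share %>% filter(recognized_share > 0.05)
```

Keeping `recognized_share` around also lets you check the distribution (e.g. `summary(with_share$recognized_share)`) before committing to 5% as the threshold.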


huangapple
  • Published on 2023-04-13 20:01:08
  • Please retain this link when reposting: https://go.coder-hub.com/76005166.html