Dropping posts by % or proportion of words recognized by sentiment model

Question
I have a dataset of 40K reddit posts and I am trying to estimate sentiment per post using a dictionary-based machine learning model. I am using a dictionary that contains 8K unique words and phrases to predict the sentiment.
One challenge I am facing is that for some posts the dictionary recognizes only one positive and/or one negative word, so I decided to exclude such posts from my dataset, as I think it would be misleading to predict sentiment when the dictionary captures only one or two words at most. I coded this as follows:
# Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(quanteda)
library(stm)
library(stmCorrViz)
library(stringi)

# Filtering out posts where ONLY 1 positive and 1 negative word are recognized
one_pn <- valences_by_post %>%                  # posts where only one word of each valence was recognized
  filter(positive == 1 & negative == 1)

valences_by_post_oneword <- valences_by_post %>%
  filter(!(positive == 1 & negative == 1))

# 2011 & 2012
valence_oneword <- valences_by_post_oneword %>%
  filter(year == 2011 | year == 2012) %>%
  group_by(month_year) %>%
  summarize(mean_valence = mean(valence), n = n())
However, it makes more sense to exclude posts by percentage rather than by a fixed word count, since the number of words varies across posts. I would therefore like to exclude posts where the dictionary recognizes 5% or less of the total words in a given post, but I am not sure how to code this.
Here is what the data currently looks like:
post  negative_words  positive_words  total_words  valence_score
xyz   2               1               10           -0.66
The positive and negative word columns give the number of words recognized by the dictionary per reddit post, "total_words" counts all words in a given post regardless of whether they were recognized, and "valence_score" is the estimated sentiment score per reddit post, which is measured as follows:

(negative words + positive words) / total words in the given post
Answer 1
Score: 1
Since you already have a column for the total words in the post, you can simply keep using filter(). For a dummy dataset like this:
library(tidyverse)

df <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100
)
You can do:
df %>% filter((negative_words + positive_words) > 0.05 * total_words)

# Output:
# A tibble: 2 × 4
  post  negative_words positive_words total_words
  <chr>          <dbl>          <dbl>       <dbl>
1 xyz                2              1          10
2 abc                5              3          12
which keeps the "xyz" and "abc" posts but filters out the "def" post, since its recognized words amount to less than 5% of its total words.
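If it helps to keep the recognized-word proportion as its own column (e.g. to inspect the distribution before committing to a cutoff), the same filter can be written with mutate(). This is just a variant of the one-liner above, using the same dummy data; the 0.05 threshold is the 5% from the question:

```r
library(tibble)
library(dplyr)

df <- tribble(
  ~post, ~negative_words, ~positive_words, ~total_words,
  "xyz", 2, 1, 10,
  "abc", 5, 3, 12,
  "def", 2, 1, 100
)

# Compute the share of words the dictionary recognized,
# then keep only posts above the 5% threshold
df_filtered <- df %>%
  mutate(recognized_prop = (negative_words + positive_words) / total_words) %>%
  filter(recognized_prop > 0.05)

df_filtered
# "xyz" (0.30) and "abc" (~0.67) are kept; "def" (0.03) is dropped
```

Keeping recognized_prop around also makes it easy to report how many posts a given cutoff discards, which can help justify the 5% choice.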