Mutating a sentiment indicator using time conditions, rather than quarter/month (dplyr)

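Question

I have a Reddit dataset in which each row represents a single Reddit post. Each post has a sentiment score, and I also have a variable capturing the average sentiment of all posts written by the same username.

I am trying to create a sentiment indicator tied to the timeline of a minimum wage policy, and I would like to categorize sentiment per username across three periods:

1- Before the policy's announcement (say the announcement is on "2021-03-01")
2- After the policy's announcement but before its implementation, so after "2021-03-01" but before "2021-09-01"
3- After the policy's implementation on "2021-09-01"

I have been able to compute sentiment for each username by month or quarter, as shown below, but I would like to compute sentiment per username based on the specific policy timeline above, and I am not sure how to do that.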

Load packages

    library(tidyverse)
    library(lubridate)
    library(zoo)

Print a data example with specific columns

    dput(df[1:5, c(3, 4, 21, 22, 23)])

Output:

    structure(list(date = structure(c(15149, 15150, 15150, 15150,
    15150), class = "Date"), username = c("ax", "aa",
    "cartman", "abc", "aff"
    ), quarter_yr = c("2011 Q2", "2011 Q2", "2011 Q2", "2011 Q2",
    "2011 Q2"), sentiment_score = c("0", "-1", "1", "-1", "-1"),
    avg_sentiment = c(0.0666666666666667, -0.777777777777778,
    1, -1, -1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
    ), row.names = c(NA, -5L), groups = structure(list(username = c("ax",
    "cartman", "abc", "aff"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
    ), row.names = c(NA, -5L), .drop = TRUE))

Create a quarter/year variable

    sentiment_df <- sentiment_df %>%
      mutate(date = ymd(date),
             quarter_yr = paste(year(date), quarters(date)))

Compute an average sentiment score per username, based on the many observations/posts they have:

    sentiment_df <- df %>%
      group_by(username, quarter_yr) %>%
      summarise(avg_sentiment = mean(as.numeric(sentiment_score)))

Quarterly sentiment by username:

    dput(sentiment_df[1:2, c(1, 8)])

Output:

    structure(list(username = c("cartman","aa"
    ), `2014 Q2` = c(NA_real_, NA_real_)), class = c("grouped_df",
    "tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L), groups = structure(list(
    username = c("cartman","aa"), .rows = structure(list(
    1L, 2L), ptype = integer(0), class = c("vctrs_list_of",
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
    ), row.names = c(NA, -2L), .drop = TRUE))

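Answer 1

Score: 1

    # Label each post with its policy phase; case_when() uses the first matching condition
    sentiment_df <- sentiment_df %>%
      mutate(date = ymd(date),
             quarter_yr = paste(year(date), quarters(date)),
             phase = case_when(date < ymd(20210301) ~ "1 Before announcement",
                               date < ymd(20210901) ~ "2 Before implementation",
                               TRUE ~ "3 After implementation"))

    # phase was added to sentiment_df, so group that data frame (not df)
    sentiment_df <- sentiment_df %>%
      group_by(username, phase) %>%
      summarise(avg_sentiment = mean(as.numeric(sentiment_score)))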
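
As a small self-contained check of the case_when() logic above, here is a sketch on made-up data (the `toy` data frame, its usernames, and its dates are assumptions for illustration, not the asker's data). case_when() evaluates its conditions in order and takes the first match, which is why the second condition does not need to repeat a `date >= ymd(20210301)` test.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: one post on each side of the two policy dates
toy <- tibble(
  username = c("u1", "u1", "u2", "u2"),
  date = ymd(c("2021-02-28", "2021-03-01", "2021-08-31", "2021-09-01")),
  sentiment_score = c("1", "-1", "0", "1")
)

toy <- toy %>%
  mutate(phase = case_when(
    date < ymd("2021-03-01") ~ "1 Before announcement",
    # Only reached when the first test is FALSE, i.e. date >= 2021-03-01
    date < ymd("2021-09-01") ~ "2 Before implementation",
    TRUE ~ "3 After implementation"
  ))

# Boundary dates land in the later phase: 2021-03-01 is phase 2, 2021-09-01 is phase 3
toy$phase
```

One thing to watch: a missing date fails both comparisons and therefore falls into the final TRUE branch. If the data may contain NA dates, adding `is.na(date) ~ NA_character_` as the first condition keeps them out of the "after" phase.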


Answer 2

Score: 1

It seems you can simply create a new variable using mutate() and case_when(), and then group by that new variable. Here is my attempt. Is this what you are after?

    library(dplyr)
    library(lubridate)
    library(zoo)

    # Sample data from the question
    sentiment_df <- structure(list(date = structure(c(15149, 15150, 15150, 15150,
    15150), class = "Date"), username = c("ax", "aa", "cartman", "abc", "aff"),
    quarter_yr = c("2011 Q2", "2011 Q2", "2011 Q2", "2011 Q2", "2011 Q2"),
    sentiment_score = c("0", "-1", "1", "-1", "-1"),
    avg_sentiment = c(0.0666666666666667, -0.777777777777778, 1, -1, -1)),
    class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L),
    groups = structure(list(username = c("ax", "cartman", "abc", "aff"),
    .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0),
    class = c("vctrs_list_of", "vctrs_vctr", "list"))),
    class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), .drop = TRUE))

    # Label each post with its policy period, then average per user and period
    sentiment_df <- sentiment_df %>%
      mutate(date = ymd(date),
             quarter_yr = paste(year(date), quarters(date)),
             implementation_period = case_when(
               date < as.Date("2021-03-01") ~ "Before",
               date >= as.Date("2021-03-01") & date < as.Date("2021-09-01") ~ "Pre_Implementation",
               TRUE ~ "After"))

    sentiment_df <- sentiment_df %>%
      group_by(username, implementation_period) %>%
      summarise(avg_sentiment = mean(as.numeric(sentiment_score)))

One quick note: the sample data you provided contains only "Before" dates, but I think this should work on the whole dataset.
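
Since the sample only covers the "Before" period, one way to exercise the other branches is with made-up posts that span all three periods. The sketch below is only an illustration under that assumption: the `posts` data frame and its values are hypothetical, and the final reshaping step with tidyr::pivot_wider() is an optional extra that gives one column per period, similar in layout to the quarterly table in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical posts spanning all three policy periods
posts <- tibble(
  username = c("aa", "aa", "aa", "cartman", "cartman"),
  date = as.Date(c("2021-01-15", "2021-01-20", "2021-05-10", "2021-05-11", "2021-10-01")),
  sentiment_score = c("1", "-1", "1", "0", "-1")
)

# Average sentiment per username and policy period (long format)
period_sentiment <- posts %>%
  mutate(implementation_period = case_when(
    date < as.Date("2021-03-01") ~ "Before",
    date < as.Date("2021-09-01") ~ "Pre_Implementation",
    TRUE ~ "After"
  )) %>%
  group_by(username, implementation_period) %>%
  summarise(avg_sentiment = mean(as.numeric(sentiment_score)), .groups = "drop")

# One row per username, one column per period;
# NA where a user has no posts in that period
period_wide <- period_sentiment %>%
  pivot_wider(names_from = implementation_period, values_from = avg_sentiment)

period_wide
```

The long format (`period_sentiment`) is usually easier to plot or model, while the wide format (`period_wide`) makes it easy to compare a user's sentiment across periods side by side.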


huangapple
  • Published on March 7, 2023, 03:34:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75655086.html