Mutating a sentiment indicator using time conditions, rather than quarter/month (dplyr)
Question
I have a Reddit dataset where each row represents a single Reddit post, with a sentiment score for each post and the username that wrote it. I also have a variable capturing the average sentiment across all posts written by the same username.
I am trying to create a sentiment indicator tied to the timeline of a minimum wage policy, and I would like to categorize sentiment per username based on three periods:
1- Before the policy's announcement, let's say before "2021-03-01"
2- After the policy's announcement but before its implementation, so from "2021-03-01" up to but not including "2021-09-01"
3- After the policy's implementation, so on or after "2021-09-01"
I have been able to compute sentiment for each username by month or quarter, as shown below, but I would like to compute sentiment per username for the specific policy periods above, and I am not sure how to do that.
Load packages
library(tidyverse)
library(lubridate)
library(zoo)
Print data example with specific columns
dput(df[1:5,c(3,4,21, 22, 23)])
output:
structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("ax", "aa",
"cartman", "abc", "aff"
), quarter_yr = c("2011 Q2", "2011 Q2", "2011 Q2", "2011 Q2",
"2011 Q2"), sentiment_score = c("0", "-1", "1", "-1", "-1"),
avg_sentiment = c(0.0666666666666667, -0.777777777777778,
1, -1, -1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), groups = structure(list(username = c("ax",
"cartman", "abc", "aff"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
Create a quarter/year variable
df <- df %>%
  mutate(date = ymd(date),
         quarter_yr = paste(year(date), quarters(date)))
Compute the average sentiment score per username and quarter, since each username has many observations/posts:
sentiment_df <-
  df %>%
  group_by(username, quarter_yr) %>%
  summarise(avg_sentiment = mean(as.numeric(sentiment_score)))
Quarterly sentiment by username:
dput(sentiment_df[1:2,c(1,8)])
output
structure(list(username = c("cartman","aa"
), `2014 Q2` = c(NA_real_, NA_real_)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L), groups = structure(list(
username = c("cartman","aa"), .rows = structure(list(
1L, 2L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))
Answer 1
Score: 1
# Tag each post with its policy phase; case_when() uses the first condition that matches
df <- df %>%
  mutate(date = ymd(date),
         quarter_yr = paste(year(date), quarters(date)),
         phase = case_when(date < ymd(20210301) ~ "1 Before announcement",
                           date < ymd(20210901) ~ "2 Before implementation",
                           TRUE ~ "3 After implementation"))
# Average sentiment per username within each phase
sentiment_df <-
  df %>%
  group_by(username, phase) %>%
  summarise(avg_sentiment = mean(as.numeric(sentiment_score)))
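The quarterly output in the question appears to be in wide format, with one column per quarter. If the same layout is wanted for the policy phases, the grouped summary can be reshaped with tidyr's pivot_wider(). The following is only a sketch: phase_long is a hypothetical stand-in for the summarised result, and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format summary: one row per username and phase
phase_long <- tibble(
  username = c("aa", "aa", "bb"),
  phase = c("1 Before announcement", "3 After implementation",
            "1 Before announcement"),
  avg_sentiment = c(-0.5, 0.25, 1)
)

# One row per username, one column per phase;
# username/phase combinations with no posts become NA
phase_wide <- phase_long %>%
  pivot_wider(names_from = phase, values_from = avg_sentiment)

print(phase_wide)
```

A phase that never occurs in the data (here "2 Before implementation") gets no column at all, so it may be worth checking which phases are actually present before relying on the column names.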
Answer 2
Score: 1
It seems like you can simply create a new variable using mutate() and case_when(), and then group by the new variable. Here is my attempt. Is this what you are after?
library(dplyr)
library(lubridate)
library(zoo)
sentiment_df<-structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("ax", "aa",
"cartman", "abc", "aff"
), quarter_yr = c("2011 Q2", "2011 Q2", "2011 Q2", "2011 Q2",
"2011 Q2"), sentiment_score = c("0", "-1", "1", "-1", "-1"),
avg_sentiment = c(0.0666666666666667, -0.777777777777778,
1, -1, -1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), groups = structure(list(username = c("ax",
"cartman", "abc", "aff"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
sentiment_df <- sentiment_df %>%
  mutate(date = ymd(date),
         quarter_yr = paste(year(date), quarters(date)),
         implementation_period = case_when(date < as.Date("2021-03-01") ~ "Before",
                                           date >= as.Date("2021-03-01") & date < as.Date("2021-09-01") ~ "Pre_Implementation",
                                           TRUE ~ "After"))
sentiment_df <-
  sentiment_df %>%
  group_by(username, implementation_period) %>%
  summarise(avg_sentiment = mean(as.numeric(sentiment_score)))
One quick note: the sample data you provided contains only "Before" dates, but I think this should work on the whole dataset.
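Since the sample data covers only one period, a small made-up dataset can help check the cutoffs. The sketch below assumes the same column names as the question (date, username, sentiment_score); the toy rows, the toy and toy_summary names, and the extra n_posts column are hypothetical additions for illustration.

```r
library(dplyr)

# Hypothetical toy data with dates on both sides of each cutoff
toy <- tibble(
  date = as.Date(c("2021-02-28", "2021-03-01", "2021-08-31",
                   "2021-09-01", "2021-01-15", "2021-10-10")),
  username = c("aa", "aa", "aa", "aa", "bb", "bb"),
  sentiment_score = c("1", "-1", "1", "0", "-1", "1")
)

toy_summary <- toy %>%
  # case_when() takes the first matching condition, so the second
  # test only sees dates on or after 2021-03-01
  mutate(implementation_period = case_when(
    date < as.Date("2021-03-01") ~ "Before",
    date < as.Date("2021-09-01") ~ "Pre_Implementation",
    TRUE ~ "After"
  )) %>%
  group_by(username, implementation_period) %>%
  summarise(avg_sentiment = mean(as.numeric(sentiment_score)),
            n_posts = n(),
            .groups = "drop")

print(toy_summary)
```

Posts dated exactly on a cutoff land in the later period ("2021-03-01" is "Pre_Implementation", "2021-09-01" is "After"). Note also that a missing date would fall through to the TRUE branch and be labelled "After", so it may be safer to filter out or separately label NA dates first.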