Counting string mentions on Reddit by period (dplyr)
Question
I have Reddit data and I am trying to count the mentions of different private firms before and after the introduction of a minimum wage policy in Oregon. All companies mentioned by subreddit users are coded under the "directed_to_whom" variable, which holds the name of the private firm mentioned in a given Reddit post.
I would then like to create a bar graph that shows each company's share of mentions relative to the minimum wage policy, using a string variable ("treatment_announcement") coded "pre" if the mention came before the policy and "post" otherwise.
My data are structured as follows:
dput(df[1:6,c(2, 3)]) # Print data example with specific columns
Data example:
structure(list(directed_to_whom = c("nike", "nike",
"amazon", "walmart", "walmart", "walmart"), treatment_announcement = c("pre",
"pre", "pre", "pre", "post", "post")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L)) -> df
I then calculated the share of company mentions as follows:
df2 <- df %>%
  select(directed_to_whom, treatment_announcement) %>%
  group_by(treatment_announcement) %>%
  summarise(total_posts = n(),
            entity_count = sum(directed_to_whom == "nike"),
            entity_count = sum(directed_to_whom == "amazon"),
            entity_count = sum(directed_to_whom == "walmart"),
            entity_share = entity_count/total_posts * 100)
The code runs without errors, but it does not capture the indicator I am interested in: the share of each company's mentions before and after the minimum wage policy, rather than the average share of all companies pre- and post-policy.
Here is an example of what this produces:
dput(df2[1:2,c(1,2, 3,4)])
Output:
structure(list(treatment_announcement = c("post", "pre"), total_posts = c(1013L,
179L), entity_count = c(152L, 26L), entity_share = c(15.004935834156,
14.5251396648045)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Is it possible to instead compute the share of, say, "walmart" mentions before the policy and compare it to the share of "amazon" mentions, also pre-policy?
Here is the desired output data frame:
directed_to_whom  treatment_announcement  entity_share
walmart           pre                     45%
amazon            pre                     10%
nike              pre                     45%
walmart           post                    60%
amazon            post                    15%
nike              post                    25%
Answer 1
Score: 1
I created a data frame with more rows to showcase the solution (see the Data section at the end of this answer).
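library(dplyr)

# Count mentions per company and period, then turn the counts into
# within-period shares. The result is stored as df_out for plotting.
df_out <- df %>%
  group_by(directed_to_whom, treatment_announcement) %>%
  summarise(val_count = n(), .groups = "drop") %>%
  group_by(treatment_announcement) %>%
  mutate(entity_share = prop.table(val_count)) %>%
  ungroup() %>%
  select(treatment_announcement, directed_to_whom, entity_share) %>%
  arrange(treatment_announcement, directed_to_whom)
df_out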
#> # A tibble: 6 x 3
#> treatment_announcement directed_to_whom entity_share
#> <chr> <chr> <dbl>
#> 1 post amazon 0.25
#> 2 post nike 0.25
#> 3 post walmart 0.5
#> 4 pre amazon 0.25
#> 5 pre nike 0.5
#> 6 pre walmart 0.25
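As a cross-check that does not depend on dplyr, the same within-period shares can be sketched in base R with `table()` and `prop.table()`. This is only an illustrative alternative, not part of the original answer: the names `counts`, `shares`, `shares_long`, and `entity_share_pct` are made up here, and the data frame simply repeats the 8-row example from the Data section.

```r
# Same 8-row example data as in the Data section of this answer.
df <- data.frame(
  directed_to_whom = c("nike", "nike", "amazon", "walmart",
                       "walmart", "walmart", "nike", "amazon"),
  treatment_announcement = c("pre", "pre", "pre", "pre",
                             "post", "post", "post", "post")
)

# Cross-tabulate period (rows) by company (columns).
counts <- table(df$treatment_announcement, df$directed_to_whom)

# margin = 1 divides each row by its row total, so the shares
# sum to 1 within each period.
shares <- prop.table(counts, margin = 1)

# Long format: one row per period/company pair, plus a percent label
# (hypothetical column name) in the style of the desired output.
shares_long <- as.data.frame(shares, stringsAsFactors = FALSE)
names(shares_long) <- c("treatment_announcement", "directed_to_whom",
                        "entity_share")
shares_long$entity_share_pct <- paste0(round(100 * shares_long$entity_share), "%")
shares_long
```

Using `margin = 1` is what makes the shares relative to the period total rather than to the whole data set, which is the comparison the question asks for.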
You can also plot it like this:
library(ggplot2)

# One panel per period; bar height is each company's share of mentions.
ggplot(df_out) +
  geom_col(aes(x = directed_to_whom, y = entity_share, fill = treatment_announcement),
           show.legend = FALSE) +
  facet_wrap(~treatment_announcement, ncol = 1) +
  theme_bw()
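If the goal is to compare "pre" and "post" for the same company directly, a side-by-side (dodged) layout may be easier to read than separate panels. The following is a minimal sketch under a few assumptions: `df_out` is rebuilt by hand from the shares printed above so that the block runs on its own, the axis and legend labels are my own wording, and `scales::percent` is used only to format the axis.

```r
library(ggplot2)

# Shares copied from the printed result above (8-row example data).
df_out <- data.frame(
  treatment_announcement = rep(c("post", "pre"), each = 3),
  directed_to_whom = rep(c("amazon", "nike", "walmart"), times = 2),
  entity_share = c(0.25, 0.25, 0.5, 0.25, 0.5, 0.25)
)

# Show "pre" before "post" in the legend and within each pair of bars.
df_out$treatment_announcement <- factor(df_out$treatment_announcement,
                                        levels = c("pre", "post"))

# Dodged bars: one pair of bars per company, one bar per period.
p <- ggplot(df_out, aes(x = directed_to_whom, y = entity_share,
                        fill = treatment_announcement)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Company", y = "Share of mentions", fill = "Period") +
  theme_bw()
p
```

Converting the period to a factor with explicit levels is what controls the bar order; without it, ggplot2 would sort the periods alphabetically and draw "post" first.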
Data:
df <- structure(list(directed_to_whom = c("nike", "nike", "amazon", "walmart",
"walmart", "walmart","nike", "amazon"),
treatment_announcement = c("pre", "pre", "pre", "pre",
"post", "post", "post", "post")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L))
#> 1 nike pre
#> 2 nike pre
#> 3 amazon pre
#> 4 walmart pre
#> 5 walmart post
#> 6 walmart post
#> 7 nike post
#> 8 amazon post