Counting string mentions on Reddit by period (dplyr)

Question

I have Reddit data and I am trying to count the mentions of different private firms before and after the introduction of a minimum wage policy in Oregon. All companies mentioned by subreddit users are coded under the "directed_to_whom" variable, which holds the name of the private firm mentioned in a given Reddit post.

I would then like to create a bar graph that shows each company's share of mentions relative to the minimum wage policy, using a string variable (treatment_announcement in the sample below) coded "pre" if the company mention came before the policy and "post" otherwise.

My data are structured as follows:

dput(df[1:6, c(2, 3)]) # Print data example with specific columns

Data example:

df <- structure(list(directed_to_whom = c("nike", "nike", 
"amazon", "walmart", "walmart", "walmart"), treatment_announcement = c("pre", 
"pre", "pre", "pre", "post", "post")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L))

I then calculated the share of company mentions as follows:

df2 <- df %>%
  select(directed_to_whom, treatment_announcement) %>%
  group_by(treatment_announcement) %>%
  summarise(total_posts = n(),
            entity_count = sum(directed_to_whom == "nike"),
            entity_count = sum(directed_to_whom == "amazon"),
            entity_count = sum(directed_to_whom == "walmart"),
            entity_share = entity_count/total_posts * 100)

The code runs without errors, but it isn't capturing the indicator I am interested in, which is the share of each company's mentions pre and post the minimum wage policy, rather than the average share of all companies pre and post policy.

Here is the data example of what this produces:

dput(df2[1:2, c(1, 2, 3, 4)])

Output:

structure(list(treatment_announcement = c("post", "pre"), total_posts = c(1013L, 
179L), entity_count = c(152L, 26L), entity_share = c(15.004935834156, 
14.5251396648045)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

Is it possible to instead compute the share of, say, "walmart" mentions before the policy and compare it to the share of "amazon" mentions, also pre-policy?

Here is the desired df output:

directed_to_whom   treatment_announcement   entity_share
walmart            pre                      45%
amazon             pre                      10%
nike               pre                      45%
walmart            post                     60%
amazon             post                     15%
nike               post                     25%

Answer 1

Score: 1

I created a dataframe with more rows to showcase the solution; see below.

library(dplyr)

# Count mentions per firm and period, then convert the counts into
# within-period shares. The result is stored as df_out for plotting.
df_out <- df %>% 
  group_by(directed_to_whom, treatment_announcement) %>% 
  summarise(val_count = n(), .groups = "drop") %>% 
  group_by(treatment_announcement) %>% 
  # Note: dplyr >= 1.1.0 prefers reframe() for multi-row results like this
  summarise(directed_to_whom = directed_to_whom,
            entity_share = prop.table(val_count),
            .groups = "drop")

df_out
#> # A tibble: 6 x 3
#>   treatment_announcement directed_to_whom entity_share
#>   <chr>                  <chr>                   <dbl>
#> 1 post                   amazon                   0.25
#> 2 post                   nike                     0.25
#> 3 post                   walmart                  0.5 
#> 4 pre                    amazon                   0.25
#> 5 pre                    nike                     0.5 
#> 6 pre                    walmart                  0.25
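For comparison, here is a minimal sketch of the same calculation using count() and mutate(), which avoids returning several rows per group from summarise(). It is an alternative, not the code from the answer above: the name share_tbl is made up for illustration, the sample data is rebuilt inline so the block runs on its own, and the share is multiplied by 100 on the assumption that percentages (as in the question's desired output) are wanted. It also shows why the attempt in the question reported a single figure per period: assigning entity_count three times inside one summarise() keeps only the last assignment.

```r
library(dplyr)

# Sample data, same values as in the "Data" section of this answer
df <- data.frame(
  directed_to_whom = c("nike", "nike", "amazon", "walmart",
                       "walmart", "walmart", "nike", "amazon"),
  treatment_announcement = c("pre", "pre", "pre", "pre",
                             "post", "post", "post", "post")
)

# share_tbl is a hypothetical name used only in this sketch
share_tbl <- df %>%
  # one row per period/firm combination with its number of mentions
  count(treatment_announcement, directed_to_whom, name = "val_count") %>%
  # shares are computed within each period, so they sum to 100 per period
  group_by(treatment_announcement) %>%
  mutate(entity_share = val_count / sum(val_count) * 100) %>%
  ungroup()

share_tbl
```

With this sample data, walmart accounts for 50% of the "post" mentions and nike for 50% of the "pre" mentions; every other firm/period combination is 25%.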

You can also plot it like this:

library(ggplot2)

# df_out is the summary table of shares computed above
ggplot(df_out) +
  geom_bar(aes(x = directed_to_whom, y = entity_share, fill = treatment_announcement), 
           stat = "identity", show.legend = FALSE) + 
  facet_wrap(~treatment_announcement, ncol = 1) +
  theme_bw()
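If the goal is to compare a firm's pre and post shares directly, a dodged bar chart may read more easily than two facets. The following is only a sketch of that variant, not part of the original answer: it rebuilds the sample data and the share table inline so it runs on its own, and the object name p, the axis labels, and the pre/post factor ordering are illustrative assumptions.

```r
library(dplyr)
library(ggplot2)

# Sample data, same values as in the "Data" section of this answer
df <- data.frame(
  directed_to_whom = c("nike", "nike", "amazon", "walmart",
                       "walmart", "walmart", "nike", "amazon"),
  treatment_announcement = c("pre", "pre", "pre", "pre",
                             "post", "post", "post", "post")
)

# Share of mentions per firm within each period (proportions between 0 and 1)
df_out <- df %>%
  count(treatment_announcement, directed_to_whom, name = "val_count") %>%
  group_by(treatment_announcement) %>%
  mutate(entity_share = val_count / sum(val_count)) %>%
  ungroup() %>%
  # assumption: show "pre" before "post" in the legend and within each pair
  mutate(treatment_announcement = factor(treatment_announcement,
                                         levels = c("pre", "post")))

# p is a hypothetical name; bars for pre and post sit side by side per firm
p <- ggplot(df_out,
            aes(x = directed_to_whom, y = entity_share,
                fill = treatment_announcement)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Firm", y = "Share of mentions", fill = "Period") +
  theme_bw()

p
```

geom_col() is used here as the shorthand for geom_bar(stat = "identity"); the y axis is formatted as a percentage via the scales package, which is installed alongside ggplot2.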


Data:

df <- structure(list(directed_to_whom = c("nike", "nike", "amazon", "walmart", 
                                          "walmart", "walmart", "nike", "amazon"), 
                     treatment_announcement = c("pre", "pre", "pre", "pre", 
                                                "post", "post", "post", "post")), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L))

#> 1 nike             pre                   
#> 2 nike             pre                   
#> 3 amazon           pre                   
#> 4 walmart          pre                   
#> 5 walmart          post                  
#> 6 walmart          post                  
#> 7 nike             post                  
#> 8 amazon           post

huangapple
  • Posted on April 7, 2023, 02:44:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75952790.html