April 7, 2023, 02:44:43

Counting string mentions on reddit by period (dplyr)

Question

I have Reddit data and am trying to count mentions of different private firms before and after the introduction of a minimum wage policy in Oregon. Every company mentioned by subreddit users is coded in the "directed_to_whom" variable, which holds the name of the private firm mentioned in a given Reddit post.

I would then like to create a bar graph showing each company's share of mentions before and after the policy. The period is coded in a string variable, "treatment_announcement", which is "pre" if the mention predates the policy and "post" otherwise.

My data are structured as follows:

dput(df[1:6,c(2, 3)]) # Print data example with specific columns

data example:

structure(list(directed_to_whom = c("nike", "nike", 
"amazon", "walmart", "walmart", "walmart"), treatment_announcement = c("pre", 
"pre", "pre", "pre", "post", "post")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L)) -> df

I then calculated the share of company mentions as follows:

df2  <- df %>%
  select(directed_to_whom, treatment_announcement) %>%
  group_by(treatment_announcement) %>%
  summarise(total_posts = n(),
            entity_count = sum(directed_to_whom == "nike"),
            entity_count = sum(directed_to_whom == "amazon"),
            entity_count = sum(directed_to_whom == "walmart"),
            entity_share = entity_count/total_posts * 100)

The code runs without errors, but it doesn't capture the indicator I am interested in: the share of each company's mentions pre- and post-policy, rather than the average share of all companies pre- and post-policy.

Here is the data example of what this produces:

dput(df2[1:2,c(1,2, 3,4)])

output:

structure(list(treatment_announcement = c("post", "pre"), total_posts = c(1013L, 
179L), entity_count = c(152L, 26L), entity_share = c(15.004935834156, 
14.5251396648045)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

Is it possible to instead compute the share of, say, "walmart" mentions before the policy and compare it with the share of "amazon" mentions, also pre-policy?

Here is the desired df output:

"directed_to_whom"   "treatment_announcement"   "entity_share"
walmart              pre                        45%
amazon               pre                        10%
nike                 pre                        45%
walmart              post                       60%
amazon               post                       15%
nike                 post                       25%

Answer 1

Score: 1

I created a dataframe with more rows to showcase the solution; see below.

library(dplyr)

# Count mentions per period and company, then turn the counts into
# within-period shares. mutate() keeps one row per company; summarise()
# returning several rows per group is deprecated since dplyr 1.1.0.
df_out <- df %>%
  count(treatment_announcement, directed_to_whom, name = "val_count") %>%
  group_by(treatment_announcement) %>%
  mutate(entity_share = prop.table(val_count)) %>%
  ungroup() %>%
  select(treatment_announcement, directed_to_whom, entity_share)
df_out
#> # A tibble: 6 x 3
#>   treatment_announcement directed_to_whom entity_share
#>   <chr>                  <chr>                   <dbl>
#> 1 post                   amazon                   0.25
#> 2 post                   nike                     0.25
#> 3 post                   walmart                  0.5 
#> 4 pre                    amazon                   0.25
#> 5 pre                    nike                     0.5 
#> 6 pre                    walmart                  0.25
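
As a cross-check that does not depend on dplyr, the same within-period shares can be sketched in base R with `table()` and `prop.table()`. This is only an illustration on the 8-row example data; the names `counts`, `shares`, `shares_long` and the `entity_share_pct` column are made up for this sketch, and the percentage labels are added only because the desired output in the question shows values such as 45%.

```r
# Same 8-row example data as in the Data section (plain data.frame here)
df <- data.frame(
  directed_to_whom = c("nike", "nike", "amazon", "walmart",
                       "walmart", "walmart", "nike", "amazon"),
  treatment_announcement = c("pre", "pre", "pre", "pre",
                             "post", "post", "post", "post")
)

# Cross-tabulate period (rows) by company (columns)
counts <- table(df$treatment_announcement, df$directed_to_whom)

# margin = 1 divides each row by its row total, giving within-period shares
shares <- prop.table(counts, margin = 1)

# Long format, one row per period/company pair, plus a percentage label
shares_long <- as.data.frame(shares, stringsAsFactors = FALSE)
names(shares_long) <- c("treatment_announcement", "directed_to_whom",
                        "entity_share")
shares_long$entity_share_pct <- sprintf("%.0f%%", 100 * shares_long$entity_share)
shares_long
```

Each row of `shares` sums to 1, so the companies can be compared within the "pre" period or within the "post" period, which is the comparison the question asks for.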

You can also plot it like this:

library(ggplot2)

# df_out: the result of the dplyr pipeline above, stored in a variable
ggplot(df_out) +
  geom_col(aes(x = directed_to_whom, y = entity_share, fill = treatment_announcement),
           show.legend = FALSE) +
  facet_wrap(~treatment_announcement, ncol = 1) +
  theme_bw()
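
If the goal is to compare a company's pre and post shares directly, a dodged bar chart may be easier to read than two facets. The following is a rough sketch rather than a definitive layout: it rebuilds the shares from the example data, orders the period factor so that "pre" is drawn before "post", and labels the y axis in percent. The axis and legend labels and the object name `p` are assumptions chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Same 8-row example data as in the Data section
df <- tibble(
  directed_to_whom = c("nike", "nike", "amazon", "walmart",
                       "walmart", "walmart", "nike", "amazon"),
  treatment_announcement = c("pre", "pre", "pre", "pre",
                             "post", "post", "post", "post")
)

# Within-period share of mentions per company
df_out <- df %>%
  count(treatment_announcement, directed_to_whom, name = "val_count") %>%
  group_by(treatment_announcement) %>%
  mutate(entity_share = val_count / sum(val_count)) %>%
  ungroup() %>%
  # Explicit level order so "pre" is plotted before "post"
  mutate(treatment_announcement = factor(treatment_announcement,
                                         levels = c("pre", "post")))

# Side-by-side bars: one pair of bars (pre, post) per company
p <- ggplot(df_out, aes(x = directed_to_whom, y = entity_share,
                        fill = treatment_announcement)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Company", y = "Share of mentions", fill = "Period") +
  theme_bw()
print(p)
```

Without the explicit factor levels, ggplot2 would order the periods alphabetically and place "post" before "pre".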

Data:

df <- structure(list(directed_to_whom = c("nike", "nike", "amazon", "walmart", 
                                          "walmart", "walmart","nike", "amazon"), 
                     treatment_announcement = c("pre", "pre", "pre", "pre", 
                                                "post", "post", "post", "post")), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L))
#> 1 nike             pre                   
#> 2 nike             pre                   
#> 3 amazon           pre                   
#> 4 walmart          pre                   
#> 5 walmart          post                  
#> 6 walmart          post                  
#> 7 nike             post                  
#> 8 amazon           post
