统计跨多列的发生次数,并按年份分组

huangapple go评论65阅读模式
英文:

Count occurrences across multiple columns and group by year

问题

我有一个电影数据集,其中有一列表示年份,另外三列表示不同的电影类型。

以下是您提供的示例数据:

genre_structure<-structure(
  list(
    year = c(
      "2008",
      "2003",
      "2010",
      "2001",
      "2002",
      "1999",
      "1980",
      "2020",
      "1977",
      "1991",
      "1954",
      "2022",
      "1962",
      "2000",
      "1994",
      "2019",
      "2019",
      "1981",
      "2012",
      "2003"
    ),
    genre1 = c(
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action",
      "Action"
    ),
    genre2 = c(
      "Crime",
      "Adventure",
      "Adventure",
      "Adventure",
      "Adventure",
      "SciFi",
      "Adventure",
      "Drama",
      "Adventure",
      "SciFi",
      "Drama",
      "Drama",
      "Drama",
      "Adventure",
      "Crime",
      "Adventure",
      "Adventure",
      "Adventure",
      "Drama",
      "Drama"
    ),
    genre3 = c(
      "Drama",
      "Drama",
      "SciFi",
      "Drama",
      "Drama",
      "",
      "Fantasy",
      "",
      "Fantasy",
      "",
      "",
      "Mystery",
      "Mystery",
      "Drama",
      "Drama",
      "Crime",
      "Drama",
      "",
      "",
      "Mystery"
    )
  ),
  row.names = c(NA,-20L),
  class = "data.frame"
)

我正在尝试为每一年计算所有三种电影类型的数量。期望的结果如下(示例):

genre | year | count
Action | 2008 | 1
Comedy | 2008 | 3
Drama | 2008 | 4
...

我尝试了以下代码:

genre_years_test<-genre_structure %>%
  group_by(genre1, genre2, genre3, year) %>%
  summarise(total=n(), .groups = "drop")

但是它在每次在该年份发布新的类型时重复年份。

英文:

I have a movies dataset which has a column for year and three for genres.

Here's an example:

genre_structure&lt;-structure(
list(
year = c(
&quot;2008&quot;,
&quot;2003&quot;,
&quot;2010&quot;,
&quot;2001&quot;,
&quot;2002&quot;,
&quot;1999&quot;,
&quot;1980&quot;,
&quot;2020&quot;,
&quot;1977&quot;,
&quot;1991&quot;,
&quot;1954&quot;,
&quot;2022&quot;,
&quot;1962&quot;,
&quot;2000&quot;,
&quot;1994&quot;,
&quot;2019&quot;,
&quot;2019&quot;,
&quot;1981&quot;,
&quot;2012&quot;,
&quot;2003&quot;
),
genre1 = c(
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;,
&quot;Action&quot;
),
genre2 = c(
&quot;Crime&quot;,
&quot;Adventure&quot;,
&quot;Adventure&quot;,
&quot;Adventure&quot;,
&quot;Adventure&quot;,
&quot;SciFi&quot;,
&quot;Adventure&quot;,
&quot;Drama&quot;,
&quot;Adventure&quot;,
&quot;SciFi&quot;,
&quot;Drama&quot;,
&quot;Drama&quot;,
&quot;Drama&quot;,
&quot;Adventure&quot;,
&quot;Crime&quot;,
&quot;Adventure&quot;,
&quot;Adventure&quot;,
&quot;Adventure&quot;,
&quot;Drama&quot;,
&quot;Drama&quot;
),
genre3 = c(
&quot;Drama&quot;,
&quot;Drama&quot;,
&quot;SciFi&quot;,
&quot;Drama&quot;,
&quot;Drama&quot;,
&quot;&quot;,
&quot;Fantasy&quot;,
&quot;&quot;,
&quot;Fantasy&quot;,
&quot;&quot;,
&quot;&quot;,
&quot;Mystery&quot;,
&quot;Mystery&quot;,
&quot;Drama&quot;,
&quot;Drama&quot;,
&quot;Crime&quot;,
&quot;Drama&quot;,
&quot;&quot;,
&quot;&quot;,
&quot;Mystery&quot;
)
),
row.names = c(NA,-20L),
class =  &quot;data.frame&quot;
)

I am trying to count all 3 genres for each year. Expected result is (example):

genre | year| count
Action |2008| 1
Comedy | 2008 | 3
Drama | 2008 | 4
...

I tried:

genre_years_test&lt;-genre_structure %&gt;% 
group_by(genre1, genre2, genre3, year) %&gt;% 
summarise(total=n(), .groups = &quot;drop&quot;)

But it is repeating the years each time a new genre is released in that year.

答案1

得分: 3

我们可以重塑为 'long' 形式,并获取 'count'。

library(dplyr)
library(tidyr)
genre_structure %>%
  pivot_longer(cols = -year, values_to = 'genre') %>%
  count(year, genre, name = 'count')
英文:

We may reshape to 'long' and get the count

library(dplyr)
library(tidyr)
genre_structure %&gt;% 
pivot_longer(cols = -year, values_to = &#39;genre&#39;) %&gt;%
count(year, genre, name = &#39;count&#39;)

答案2

得分: 3

以下是代码的中文翻译:

base中的解决方案:

# 仅用于娱乐,以下是一个在“基础”中的解决方案:
subset(as.data.frame(
        table(cbind(genre_structure[1], stack(genre_structure[-1]))[-3])
                    ), Freq != 0)

data.table中的解决方案:

# 使用“data.table”:
library(data.table)

melt(setDT(genre_structure), id.vars = c("year"),
                             variable.name = "genre")[, list(Freq = .N), 
                                                       .(year, value)]

希望这些翻译对您有帮助。

英文:

Just for laffs, here's a solution in base:

subset(as.data.frame(
table(cbind(genre_structure[1], stack(genre_structure[-1]))[-3])
), Freq != 0)

And in data.table:

library(data.table)
melt(setDT(genre_structure), id.vars = c(&quot;year&quot;),
variable.name = &quot;genre&quot;)[, list(Freq =.N), 
.(year, value)]

huangapple
  • 本文由 发表于 2023年2月7日 04:23:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/75366193.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定