R: Counting Observations in Bins

# Question

I have the following dataset in R:

    library(dplyr)
    
    set.seed(123)
    n <- 100
    country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
    gender <- sample(c("M", "F"), n, replace = TRUE)
    age <- sample(18:100, n, replace = TRUE)
    height <- runif(n, min = 150, max = 180)
    owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
    
    df <- data.frame(country, gender, age, height, owns_bicycle)


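**My problem:**

- First, I want to split height into 5 equal-sized groups based on its values (e.g. 0%-20%, 20%-40%, etc.).
- Next, I want to split age into 5 equal-sized groups in the same way (e.g. 0%-20%, 20%-40%, etc.).
- Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.
- This type of analysis would let me make statements like: "if you are a man aged 30-35, between 150 and 155 cm tall, and from the USA, there is a 43% chance you own a bicycle".
- To clarify: each person should be in exactly one group, and each group should have roughly the same number of people.

Here is the R code I wrote: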

    final = df %>%
      mutate(height_group = cut(height, breaks = 5),
             age_group = cut(age, breaks = 5)) %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(count = n(),
                percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 

**Can someone please tell me if I have done this correctly?**

    > final
    # A tibble: 67 x 6
    # Groups:   country, gender, height_group [29]
       country gender height_group age_group   count percent_own_bicycle
       <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
     1 Canada  F      (151,157]    (34.2,50.4]     4                  25
     2 Canada  F      (151,157]    (66.6,82.8]     2                   0
     3 Canada  F      (157,162]    (17.9,34.2]     2                   0
     4 Canada  F      (157,162]    (34.2,50.4]     1                 100
     5 Canada  F      (157,162]    (50.4,66.6]     2                   0
     6 Canada  F      (157,162]    (82.8,99.1]     1                   0
     7 Canada  F      (162,168]    (82.8,99.1]     2                  50
     8 Canada  F      (168,174]    (17.9,34.2]     3                   0
     9 Canada  F      (168,174]    (34.2,50.4]     1                 100
    10 Canada  F      (174,180]    (17.9,34.2]     1                   0
    # ... with 57 more rows
    # i Use `print(n = ...)` to see more rows

Thanks!


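# Answer 1
**Score**: 2

Be careful when asking questions of the type "is my code correct".

Having said that, your code looks good overall. However, `cut()` with an integer `breaks` argument does not do what you want here. From its help page:

> When `breaks` is specified as a single number, the range of the data is divided into `breaks` pieces of equal length

So it splits your data by its range, not by its distribution: the bins have equal width, not equal counts. Use `quantile()` to find the break points instead. Compare: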

```R
> cut(df$height, 5) %>% levels()
[1] "(151,157]" "(157,162]" "(162,168]" "(168,174]" "(174,180]"
> cut(df$height, breaks = quantile(df$height, seq(0, 1, 0.2))) %>% levels()
[1] "(151,156]" "(156,160]" "(160,167]" "(167,174]" "(174,180]"
> cut(df$age, 5) %>% levels()
[1] "(17.9,34.2]" "(34.2,50.4]" "(50.4,66.6]" "(66.6,82.8]" "(82.8,99.1]"
> cut(df$age, breaks = quantile(df$age, seq(0, 1, 0.2))) %>% levels()
[1] "(18,31]"     "(31,45.2]"   "(45.2,62.4]" "(62.4,78.4]" "(78.4,99]"
```

Applying it to your code:

```R
df %>%
  mutate(height_group = cut(height, breaks = quantile(height, seq(0, 1, 0.2))),
         age_group = cut(age, breaks = quantile(age, seq(0, 1, 0.2)))) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 

# A tibble: 75 × 6
# Groups:   country, gender, height_group [31]
   country gender height_group age_group   count percent_own_bicycle
   <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
 1 Canada  F      (151,156]    (31,45.2]       3                33.3
 2 Canada  F      (151,156]    (62.4,78.4]     1                 0  
 3 Canada  F      (151,156]    (78.4,99]       1                 0  
 4 Canada  F      (156,160]    (18,31]         2                 0  
 5 Canada  F      (156,160]    (31,45.2]       1               100  
 6 Canada  F      (156,160]    (62.4,78.4]     1                 0  
 7 Canada  F      (156,160]    (78.4,99]       1                 0  
 8 Canada  F      (160,167]    (45.2,62.4]     1                 0  
 9 Canada  F      (160,167]    (78.4,99]       1                 0  
10 Canada  F      (167,174]    (18,31]         3                 0  
# ℹ 65 more rows
# ℹ Use `print(n = ...)` to see more rows
```
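
One caveat with quantile-based breaks is worth checking. By default `cut()` builds intervals that are open on the left, so any observation equal to the smallest break (the minimum of the data) falls outside every bin and becomes `NA`. The levels shown above start with `(151,156]` and `(18,31]`, which suggests the rows at the minimum height and minimum age are affected. Below is a minimal sketch on a made-up vector `x` (not data from the question) showing the effect and the `include.lowest = TRUE` argument that closes the first interval:

```r
# Toy data for illustration only; not taken from the question.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Quintile break points: the 0%, 20%, ..., 100% quantiles of x.
brks <- quantile(x, seq(0, 1, 0.2))

# Default: the first interval is (min, q20], so the minimum itself is excluded.
default_bins <- cut(x, breaks = brks)

# include.lowest = TRUE turns the first interval into [min, q20].
closed_bins <- cut(x, breaks = brks, include.lowest = TRUE)

sum(is.na(default_bins))  # 1: the minimum is not assigned to any bin
sum(is.na(closed_bins))   # 0: every value is assigned
table(closed_bins)        # bin sizes
```

If the same argument is added to both `cut()` calls in the pipeline, the printed counts and interval labels may differ slightly from the output shown in this answer.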
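
If interval labels are not needed and the only requirement is five groups of roughly equal size, `dplyr::ntile()` may be a simpler alternative. This is a sketch under that assumption, not part of the original answer: `ntile()` assigns bucket numbers 1 to 5 by rank, so the buckets are equal in size, but tied values (for example two people of the same age) can land in different buckets. The names `binned` and `result` are chosen here for illustration.

```r
library(dplyr)

# Recreate the data set from the question.
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# ntile() ranks the values and splits the ranks into 5 buckets of equal size.
binned <- df %>%
  mutate(height_group = ntile(height, 5),
         age_group = ntile(age, 5))

table(binned$height_group)  # number of people per height bucket
table(binned$age_group)     # number of people per age bucket

# Same summary as in the question; .groups = "drop" returns an ungrouped tibble.
result <- binned %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")
```

The trade-off is that bucket numbers do not say which ages or heights they cover; the `cut()` plus `quantile()` approach keeps readable interval labels and never separates tied values.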

# Answer 2
**Score**: 0

Using the logic from the answer by @Ricardo Semião e Castro, here is a solution based on the data.table package:

    library(data.table)

    dt = data.table(df)
    data_table_result = dt[, `:=`(height_group = cut(height, breaks = quantile(height, seq(0, 1, 0.2))),
                                  age_group = cut(age, breaks = quantile(age, seq(0, 1, 0.2))))][
                           , .(count = .N,
                               percent_own_bicycle = mean(owns_bicycle == "Yes") * 100),
                           by = .(country, gender, height_group, age_group)]
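
Two details of the data.table version may be worth knowing. `:=` modifies `dt` in place, so the two group columns remain on `dt` after the call, and `by` returns groups in order of first appearance. The sketch below is a variation, not the answer's own code: it splits the chained call into two steps, adds `include.lowest = TRUE` so the minimum values are not turned into `NA`, and uses `keyby` to sort the result by the grouping columns. The helper name `probs` is introduced here for illustration.

```r
library(data.table)

# Recreate the data set from the question.
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

probs <- seq(0, 1, 0.2)   # quintile probabilities
dt <- as.data.table(df)

# Step 1: add the bin columns to dt by reference.
dt[, `:=`(
  height_group = cut(height, breaks = quantile(height, probs), include.lowest = TRUE),
  age_group = cut(age, breaks = quantile(age, probs), include.lowest = TRUE)
)]

# Step 2: summarise per group; keyby sorts the output by the grouping columns.
data_table_result <- dt[, .(count = .N,
                            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100),
                        keyby = .(country, gender, height_group, age_group)]
```

Because of `include.lowest = TRUE`, the counts here can differ slightly from those of the dplyr output shown in the first answer.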

huangapple
  • Posted on June 15, 2023, 09:26:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76478509.html