June 15, 2023, 09:26:48

R: Counting Observations in Bins

Question

I have the following dataset in R:

    library(dplyr)
    
    set.seed(123)
    n <- 100
    country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
    gender <- sample(c("M", "F"), n, replace = TRUE)
    age <- sample(18:100, n, replace = TRUE)
    height <- runif(n, min = 150, max = 180)
    owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
    
    df <- data.frame(country, gender, age, height, owns_bicycle)
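
**My problem:**

- First, I want to break height into 5 equal-sized groups by value (e.g. 0%-20%, 20%-40%, etc.).
- Next, I want to break age into 5 equal-sized groups by value (e.g. 0%-20%, 20%-40%, etc.).
- Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.
- As a result, this type of analysis would let me say things like "if you are a man between ages 30-35, between 150-155 cm tall and from the USA, there is a 43% chance you own a bicycle".
- Just to clarify: each person should be in exactly one group, and each group should have roughly the same number of people.

Here is the R code that I wrote:
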
    final = df %>%
      mutate(height_group = cut(height, breaks = 5),
             age_group = cut(age, breaks = 5)) %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(count = n(),
                percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 

**Can someone please tell me if I have done this correctly?**

    > final
    # A tibble: 67 x 6
    # Groups:   country, gender, height_group [29]
       country gender height_group age_group   count percent_own_bicycle
       <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
     1 Canada  F      (151,157]    (34.2,50.4]     4                  25
     2 Canada  F      (151,157]    (66.6,82.8]     2                   0
     3 Canada  F      (157,162]    (17.9,34.2]     2                   0
     4 Canada  F      (157,162]    (34.2,50.4]     1                 100
     5 Canada  F      (157,162]    (50.4,66.6]     2                   0
     6 Canada  F      (157,162]    (82.8,99.1]     1                   0
     7 Canada  F      (162,168]    (82.8,99.1]     2                  50
     8 Canada  F      (168,174]    (17.9,34.2]     3                   0
     9 Canada  F      (168,174]    (34.2,50.4]     1                 100
    10 Canada  F      (174,180]    (17.9,34.2]     1                   0
    # ... with 57 more rows
    # i Use `print(n = ...)` to see more rows

Thanks!
# Answer 1
**Score**: 2
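Be careful when asking questions of the type "is my code correct".

Having said that, your code looks fine! However, `cut()` with an integer `breaks` argument is probably not what you want. From its help page:

> When `breaks` is specified as a single number, the range of the data is divided into `breaks` pieces of equal length

So it does not split your data by its distribution, only by its range, and the resulting groups can hold very different numbers of people. To get groups of roughly equal size, use `quantile()` to find the break points. See the difference:
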
```R
> cut(df$height, 5) %>% levels()
[1] "(151,157]" "(157,162]" "(162,168]" "(168,174]" "(174,180]"
> cut(df$height, breaks = quantile(df$height, seq(0, 1, 0.2))) %>% levels()
[1] "(151,156]" "(156,160]" "(160,167]" "(167,174]" "(174,180]"

> cut(df$age, 5) %>% levels()
[1] "(17.9,34.2]" "(34.2,50.4]" "(50.4,66.6]" "(66.6,82.8]" "(82.8,99.1]"
> cut(df$age, breaks = quantile(df$age, seq(0, 1, 0.2))) %>% levels()
[1] "(18,31]"     "(31,45.2]"   "(45.2,62.4]" "(62.4,78.4]" "(78.4,99]"
```

Applying it to your code:

```R
df %>%
  mutate(height_group = cut(height, breaks = quantile(height, seq(0, 1, 0.2))),
         age_group = cut(age, breaks = quantile(age, seq(0, 1, 0.2)))) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 
# A tibble: 75 × 6
# Groups:   country, gender, height_group [31]
   country gender height_group age_group   count percent_own_bicycle
   <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
 1 Canada  F      (151,156]    (31,45.2]       3                33.3
 2 Canada  F      (151,156]    (62.4,78.4]     1                 0  
 3 Canada  F      (151,156]    (78.4,99]       1                 0  
 4 Canada  F      (156,160]    (18,31]         2                 0  
 5 Canada  F      (156,160]    (31,45.2]       1               100  
 6 Canada  F      (156,160]    (62.4,78.4]     1                 0  
 7 Canada  F      (156,160]    (78.4,99]       1                 0  
 8 Canada  F      (160,167]    (45.2,62.4]     1                 0  
 9 Canada  F      (160,167]    (78.4,99]       1                 0  
10 Canada  F      (167,174]    (18,31]         3                 0  
# ℹ 65 more rows
# ℹ Use `print(n = ...)` to see more rows
```
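
One detail worth checking with quantile-based breaks: by default `cut()` builds intervals that are open on the left, so the observation equal to the smallest break (the minimum) falls outside the first interval and is binned as `NA`. The sketch below is an addition to the answer rather than part of it. It rebuilds the question's data, counts the dropped observation, and passes `include.lowest = TRUE` to keep it. The names `probs`, `n_dropped` and `binned` are only illustrative.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

probs <- seq(0, 1, 0.2)  # 0%, 20%, ..., 100%

# Default cut(): the minimum height is not inside (min, q20], so it becomes NA
n_dropped <- sum(is.na(cut(df$height, breaks = quantile(df$height, probs))))
n_dropped

# include.lowest = TRUE closes the first interval on the left: [min, q20]
binned <- df %>%
  mutate(
    height_group = cut(height, breaks = quantile(height, probs), include.lowest = TRUE),
    age_group    = cut(age,    breaks = quantile(age, probs),    include.lowest = TRUE)
  )

# Check the bin sizes: each should be close to n / 5
# (ties in the integer ages can make the age bins slightly uneven)
table(binned$height_group)
table(binned$age_group)
```

If the interval labels are not needed, `dplyr::ntile(height, 5)` is another way to get five roughly equal-sized groups; it returns bucket numbers 1 to 5 instead of a factor of intervals.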


# Answer 2
**Score**: 0

Using the logic given in the answer by @Ricardo Semião e Castro, here is a solution based on the data.table library:

    library(data.table)
    dt = data.table(df)
    data_table_result = dt[, `:=`(height_group = cut(height, breaks = quantile(height, seq(0, 1, 0.2))),
                  age_group = cut(age, breaks = quantile(age, seq(0, 1, 0.2))))][
                      , .(count = .N,
                          percent_own_bicycle = mean(owns_bicycle == "Yes") * 100),
                      by = .(country, gender, height_group, age_group)]
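
Two behaviours of the data.table version are easy to miss: `:=` adds the group columns to `dt` by reference (the original `df` is left unchanged), and `by` returns groups in order of first appearance rather than sorted. The following is a hedged variant, not the answer author's code: it swaps `by` for `keyby` so the rows come back sorted, much like the dplyr output, and adds `include.lowest = TRUE` so the minimum values are not binned as `NA`. The name `probs` is only illustrative.

```r
library(data.table)

# Same simulated data as in the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

dt <- as.data.table(df)  # df itself is not modified by the := below
probs <- seq(0, 1, 0.2)

# Add the quantile-based groups to dt in place
# (include.lowest = TRUE is an addition to the original answer)
dt[, `:=`(
  height_group = cut(height, breaks = quantile(height, probs), include.lowest = TRUE),
  age_group    = cut(age,    breaks = quantile(age, probs),    include.lowest = TRUE)
)]

# keyby sorts the result by the grouping columns; by would keep first-appearance order
data_table_result <- dt[
  , .(count = .N,
      percent_own_bicycle = mean(owns_bicycle == "Yes") * 100),
  keyby = .(country, gender, height_group, age_group)
]

head(data_table_result)
```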
