June 15, 2023 06:14:25

R: Counting Number of Observations in Each Percentile

Question

I am working with the R programming language.

I have the following dataset:

library(dplyr)
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

My Problem:

First, I want to split height into 3 equal-sized groups by value (e.g. 0%-33%, 33%-66%, 66%-99%).
Next, I want to split age into 5 equal-sized groups by value (e.g. 0%-20%, 20%-40%, etc.).
Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.

This type of analysis would let me make statements such as: "if you are a man between ages 30-35, between 150-155 cm tall and from the USA, there is a 43% chance you own a bicycle".

Here is my current attempt to do this:

library(dplyr)
df %>%
  mutate(height_group = ntile(height, 3),
         age_group = ntile(age, 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            min_height = min(height),
            max_height = max(height),
            min_age = min(age),
            max_age = max(age),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) %>%
  mutate(height_range = paste0(round(min_height, 1), "-", round(max_height, 1)),
         age_range = paste0(min_age, "-", max_age)) %>%
  select(-min_height, -max_height, -min_age, -max_age)

When I look at the results:

# A tibble: 62 x 8
# Groups:   country, gender, height_group [18]
   country gender height_group age_group count percent_own_bicycle height_range age_range
   <chr>   <chr>         <int>     <int> <int>               <dbl> <chr>        <chr>    
 1 Canada  F                 1         1     2                 0   157.2-158.5  23-30    
 2 Canada  F                 1         2     4                25   150.8-154.1  37-43    
 3 Canada  F                 1         4     2                 0   154.4-156.9  66-72    
 4 Canada  F                 1         5     1                 0   154.6-154.6  80-80    
 5 Canada  F                 2         1     1                 0   169.3-169.3  23-23

I see height_group = 1 having multiple ranges, e.g. 157.2-158.5 and 150.8-154.1. How is this possible? I would have thought that height_group = 1 could only have a single range.

Can someone please show me what I am doing wrong?

Thanks!

Answer 1

Score: 1

Yes, your attempt looks correct. You have divided the height variable into three equal-sized groups using the ntile() function, and the age variable into five equal-sized groups. Then, you grouped the data by country, gender, height_group, and age_group and calculated the count, minimum height, maximum height, minimum age, maximum age, and the percentage of individuals who own a bicycle within each group.
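
A note on why height_group = 1 shows several ranges: ntile() assigns the groups over the whole data frame, but min(height) and max(height) in the summarise() call are evaluated within each country/gender/height_group/age_group cell, so every cell reports only the range of the rows it happens to contain. Below is a minimal sketch of one way to label each bin with its overall range instead, assuming that is what is wanted. The names binned and result are illustrative and do not come from the original post.

```r
library(dplyr)

# Recreate the data from the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Assign the bins, then compute each bin's range over ALL rows in that bin,
# before the data is split by country and gender
binned <- df %>%
  mutate(height_group = ntile(height, 3),
         age_group = ntile(age, 5)) %>%
  group_by(height_group) %>%
  mutate(height_range = paste0(round(min(height), 1), "-", round(max(height), 1))) %>%
  group_by(age_group) %>%
  mutate(age_range = paste0(min(age), "-", max(age))) %>%
  ungroup()

# The range labels are now constant within a bin, so they can be grouping columns
result <- binned %>%
  group_by(country, gender, height_group, height_range, age_group, age_range) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")

result
```

With this layout every row for a given height_group carries the same height_range. Since ntile() breaks ties by position, two people with the same age can land in neighbouring age groups, so adjacent age ranges may share an endpoint.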

Answer 2

Score: -3

OP here - this is my second attempt:

final = df %>%
  mutate(height_group = cut(height, breaks = 3),
         age_group = cut(age, breaks = 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100)

I think the results look more "consistent" now? (e.g. no multiple ranges)

`summarise()` has grouped output by 'country', 'gender', 'height_group'. You can override using the `.groups` argument.
# A tibble: 60 x 6
# Groups:   country, gender, height_group [18]
   country gender height_group age_group   count percent_own_bicycle
   <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
 1 Canada  F      (151,161]    (17.9,34.2]     2                   0
 2 Canada  F      (151,161]    (34.2,50.4]     5                  40
 3 Canada  F      (151,161]    (50.4,66.6]     1                   0
 4 Canada  F      (151,161]    (66.6,82.8]     2                   0
 5 Canada  F      (151,161]    (82.8,99.1]     1                   0
 6 Canada  F      (161,170]    (17.9,34.2]     1                   0
 7 Canada  F      (161,170]    (34.2,50.4]     1                 100
 8 Canada  F      (161,170]    (50.4,66.6]     1                   0
 9 Canada  F      (161,170]    (82.8,99.1]     2                  50
10 Canada  F      (170,180]    (17.9,34.2]     3                   0

Is this correct?
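
One caveat on this second attempt: cut(x, breaks = 3) splits the observed range of x into three intervals of equal width, so the groups are consistent in their labels but need not contain equal numbers of observations. If equal-sized (percentile) groups are the goal, as in the original question, one option is to pass quantile break points to cut(). The sketch below uses standalone illustration data rather than the df from the question, and the names equal_width, tercile_breaks and equal_count are made up for the example.

```r
# Standalone illustration data (not the same draws as df$height in the question)
set.seed(123)
height <- runif(100, min = 150, max = 180)

# Equal-width bins: the observed range is divided into 3 intervals of equal length
equal_width <- cut(height, breaks = 3)

# Equal-count bins: use the empirical terciles (0%, 33.3%, 66.7%, 100%) as break points;
# include.lowest = TRUE keeps the minimum value inside the first interval
tercile_breaks <- quantile(height, probs = seq(0, 1, length.out = 4))
equal_count <- cut(height, breaks = tercile_breaks, include.lowest = TRUE)

# Compare how many observations fall in each bin under the two schemes
table(equal_width)
table(equal_count)
```

The quantile-based version keeps the readable interval labels that cut() produces while giving groups whose sizes differ by at most one here. With heavily tied data the quantile break points could coincide, in which case cut() would raise an error about non-unique breaks.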
