如何基于数据集的子集获取正确的比例在R中

huangapple go评论62阅读模式
英文:

How to get right proportions based on a subset of the data set in R

问题

我要计算数据集中关于planschool type之间的比例。问题在于,我必须首先多次对数据集进行子集化。例如,在这种情况下,我需要对数据集进行子集化,以仅获取提供从level 0level 04的学校。我该如何做?

  • 编辑:

下面的方法有效,但我没有筛选出所有提供从level 0level 04的学校,我得到的是有其中之一的学校。对此有什么想法将不胜感激。

  • 数据描述:
# SCHOOL = 学校名称
# Q9 = 学校类型
# Q11 = 学校提供的级别
# Q40 = 与每个级别相关的学校提供的计划类型

注意:所有计划都与学校级别相关。因此,如果我不按每所学校提供的级别筛选,只计算这些列的比例,我的结果将具有误导性,因为每所学校可以提供不同的级别选项(所有级别、1级、2级等)。

我的尝试:


### D) 筛选只提供至Level 04的学校 ###

##################################### Quest40_2 ##########################
### 检查向量的名称:

unique(quest40_2$Q11)

unwanted_4_2 <- quest40_2 %>%
  filter(Q11 %in% c('level05','level06')) ### 筛选不想要的学校级别

### 创建一个带有不想要的学校名称的向量:

filter_vec_4_2 <- unique(unwanted_4_2$SCHOOL) ### 获取唯一的名称(只有一所学校有一个名称)

### 将原始数据框赋给一个虚拟数据框:

filtered_df_4_2 <- quest40_2 

### 循环遍历不想要的学校名称以删除它们: 

for (i in 1 :length(filter_vec_4_2)) {
  
  filtered_df_4_2 <- filtered_df_4_2[!filtered_df_4_2$SCHOOL == filter_vec_4_2[i],]
  
}
  • 我需要计算每种plan按每种school type出现的次数(并获得其比例)。问题是,每所学校可以有多个plan类型。因此,如果我们要按每种school type计算plan类型的比例,我们不能按n来除
b <- filtered_df_4_2 %>% drop_na(Q40) %>%
  count(Q40, Q9) %>%
  group_by(Q9) 
  • 这使我想到了:
### 按每种'curriculo'计算唯一的学校数:

b2 <- b2 %>%
  select(SCHOOL, Q9) %>%
  unique() %>% count(Q9)

## 合并数据框并获取每种'planejamento'类型的学校比例
## 在课程类型内

c <- b %>% full_join(b2, by = 'Q9') %>%
     mutate(prop = round((n.x/n.y *100), digits = 2)) %>%
     select(-n.x, -n.y)
  • Q1 = 我认为我没有正确筛选,因为我不是专门筛选提供所有级别从01到04的学校。我想我在使用“或”而不是“与”。

  • Q2 = 有没有一种方法可以避免使用tidyverse中的循环?提前致以感谢。

  • 最终,我正在尝试开始=>学校是否为他们提供的级别制定计划或不制定计划,以及学校类型如何影响此事?(以后可以建模,但现在我只需要百分比)

  • 数据:

dput(quest40_2)              
structure(list(SCHOOL = structure(c("School1", "School1", "School1", 
"School1", "School1", "School1", "School1", "School2", "School2", 
"School2", "School2", "School2", "School2", "School2", "School3", 
"School3", "School3", "School3", "School3", "School3", "School3", 
"School3", "School3", "School4", "School4", "School4", "School4", 
"School4", "School4", "School5", "School5", "School5", "School5", 
"School5", "School5", "School6", "School6", "School6", "School6", 
"School6", "School7", "School7", "School7", "School7", "School7", 
"School7", "School8", "School8", "School8", "School8", "School9", 
"School9", "School9", "School9", "School9", "School10", "School10", 
"School10", "School10", "School10", "School11", "School11", "School11", 
"School11", "School11", "School11", "School11", "School11", "School11", 
"School12", "School12", "School12", "School12", "School12", "School12", 
"School12", "School13", "School13", "School13", "School13", "School13", 
"School13", "School13", "School13", "School13", "School13", "School14", 
"School14", "School14", "School15", "School15", "School15", "School15", 
"School15", "School15", "School16", "School16", "School16", "School16", 
"School16", "School16", "School16", "School17", "School17", "School17", 
"School17", "School17", "School18", "School18", "School18", "School18", 
"School18", "School19", "School19", "School19", "School19", "School19", 
"School19", "School20", "School20", "School20", "School21", "School

<details>
<summary>英文:</summary>

I want to calculate the proportion of ```plan``` with respect to ```school type``` in the dataset below. Thing is, I have to first subset the dataset a couple of times. In this case, for example, I need to subset the dataset so that I only get schools which offer from ```level 0``` up to ```level 04```. **How can I do that?** 

* edit: 

The way below *is working*, but I&#39;m not filtering all schools that have all levels from ```level 0``` up to ```level 04``` , I&#39;m getting the schools that have either of them. Ideas to that would be much appreciated. 

* data description:

SCHOOL = name of the school

Q9 = type of school

Q11 = levels that the schools offer

Q40 = types of planning that the school offers related to each level.



Note: All plannings are related to a school level. Hence, If I don&#39;t filter by the levels that each school offers and just calculate this columns&#39; prop, my results will be misleading because each school can offer different level options (all of them, 1, 2 of them, etc)

my attempt:

D) filter schools which only offer UP TO Level 04

##################################### Quest40_2 ##########################

check vector's names:

unique(quest40_2$Q11)

unwanted_4_2 <- quest40_2 %>%
filter(Q11 %in% c('level05','level06')) ### FILTER UNWANTED SCHOOL LEVELS

create a vector with UNWANTED SCHOOL NAMES:

filter_vec_4_2 <- unique(unwanted_4_2$SCHOOL) ### get unique names (only school has 1 name)

assign the original dataframe to a dummy data frame:

filtered_df_4_2 <- quest40_2

loop over unwanted schools' names to remove them:

for (i in 1 :length(filter_vec_4_2)) {

filtered_df_4_2 <- filtered_df_4_2[!filtered_df_4_2$SCHOOL == filter_vec_4_2[i],]

}


* I need to count how many times EACH &#39;plan&#39; occurs by each type of &#39;school&#39; (and get its proportion). Problem = each school can have more than one &#39;plan&#39; type. Since each school can have more than one ```plan type```, if we want to count the proportion of
 ```plan``` types according to each  ```school type```, we cannot divide by n

b <- filtered_df_4_2 %>% drop_na(Q40) %>%
count(Q40, Q9) %>%
group_by(Q9)


* Which leaves me to:
 

counting the unique schools by each type of 'curriculo':

b2 <- b2 %>%
select(SCHOOL, Q9) %>%
unique() %>% count(Q9)

join dataframes and get the proportion of schools that have each type of 'planejamento'

within the curriculum types

c <- b %>% full_join(b2, by = 'Q9') %>%
mutate(prop = round((n.x/n.y *100), digits = 2)) %>%
select(-n.x, -n.y)


* *Q1* = I don&#39;t think I&#39;m filtering it correctely since I&#39;m not exclusively filtering the schools that offer alll levels up to 01 to 04. I guess I&#39;m doing an &#39;or&#39; not an &#39;and&#39;

* *Q2* = Is there a way to avoid the loop with ```tidyverse``` ? Thanks in adv.

* Ultimately, I&#39;m trying to get the starts to =&gt; **&#39;do the schools plan their agenda for the levels they offer or not and how does the type of school impact this?&#39;** (later this can be modeled, but now I just need percentages)

* data:

dput(quest40_2)
structure(list(SCHOOL = structure(c("School1", "School1", "School1",
"School1", "School1", "School1", "School1", "School2", "School2",
"School2", "School2", "School2", "School2", "School2", "School3",
"School3", "School3", "School3", "School3", "School3", "School3",
"School3", "School3", "School4", "School4", "School4", "School4",
"School4", "School4", "School5", "School5", "School5", "School5",
"School5", "School5", "School6", "School6", "School6", "School6",
"School6", "School7", "School7", "School7", "School7", "School7",
"School7", "School8", "School8", "School8", "School8", "School9",
"School9", "School9", "School9", "School9", "School10", "School10",
"School10", "School10", "School10", "School11", "School11", "School11",
"School11", "School11", "School11", "School11", "School11", "School11",
"School12", "School12", "School12", "School12", "School12", "School12",
"School12", "School13", "School13", "School13", "School13", "School13",
"School13", "School13", "School13", "School13", "School13", "School14",
"School14", "School14", "School15", "School15", "School15", "School15",
"School15", "School15", "School16", "School16", "School16", "School16",
"School16", "School16", "School16", "School17", "School17", "School17",
"School17", "School17", "School18", "School18", "School18", "School18",
"School18", "School19", "School19", "School19", "School19", "School19",
"School19", "School20", "School20", "School20", "School21", "School21",
"School21", "School21", "School21", "School22", "School22", "School22",
"School22", "School22", "School23", "School23", "School23", "School23",
"School23", "School23", "School24", "School24", "School24", "School24",
"School24", "School24", "School25", "School25", "School25", "School25",
"School25", "School26", "School26", "School26", "School26", "School26",
"School26", "School26", "School26", "School26", "School27", "School27",
"School27", "School27", "School27", "School27", "School27", "School28",
"School28", "School28", "School28", "School28", "School28", "School28",
"School28", "School29", "School29", "School29", "School29", "School29",
"School29", "School30", "School30", "School30", "School30", "School30",
"School30", "School30", "School30", "School30", "School30", "School30",
"School31", "School31", "School31", "School31", "School31", "School31",
"School31", "School31", "School31", "School31", "School31", "School32",
"School32", "School32", "School32", "School32", "School32", "School32",
"School32", "School32", "School32", "School32", "School33", "School33",
"School33", "School33", "School33", "School33", "School33", "School34",
"School34", "School34", "School34", "School34", "School34", "School34",
"School34", "School35", "School35", "School35", "School35", "School36",
"School36", "School36", "School36", "School36", "School37", "School37",
"School37", "School37", "School37", "School37", "School37", "School37",
"School38", "School38", "School38", "School38", "School39", "School39",
"School39", "School39", "School39", "School39"), class = c("glue",
"character")), Q9 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("typeA", "typeB", "typeC", "typeD"
), class = "factor"), Q11 = structure(c(3L, 7L, 4L, 2L, NA, NA,
NA, 7L, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA,
NA, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L,
1L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, NA, 4L, 2L, 1L, 5L,
NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, NA, 7L,
4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA,
5L, 6L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, NA,
7L, 4L, 2L, 1L, NA, 7L, 4L, 2L, 1L, NA, 4L, 2L, 1L, NA, NA, NA,
2L, 1L, NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L,
1L, NA, NA, 4L, 2L, 1L, NA, NA, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L,
2L, 1L, 5L, NA, NA, NA, NA, 2L, 1L, 5L, NA, NA, NA, NA, 3L, 7L,
4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, 7L, 4L, 2L, 1L,
5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA,
NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L,
1L, 5L, 6L, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA, 3L, 7L, 4L, NA,
7L, 4L, 2L, NA, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, 4L, 2L, 1L,
NA, 7L, 4L, 2L, 1L, 5L, NA), .Label = c("level04", "level03",
"level0", "level02", "level05", "level06", "level01"), class = "factor"),
Q40 = structure(c(NA, NA, NA, NA, 2L, 6L, 5L, NA, NA, NA,
NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA,
NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 1L, NA,
NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, NA, 2L, 6L, 4L,
3L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L,
6L, 5L, NA, NA, NA, NA, 2L, NA, NA, NA, NA, 1L, NA, NA, NA,
2L, 6L, 5L, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
1L, NA, NA, NA, NA, 2L, 1L, NA, NA, NA, 2L, 6L, 5L, NA, NA,
NA, NA, 1L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA,
2L, 6L, 5L, 4L, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA,
NA, 2L, 1L, NA, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA,
NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA,
NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, 2L, NA, NA, NA, 2L, 6L,
NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, 1L), .Label = c("none", "plan1_level0upto02", "plan5_level_05",
"plan4_level_05", "plan3_level_04", "plan2_level_03"), class = "factor")), row.names = c(NA,
-253L), class = c("tbl_df", "tbl", "data.frame"))


</details>


# 答案1
**得分**: 1

以下是您需要的内容的中文翻译:

这可能是您需要的或为您提供一个良好的起点。它显示了每种类型的计划在级别0-4之间的数量,还显示了百分比。

```R
left_join(na.omit(quest40_2[, c("SCHOOL", "Q9", "Q11")]), 
          na.omit(quest40_2[, c("SCHOOL", "Q40")]), c("SCHOOL"), 
          multiple = "all") %>%
  group_by(Q9, Q40) %>%
  filter(sub(".*(\\d+)$", "\", Q11) <= 4) %>%
  summarize(n = n(), .groups = "drop") %>%
  mutate(percentage = n / sum(n) * 100) %>%
  print(n = Inf)
# 一个数据表:23 × 4
   Q9    Q40                    n percentage
   <fct> <fct>              <int>      <dbl>
 1 typeA none                   3      0.974
 2 typeA plan1_level0upto02     6      1.95 
 3 typeA plan5_level_05         4      1.30 
 4 typeA plan4_level_05         6      1.95 
 5 typeA plan3_level_04         6      1.95 
 6 typeA plan2_level_03         6      1.95 
 7 typeB none                  21      6.82 
 8 typeB plan1_level0upto02    47     15.3  
 9 typeB plan5_level_05         4      1.30 
10 typeB plan4_level_05        14      4.55 
11 typeB plan3_level_04        31     10.1  
12 typeB plan2_level_03        31     10.1  
13 typeC none                  32     10.4  
14 typeC plan1_level0upto02    21      6.82 
15 typeC plan4_level_05         4      1.30 
16 typeC plan3_level_04        15      4.87 
17 typeC plan2_level_03        18      5.84 
18 typeD none                   3      0.974
19 typeD plan1_level0upto02     8      2.60 
20 typeD plan5_level_05         8      2.60 
21 typeD plan4_level_05         8      2.60 
22 typeD plan3_level_04         4      1.30 
23 typeD plan2_level_03         8      2.60

请注意,代码部分已被保留为原文。

英文:

This might be what you need or give you a good start. It shows the number of plans per type with levels 0-4, also shows the percentages.

left_join(na.omit(quest40_2[, c(&quot;SCHOOL&quot;, &quot;Q9&quot;, &quot;Q11&quot;)]), 
          na.omit(quest40_2[, c(&quot;SCHOOL&quot;, &quot;Q40&quot;)]), c(&quot;SCHOOL&quot;), 
          multiple = &quot;all&quot;) %&gt;% 
  group_by(Q9, Q40) %&gt;% 
  filter(sub(&quot;.*(\\d+)$&quot;, &quot;\&quot;, Q11) &lt;= 4) %&gt;% 
  summarize(n = n(), .groups = &quot;drop&quot;) %&gt;% 
  mutate(percentage = n / sum(n) * 100) %&gt;% 
  print(n = Inf)
# A tibble: 23 &#215; 4
   Q9    Q40                    n percentage
   &lt;fct&gt; &lt;fct&gt;              &lt;int&gt;      &lt;dbl&gt;
 1 typeA none                   3      0.974
 2 typeA plan1_level0upto02     6      1.95 
 3 typeA plan5_level_05         4      1.30 
 4 typeA plan4_level_05         6      1.95 
 5 typeA plan3_level_04         6      1.95 
 6 typeA plan2_level_03         6      1.95 
 7 typeB none                  21      6.82 
 8 typeB plan1_level0upto02    47     15.3  
 9 typeB plan5_level_05         4      1.30 
10 typeB plan4_level_05        14      4.55 
11 typeB plan3_level_04        31     10.1  
12 typeB plan2_level_03        31     10.1  
13 typeC none                  32     10.4  
14 typeC plan1_level0upto02    21      6.82 
15 typeC plan4_level_05         4      1.30 
16 typeC plan3_level_04        15      4.87 
17 typeC plan2_level_03        18      5.84 
18 typeD none                   3      0.974
19 typeD plan1_level0upto02     8      2.60 
20 typeD plan5_level_05         8      2.60 
21 typeD plan4_level_05         8      2.60 
22 typeD plan3_level_04         4      1.30 
23 typeD plan2_level_03         8      2.60

huangapple
  • 本文由 发表于 2023年2月18日 02:48:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/75488236.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定