如何基于数据集的子集获取正确的比例在R中

huangapple go评论85阅读模式
英文:

How to get right proportions based on a subset of the data set in R

问题

我要计算数据集中关于planschool type之间的比例。问题在于,我必须首先多次对数据集进行子集化。例如,在这种情况下,我需要对数据集进行子集化,以仅获取提供从level 0level 04的学校。我该如何做?

  • 编辑:

下面的方法有效,但我没有筛选出所有提供从level 0level 04的学校,我得到的是有其中之一的学校。对此有什么想法将不胜感激。

  • 数据描述:
  1. # SCHOOL = 学校名称
  2. # Q9 = 学校类型
  3. # Q11 = 学校提供的级别
  4. # Q40 = 与每个级别相关的学校提供的计划类型

注意:所有计划都与学校级别相关。因此,如果我不按每所学校提供的级别筛选,只计算这些列的比例,我的结果将具有误导性,因为每所学校可以提供不同的级别选项(所有级别、1级、2级等)。

我的尝试:

  1. ### D) 筛选只提供至Level 04的学校 ###
  2. ##################################### Quest40_2 ##########################
  3. ### 检查向量的名称:
  4. unique(quest40_2$Q11)
  5. unwanted_4_2 <- quest40_2 %>%
  6. filter(Q11 %in% c('level05','level06')) ### 筛选不想要的学校级别
  7. ### 创建一个带有不想要的学校名称的向量:
  8. filter_vec_4_2 <- unique(unwanted_4_2$SCHOOL) ### 获取唯一的名称(只有一所学校有一个名称)
  9. ### 将原始数据框赋给一个虚拟数据框:
  10. filtered_df_4_2 <- quest40_2
  11. ### 循环遍历不想要的学校名称以删除它们:
  12. for (i in 1 :length(filter_vec_4_2)) {
  13. filtered_df_4_2 <- filtered_df_4_2[!filtered_df_4_2$SCHOOL == filter_vec_4_2[i],]
  14. }
  • 我需要计算每种plan按每种school type出现的次数(并获得其比例)。问题是,每所学校可以有多个plan类型。因此,如果我们要按每种school type计算plan类型的比例,我们不能按n来除
  1. b <- filtered_df_4_2 %>% drop_na(Q40) %>%
  2. count(Q40, Q9) %>%
  3. group_by(Q9)
  • 这使我想到了:
  1. ### 按每种'curriculo'计算唯一的学校数:
  2. b2 <- b2 %>%
  3. select(SCHOOL, Q9) %>%
  4. unique() %>% count(Q9)
  5. ## 合并数据框并获取每种'planejamento'类型的学校比例
  6. ## 在课程类型内
  7. c <- b %>% full_join(b2, by = 'Q9') %>%
  8. mutate(prop = round((n.x/n.y *100), digits = 2)) %>%
  9. select(-n.x, -n.y)
  • Q1 = 我认为我没有正确筛选,因为我不是专门筛选提供所有级别从01到04的学校。我想我在使用“或”而不是“与”。

  • Q2 = 有没有一种方法可以避免使用tidyverse中的循环?提前致以感谢。

  • 最终,我正在尝试开始=>学校是否为他们提供的级别制定计划或不制定计划,以及学校类型如何影响此事?(以后可以建模,但现在我只需要百分比)

  • 数据:

  1. dput(quest40_2)
  2. structure(list(SCHOOL = structure(c("School1", "School1", "School1",
  3. "School1", "School1", "School1", "School1", "School2", "School2",
  4. "School2", "School2", "School2", "School2", "School2", "School3",
  5. "School3", "School3", "School3", "School3", "School3", "School3",
  6. "School3", "School3", "School4", "School4", "School4", "School4",
  7. "School4", "School4", "School5", "School5", "School5", "School5",
  8. "School5", "School5", "School6", "School6", "School6", "School6",
  9. "School6", "School7", "School7", "School7", "School7", "School7",
  10. "School7", "School8", "School8", "School8", "School8", "School9",
  11. "School9", "School9", "School9", "School9", "School10", "School10",
  12. "School10", "School10", "School10", "School11", "School11", "School11",
  13. "School11", "School11", "School11", "School11", "School11", "School11",
  14. "School12", "School12", "School12", "School12", "School12", "School12",
  15. "School12", "School13", "School13", "School13", "School13", "School13",
  16. "School13", "School13", "School13", "School13", "School13", "School14",
  17. "School14", "School14", "School15", "School15", "School15", "School15",
  18. "School15", "School15", "School16", "School16", "School16", "School16",
  19. "School16", "School16", "School16", "School17", "School17", "School17",
  20. "School17", "School17", "School18", "School18", "School18", "School18",
  21. "School18", "School19", "School19", "School19", "School19", "School19",
  22. "School19", "School20", "School20", "School20", "School21", "School
  23. <details>
  24. <summary>英文:</summary>
  25. I want to calculate the proportion of ```plan``` with respect to ```school type``` in the dataset below. Thing is, I have to first subset the dataset a couple of times. In this case, for example, I need to subset the dataset so that I only get schools which offer from ```level 0``` up to ```level 04```. **How can I do that?**
  26. * edit:
  27. The way below *is working*, but I&#39;m not filtering all schools that have all levels from ```level 0``` up to ```level 04``` , I&#39;m getting the schools that have either of them. Ideas to that would be much appreciated.
  28. * data description:

SCHOOL = name of the school

Q9 = type of school

Q11 = levels that the schools offer

Q40 = types of planning that the school offers related to each level.

  1. Note: All plannings are related to a school level. Hence, If I don&#39;t filter by the levels that each school offers and just calculate this columns&#39; prop, my results will be misleading because each school can offer different level options (all of them, 1, 2 of them, etc)
  2. my attempt:

D) filter schools which only offer UP TO Level 04

##################################### Quest40_2 ##########################

check vector's names:

unique(quest40_2$Q11)

unwanted_4_2 <- quest40_2 %>%
filter(Q11 %in% c('level05','level06')) ### FILTER UNWANTED SCHOOL LEVELS

create a vector with UNWANTED SCHOOL NAMES:

filter_vec_4_2 <- unique(unwanted_4_2$SCHOOL) ### get unique names (only school has 1 name)

assign the original dataframe to a dummy data frame:

filtered_df_4_2 <- quest40_2

loop over unwanted schools' names to remove them:

for (i in 1 :length(filter_vec_4_2)) {

filtered_df_4_2 <- filtered_df_4_2[!filtered_df_4_2$SCHOOL == filter_vec_4_2[i],]

}

  1. * I need to count how many times EACH &#39;plan&#39; occurs by each type of &#39;school&#39; (and get its proportion). Problem = each school can have more than one &#39;plan&#39; type. Since each school can have more than one ```plan type```, if we want to count the proportion of
  2. ```plan``` types according to each ```school type```, we cannot divide by n

b <- filtered_df_4_2 %>% drop_na(Q40) %>%
count(Q40, Q9) %>%
group_by(Q9)

  1. * Which leaves me to:

counting the unique schools by each type of 'curriculo':

b2 <- b2 %>%
select(SCHOOL, Q9) %>%
unique() %>% count(Q9)

join dataframes and get the proportion of schools that have each type of 'planejamento'

within the curriculum types

c <- b %>% full_join(b2, by = 'Q9') %>%
mutate(prop = round((n.x/n.y *100), digits = 2)) %>%
select(-n.x, -n.y)

  1. * *Q1* = I don&#39;t think I&#39;m filtering it correctely since I&#39;m not exclusively filtering the schools that offer alll levels up to 01 to 04. I guess I&#39;m doing an &#39;or&#39; not an &#39;and&#39;
  2. * *Q2* = Is there a way to avoid the loop with ```tidyverse``` ? Thanks in adv.
  3. * Ultimately, I&#39;m trying to get the starts to =&gt; **&#39;do the schools plan their agenda for the levels they offer or not and how does the type of school impact this?&#39;** (later this can be modeled, but now I just need percentages)
  4. * data:

dput(quest40_2)
structure(list(SCHOOL = structure(c("School1", "School1", "School1",
"School1", "School1", "School1", "School1", "School2", "School2",
"School2", "School2", "School2", "School2", "School2", "School3",
"School3", "School3", "School3", "School3", "School3", "School3",
"School3", "School3", "School4", "School4", "School4", "School4",
"School4", "School4", "School5", "School5", "School5", "School5",
"School5", "School5", "School6", "School6", "School6", "School6",
"School6", "School7", "School7", "School7", "School7", "School7",
"School7", "School8", "School8", "School8", "School8", "School9",
"School9", "School9", "School9", "School9", "School10", "School10",
"School10", "School10", "School10", "School11", "School11", "School11",
"School11", "School11", "School11", "School11", "School11", "School11",
"School12", "School12", "School12", "School12", "School12", "School12",
"School12", "School13", "School13", "School13", "School13", "School13",
"School13", "School13", "School13", "School13", "School13", "School14",
"School14", "School14", "School15", "School15", "School15", "School15",
"School15", "School15", "School16", "School16", "School16", "School16",
"School16", "School16", "School16", "School17", "School17", "School17",
"School17", "School17", "School18", "School18", "School18", "School18",
"School18", "School19", "School19", "School19", "School19", "School19",
"School19", "School20", "School20", "School20", "School21", "School21",
"School21", "School21", "School21", "School22", "School22", "School22",
"School22", "School22", "School23", "School23", "School23", "School23",
"School23", "School23", "School24", "School24", "School24", "School24",
"School24", "School24", "School25", "School25", "School25", "School25",
"School25", "School26", "School26", "School26", "School26", "School26",
"School26", "School26", "School26", "School26", "School27", "School27",
"School27", "School27", "School27", "School27", "School27", "School28",
"School28", "School28", "School28", "School28", "School28", "School28",
"School28", "School29", "School29", "School29", "School29", "School29",
"School29", "School30", "School30", "School30", "School30", "School30",
"School30", "School30", "School30", "School30", "School30", "School30",
"School31", "School31", "School31", "School31", "School31", "School31",
"School31", "School31", "School31", "School31", "School31", "School32",
"School32", "School32", "School32", "School32", "School32", "School32",
"School32", "School32", "School32", "School32", "School33", "School33",
"School33", "School33", "School33", "School33", "School33", "School34",
"School34", "School34", "School34", "School34", "School34", "School34",
"School34", "School35", "School35", "School35", "School35", "School36",
"School36", "School36", "School36", "School36", "School37", "School37",
"School37", "School37", "School37", "School37", "School37", "School37",
"School38", "School38", "School38", "School38", "School39", "School39",
"School39", "School39", "School39", "School39"), class = c("glue",
"character")), Q9 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("typeA", "typeB", "typeC", "typeD"
), class = "factor"), Q11 = structure(c(3L, 7L, 4L, 2L, NA, NA,
NA, 7L, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA,
NA, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L,
1L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, NA, 4L, 2L, 1L, 5L,
NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, NA, 7L,
4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA,
5L, 6L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, NA,
7L, 4L, 2L, 1L, NA, 7L, 4L, 2L, 1L, NA, 4L, 2L, 1L, NA, NA, NA,
2L, 1L, NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L,
1L, NA, NA, 4L, 2L, 1L, NA, NA, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L,
2L, 1L, 5L, NA, NA, NA, NA, 2L, 1L, 5L, NA, NA, NA, NA, 3L, 7L,
4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, 7L, 4L, 2L, 1L,
5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA,
NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L,
1L, 5L, 6L, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA, 3L, 7L, 4L, NA,
7L, 4L, 2L, NA, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, 4L, 2L, 1L,
NA, 7L, 4L, 2L, 1L, 5L, NA), .Label = c("level04", "level03",
"level0", "level02", "level05", "level06", "level01"), class = "factor"),
Q40 = structure(c(NA, NA, NA, NA, 2L, 6L, 5L, NA, NA, NA,
NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA,
NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 1L, NA,
NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, NA, 2L, 6L, 4L,
3L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L,
6L, 5L, NA, NA, NA, NA, 2L, NA, NA, NA, NA, 1L, NA, NA, NA,
2L, 6L, 5L, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
1L, NA, NA, NA, NA, 2L, 1L, NA, NA, NA, 2L, 6L, 5L, NA, NA,
NA, NA, 1L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA,
2L, 6L, 5L, 4L, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA,
NA, 2L, 1L, NA, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA,
NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA,
NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, 2L, NA, NA, NA, 2L, 6L,
NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, 1L), .Label = c("none", "plan1_level0upto02", "plan5_level_05",
"plan4_level_05", "plan3_level_04", "plan2_level_03"), class = "factor")), row.names = c(NA,
-253L), class = c("tbl_df", "tbl", "data.frame"))

  1. </details>
  2. # 答案1
  3. **得分**: 1
  4. 以下是您需要的内容的中文翻译:
  5. 这可能是您需要的或为您提供一个良好的起点。它显示了每种类型的计划在级别0-4之间的数量,还显示了百分比。
  6. ```R
  7. left_join(na.omit(quest40_2[, c("SCHOOL", "Q9", "Q11")]),
  8. na.omit(quest40_2[, c("SCHOOL", "Q40")]), c("SCHOOL"),
  9. multiple = "all") %>%
  10. group_by(Q9, Q40) %>%
  11. filter(sub(".*(\\d+)$", "\", Q11) <= 4) %>%
  12. summarize(n = n(), .groups = "drop") %>%
  13. mutate(percentage = n / sum(n) * 100) %>%
  14. print(n = Inf)
  15. # 一个数据表:23 × 4
  16. Q9 Q40 n percentage
  17. <fct> <fct> <int> <dbl>
  18. 1 typeA none 3 0.974
  19. 2 typeA plan1_level0upto02 6 1.95
  20. 3 typeA plan5_level_05 4 1.30
  21. 4 typeA plan4_level_05 6 1.95
  22. 5 typeA plan3_level_04 6 1.95
  23. 6 typeA plan2_level_03 6 1.95
  24. 7 typeB none 21 6.82
  25. 8 typeB plan1_level0upto02 47 15.3
  26. 9 typeB plan5_level_05 4 1.30
  27. 10 typeB plan4_level_05 14 4.55
  28. 11 typeB plan3_level_04 31 10.1
  29. 12 typeB plan2_level_03 31 10.1
  30. 13 typeC none 32 10.4
  31. 14 typeC plan1_level0upto02 21 6.82
  32. 15 typeC plan4_level_05 4 1.30
  33. 16 typeC plan3_level_04 15 4.87
  34. 17 typeC plan2_level_03 18 5.84
  35. 18 typeD none 3 0.974
  36. 19 typeD plan1_level0upto02 8 2.60
  37. 20 typeD plan5_level_05 8 2.60
  38. 21 typeD plan4_level_05 8 2.60
  39. 22 typeD plan3_level_04 4 1.30
  40. 23 typeD plan2_level_03 8 2.60

请注意,代码部分已被保留为原文。

英文:

This might be what you need or give you a good start. It shows the number of plans per type with levels 0-4, also shows the percentages.

  1. left_join(na.omit(quest40_2[, c(&quot;SCHOOL&quot;, &quot;Q9&quot;, &quot;Q11&quot;)]),
  2. na.omit(quest40_2[, c(&quot;SCHOOL&quot;, &quot;Q40&quot;)]), c(&quot;SCHOOL&quot;),
  3. multiple = &quot;all&quot;) %&gt;%
  4. group_by(Q9, Q40) %&gt;%
  5. filter(sub(&quot;.*(\\d+)$&quot;, &quot;\&quot;, Q11) &lt;= 4) %&gt;%
  6. summarize(n = n(), .groups = &quot;drop&quot;) %&gt;%
  7. mutate(percentage = n / sum(n) * 100) %&gt;%
  8. print(n = Inf)
  9. # A tibble: 23 &#215; 4
  10. Q9 Q40 n percentage
  11. &lt;fct&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt;
  12. 1 typeA none 3 0.974
  13. 2 typeA plan1_level0upto02 6 1.95
  14. 3 typeA plan5_level_05 4 1.30
  15. 4 typeA plan4_level_05 6 1.95
  16. 5 typeA plan3_level_04 6 1.95
  17. 6 typeA plan2_level_03 6 1.95
  18. 7 typeB none 21 6.82
  19. 8 typeB plan1_level0upto02 47 15.3
  20. 9 typeB plan5_level_05 4 1.30
  21. 10 typeB plan4_level_05 14 4.55
  22. 11 typeB plan3_level_04 31 10.1
  23. 12 typeB plan2_level_03 31 10.1
  24. 13 typeC none 32 10.4
  25. 14 typeC plan1_level0upto02 21 6.82
  26. 15 typeC plan4_level_05 4 1.30
  27. 16 typeC plan3_level_04 15 4.87
  28. 17 typeC plan2_level_03 18 5.84
  29. 18 typeD none 3 0.974
  30. 19 typeD plan1_level0upto02 8 2.60
  31. 20 typeD plan5_level_05 8 2.60
  32. 21 typeD plan4_level_05 8 2.60
  33. 22 typeD plan3_level_04 4 1.30
  34. 23 typeD plan2_level_03 8 2.60

huangapple
  • 本文由 发表于 2023年2月18日 02:48:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/75488236.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定