Remove random sample to make group proportions match

Question

I have a data frame of regional data, and I'd like to equalise the proportions of a specific variable within each country. Here is my example.

The table below gives the sample counts for Gender (rows) by Country (columns). I'd like to remove rows so that the counts for 0 and 1 are equal within each country.

    > table(df$Gender, df$COUNTRY)

          1   2   3
      0  86  81 282
      1  21   7  23

Is there a package or function that will randomly remove rows where Gender is 0, keeping just enough to match the number of rows where Gender is 1?

This would be the desired result:

    > table(df$Gender, df$COUNTRY)

          1   2   3
      0  21   7  23
      1  21   7  23

A more basic way to do this would help too. For example, removing a random sample of 65 rows where df$COUNTRY == 1 & df$Gender == 0; I could then handle each country manually.

As someone requested a dput, here it is. The tables above have been updated to match.

    df <- structure(list(Gender = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
    1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
    0, 1, 1, 0, 0), COUNTRY = c(2, 3, 2, 1, 3, 3, 3, 2, 2, 3, 3,
    3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 2, 3,
    3, 3, 2, 2, 3, 3, 3, 2, 3, 2, 2, 1, 3, 3, 3, 2, 2, 3, 1, 1, 2,
    2, 1, 3, 3, 1, 2, 1, 3, 3, 3, 1, 1, 3, 3, 3, 1, 3, 2, 1, 3, 2,
    3, 3, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 1, 1, 3, 1, 2, 1, 3,
    2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 3, 3, 1, 1, 3, 1, 1, 2,
    2, 3, 3, 1, 2, 3, 3, 3, 2, 3, 3, 1, 3, 3, 1, 3, 1, 1, 3, 3, 2,
    3, 1, 1, 1, 3, 3, 3, 2, 3, 3, 2, 3, 2, 1, 3, 2, 3, 1, 3, 2, 2,
    2, 3, 3, 2, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 2, 2,
    1, 1, 3, 1, 1, 1, 3, 3, 1, 2, 1, 1, 1, 1, 3, 3, 3, 1, 3, 3, 2,
    3, 3, 3, 3, 3, 1, 3, 3, 2, 1, 1, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3,
    3, 3, 1, 3, 2, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3,
    3, 1, 3, 2, 1, 1, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 1, 2, 3, 1, 2,
    3, 2, 1, 2, 1, 3, 1, 3, 3, 3, 3, 3, 1, 3, 1, 1, 3, 1, 3, 1, 1,
    3, 3, 1, 3, 1, 1, 1, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3,
    3, 3, 3, 3, 2, 3, 1, 3, 3, 3, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    1, 3, 3, 1, 2, 3, 1, 3, 3, 3, 1, 3, 1, 3, 3, 3, 1, 1, 3, 2, 3,
    1, 3, 3, 3, 1, 2, 3, 3, 3, 3, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1,
    3, 3, 3, 3, 3, 1, 1, 3, 3, 2, 3, 3, 3, 3, 1, 3, 3, 2, 3, 3, 1,
    3, 3, 3, 2, 3, 1, 3, 3, 1, 3, 2, 1, 2, 3, 3, 3, 3, 3, 3, 2, 2,
    2, 3, 3, 2, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 2, 1, 3, 3, 3, 3, 3,
    3, 1, 3, 1, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3,
    1, 1, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3,
    3, 3, 2, 3, 2, 3)), row.names = c(NA, 500L), class = "data.frame")

Answer 1

Score: 4

Here is a base R way.
First, tabulate the data and, for each column of the table (each value of b), compute the difference between the two row counts. Then, for each column, randomly sample that many row indices from the rows where a is 0 and b equals the column name. These are the rows to remove. This assumes a == 0 is the larger class for every value of b, as in the question.

    n <- 100L
    set.seed(2023)
    df1 <- data.frame(
      a = sample(0:1, n, TRUE, prob = c(3, 1)/4),
      b = sample(3, n, TRUE)
    )

    # rows of the over-represented class (a == 0)
    i <- df1$a == 0
    # per column of the table: count(a == 1) - count(a == 0)
    table(df1) |>
      apply(2, diff) -> tmp
    # for each value of b, draw as many a == 0 rows as the size of the difference
    lapply(names(tmp), \(nm) {
      j <- which(i & df1$b == nm)
      sample(j, abs(tmp[nm]))
    }) |> unlist() -> tmp
    # drop the sampled rows and tabulate again
    df1[-tmp, ] |> table()
    #>    b
    #> a    1  2  3
    #>   0  8 10  8
    #>   1  8 10  8
    rm(tmp) # tidy up

Created on 2023-08-08 with reprex v2.0.2
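
The same idea can be wrapped in a small helper that works on the Gender and COUNTRY columns from the question. This is only a sketch: the function name balance_within and the demo data frame are made up for illustration, and the demo rebuilds the question's counts rather than using the real dput. It indexes with sample.int() because sample(j, k) treats a single number j as 1:j, which could pick the wrong rows if a group had exactly one candidate row.

```r
# Hypothetical helper (not from the original answer): within each group,
# keep a random subset of every class, sized to the smallest class.
balance_within <- function(data, class_col, group_col) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  keep <- lapply(rows_by_group, function(rows) {
    # row indices of this group, split by class value
    by_class <- split(rows, data[[class_col]][rows], drop = TRUE)
    n_min <- min(lengths(by_class))
    # sample positions, then index, to avoid the sample(<single number>) pitfall
    lapply(by_class, function(r) r[sample.int(length(r), n_min)])
  })
  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

# Demo data with the same counts as the table in the question (assumed layout)
set.seed(1)
demo <- data.frame(
  Gender  = rep(c(0, 1, 0, 1, 0, 1), times = c(86, 21, 81, 7, 282, 23)),
  COUNTRY = rep(c(1, 2, 3), times = c(107, 88, 305))
)
balanced <- balance_within(demo, "Gender", "COUNTRY")
table(balanced$Gender, balanced$COUNTRY)
```

Which rows are kept depends on the seed, but the resulting counts per country are always 21, 7 and 23 for both genders.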

Answer 2

Score: 3

caret::downSample is practically tailor-made for this.

Some data:

    z1 <- data.frame(gender = c(rep(0, 9155), rep(1, 2628)),
                     country = 1)
    z2 <- data.frame(gender = c(rep(0, 9335), rep(1, 1242)),
                     country = 2)
    z3 <- data.frame(gender = c(rep(0, 31964), rep(1, 2720)),
                     country = 3)
    z <- rbind(z1, z2, z3)
    table(z)
    # output
    #       country
    # gender     1     2     3
    #      0  9155  9335 31964
    #      1  2628  1242  2720

Apply down-sampling, where the y argument is the interaction of the categorical variables you wish to balance on:

    set.seed(123) # make it reproducible
    z_down <- caret::downSample(z, as.factor(interaction(z$gender, z$country)))
    table(z_down[names(z_down) != "Class"]) # remove added column
    # output
    #       country
    # gender    1    2    3
    #      0 1242 1242 1242
    #      1 1242 1242 1242

This gives every combination of the categorical variables the same number of rows as the smallest combination (1242).
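
To see why the smallest combination sets the size, it can help to look at what interaction() produces on its own. The snippet below is a minimal base R illustration with made-up toy vectors (gender and country here are not the z data above); each level is one gender/country combination, and the smallest count is the size every combination would be cut down to.

```r
# Toy vectors, invented for illustration only
gender  <- c(0, 0, 0, 1, 0, 0, 1, 1)
country <- c(1, 1, 1, 1, 2, 2, 2, 2)

# One factor level per gender.country combination
combo <- interaction(gender, country)
levels(combo)      # "0.1" "1.1" "0.2" "1.2"
table(combo)

# The rarest combination determines the per-combination sample size
min(table(combo))  # 1
```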

If this is not desired, you can down-sample within each country instead:

    lapply(split(z, z$country), \(x) caret::downSample(x, as.factor(x$gender))) -> z_2

Then combine the list of data frames into one:

    z_2 <- do.call(rbind, z_2)

and remove the column that caret added:

    z_2 <- z_2[, names(z_2) != "Class"]
    table(z_2)
    # output
    #       country
    # gender    1    2    3
    #      0 2628 1242 2720
    #      1 2628 1242 2720

EDIT: I had some extra time, so here is a function that does what you mention in the comment (a proportion mix), since caret::downSample does not:

    downSample_custm <- function(df,
                                 factor,
                                 by = rep(1, nrow(df)),
                                 proportion = 1,
                                 output = c("df", "index")) {
      output <- match.arg(output)
      # split data.frame into a list using the by argument;
      # if by is not used, the whole data.frame is treated as one group
      splited <- split(data.frame(df,
                                  .factor = as.factor(factor),
                                  row = seq_len(nrow(df))), by)
      lapply(splited, function(x) {
        # get the number of instances of the minority class
        minClass <- min(table(x$.factor))
        # split input data.frame by class
        spl <- split(x, x$.factor)
        # loop over classes
        lapply(spl, function(i) {
          # i is the data.frame holding a single class
          y <- nrow(i) # number of rows
          if (proportion >= 1 && y <= ceiling(minClass * proportion)) {
            # if the class has no more than minClass * proportion instances, return it all
            return(i)
          } else {
            # sample ceiling(minClass * proportion) rows from the class
            return(i[sample(seq_len(nrow(i)), ceiling(minClass * proportion)), ])
          }
        }) -> out1
        out1 <- do.call(rbind, out1)
        return(out1)
      }) -> out2
      out2 <- do.call(rbind, out2)
      rownames(out2) <- NULL
      if (output == "df") {
        return(out2[, !colnames(out2) %in% c(".factor", "row")])
      } else if (output == "index") {
        return(sort(out2$row))
      }
    }
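
The size each class is cut down to is ceiling(minClass * proportion), where minClass is the minority count within a group. As a quick sanity check, the arithmetic can be worked through for the three countries in the example data; the vector name min_class is just for this illustration.

```r
# Minority-class (gender == 1) counts per country in the example data
min_class <- c(`1` = 2628, `2` = 1242, `3` = 2720)

# proportion = 0.3: every class is sampled down to this many rows
target_03 <- ceiling(min_class * 0.3)
target_03  # 789 373 816

# proportion = 1.5: the larger class is capped at this many rows,
# while the minority class is returned whole
target_15 <- ceiling(min_class * 1.5)
target_15  # 3942 1863 4080
```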

Usage:

Down-sample all classes:

    z2 <- downSample_custm(z, factor = z$gender, by = z$country, proportion = 0.3)
    table(z2)
    # output
    #       country
    # gender   1   2   3
    #      0 789 373 816
    #      1 789 373 816

Down-sample only the classes that are over-represented:

    z3 <- downSample_custm(z, factor = z$gender, by = z$country, proportion = 1.5)
    table(z3)
    # output
    #       country
    # gender    1    2    3
    #      0 3942 1863 4080
    #      1 2628 1242 2720

Return the row index instead of the down-sampled data.frame:

    z4 <- downSample_custm(z, factor = z$gender, by = z$country, proportion = 1, output = "index")
    table(z[z4, ])
    # output
    #       country
    # gender    1    2    3
    #      0 2628 1242 2720
    #      1 2628 1242 2720

Simple down-sample (no by group):

    z5 <- downSample_custm(z, factor = z$gender, proportion = 1)
    table(z5$gender)
    # output
    #    0    1
    # 6590 6590

huangapple
  • Posted on August 8, 2023, 23:06:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860855.html