Remove column(s) with overrepresented categorical values

Question

I have a dataset like below:

    data <- data.frame(
      Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
      Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
      Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
      Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
      Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
      Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
    )
    data

      Col1 Col2 Col3 Col4 Col5 Col6
    1  id1    A   BK   CA   Ao   Bc
    2  id2   Bc   AB   XB   Bu   Bc
    3  id3    A  BsC   CA   Ai   Bc
    4  id4   As   BX   SC  Ayy   Bc
    5  id5   As   BK   CA   Ao   Bc
    6  id6   Bs  AsB   CA  Byu   Bc
    7  id7    A   BC   CA  Aiy   Be
    8  id8    A   BX   SC   Ay   Bd

If a category is over-represented in a column, that column needs to be omitted. For example, if the threshold is 0.74 (74%), Col6 is removed because category Bc is over-represented (6/8 = 75%). The filtered data will look like this:

      Col1 Col2 Col3 Col4 Col5
    1  id1    A   BK   CA   Ao
    2  id2   Bc   AB   XB   Bu
    3  id3    A  BsC   CA   Ai
    4  id4   As   BX   SC  Ayy
    5  id5   As   BK   CA   Ao
    6  id6   Bs  AsB   CA  Byu
    7  id7    A   BC   CA  Aiy
    8  id8    A   BX   SC   Ay

Or if the threshold is 60%, the filtered data will remove Col4 and Col6 as category CA (in Col4) is over-represented (5/8=62.5%) and Bc (in Col6) is over-represented (6/8=75%). The filtered data will be like the following:

      Col1 Col2 Col3 Col5
    1  id1    A   BK   Ao
    2  id2   Bc   AB   Bu
    3  id3    A  BsC   Ai
    4  id4   As   BX  Ayy
    5  id5   As   BK   Ao
    6  id6   Bs  AsB  Byu
    7  id7    A   BC  Aiy
    8  id8    A   BX   Ay
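
To make the criterion concrete, the share of the most frequent value in each column can be computed directly. This is a minimal sketch that rebuilds the example data; the name `max_share` is introduced here for illustration only and is not part of the original question.

```r
# Rebuild the example data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Share of the most frequent value in each column
# (max_share is a name chosen for this sketch)
max_share <- sapply(data, function(col) max(table(col)) / length(col))
max_share
#  Col1  Col2  Col3  Col4  Col5  Col6
# 0.125 0.500 0.250 0.625 0.250 0.750
```

With a threshold of 0.74 only Col6 (0.75) exceeds it; with a threshold of 0.6, both Col4 (0.625) and Col6 do.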

Answer 1

Score: 2

Here's a solution in `base`:

    # Always keep Col1 (the id column); keep any other column whose most
    # frequent value covers less than 60% of the rows
    data[c(TRUE, apply(t(data[-1]), 1, function(x) max(table(x))) / nrow(data) < 0.6)]

    #>   Col1 Col2 Col3 Col5
    #> 1  id1    A   BK   Ao
    #> 2  id2   Bc   AB   Bu
    #> 3  id3    A  BsC   Ai
    #> 4  id4   As   BX  Ayy
    #> 5  id5   As   BK   Ao
    #> 6  id6   Bs  AsB  Byu
    #> 7  id7    A   BC  Aiy
    #> 8  id8    A   BX   Ay

Answer 2

Score: 2

Loop through the columns, get the table frequencies, and check whether the largest one is smaller than the threshold:

```R
x = 0.74
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
#   Col1 Col2 Col3 Col4 Col5
# 1  id1    A   BK   CA   Ao
# 2  id2   Bc   AB   XB   Bu
# 3  id3    A  BsC   CA   Ai
# 4  id4   As   BX   SC  Ayy
# 5  id5   As   BK   CA   Ao
# 6  id6   Bs  AsB   CA  Byu
# 7  id7    A   BC   CA  Aiy
# 8  id8    A   BX   SC   Ay
```
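
The same idea can be wrapped in a small helper so the threshold becomes a parameter. The function name `drop_overrepresented`, its arguments, and the choice to ignore `NA` values when computing shares are all assumptions made for this sketch; they are not part of the answer above.

```r
# Example data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: keep only columns whose most frequent value has a
# share strictly below `threshold`. table() skips NA by default, so shares
# are computed over non-missing values only.
drop_overrepresented <- function(df, threshold) {
  keep <- vapply(df, function(col) {
    counts <- table(col)
    # A column without any non-missing value has nothing overrepresented
    if (sum(counts) == 0) return(TRUE)
    max(prop.table(counts)) < threshold
  }, logical(1))
  # drop = FALSE keeps a data frame even if a single column survives
  df[, keep, drop = FALSE]
}

names(drop_overrepresented(data, 0.74))  # "Col1" "Col2" "Col3" "Col4" "Col5"
names(drop_overrepresented(data, 0.6))   # "Col1" "Col2" "Col3" "Col5"
```

Because the comparison is strict, a column whose top share equals the threshold exactly is dropped; swapping `<` for `<=` would keep it instead.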

Answer 3

Score: 2

Here is a solution using `apply`, `any`, and `proportions`:

    thresh <- 0.74
    overrepcols <- apply(data, 2, function(x) any(proportions(table(x)) > thresh))
    data[, !overrepcols]

Output:

      Col1 Col2 Col3 Col4 Col5
    1  id1    A   BK   CA   Ao
    2  id2   Bc   AB   XB   Bu
    3  id3    A  BsC   CA   Ai
    4  id4   As   BX   SC  Ayy
    5  id5   As   BK   CA   Ao
    6  id6   Bs  AsB   CA  Byu
    7  id7    A   BC   CA  Aiy
    8  id8    A   BX   SC   Ay
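
One detail worth guarding against: with `data[, !overrepcols]`, base R simplifies the result to a plain vector when only one column survives. The sketch below shows the effect and the `drop = FALSE` workaround. It uses a deliberately strict, made-up threshold of 0.2 so that only `Col1` remains, and it swaps in `sapply` and `prop.table` (the older name for `proportions`) so that it also runs on older R versions.

```r
# Example data from the question; stringsAsFactors = FALSE is set explicitly
# so the behaviour is the same on R versions before 4.0
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd"),
  stringsAsFactors = FALSE
)

thresh <- 0.2  # made-up, deliberately strict threshold for this sketch
overrepcols <- sapply(data, function(x) any(prop.table(table(x)) > thresh))

# Without drop = FALSE a single surviving column collapses to a vector
class(data[, !overrepcols])                # "character"

# With drop = FALSE the result stays a one-column data frame
class(data[, !overrepcols, drop = FALSE])  # "data.frame"
```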

Answer 4

Score: 2

Another answer, using `dplyr` and `base`:

    library(dplyr)

    data %>%
      select_if(~ max(table(.x)) / length(.x) < 0.6)
    #   Col1 Col2 Col3 Col5
    # 1  id1    A   BK   Ao
    # 2  id2   Bc   AB   Bu
    # 3  id3    A  BsC   Ai
    # 4  id4   As   BX  Ayy
    # 5  id5   As   BK   Ao
    # 6  id6   Bs  AsB  Byu
    # 7  id7    A   BC  Aiy
    # 8  id8    A   BX   Ay

Answer 5

Score: 2

We can use the current dplyr syntax with `select(where(condition))`:

    library(dplyr)

    threshold <- 0.74

    data |>
      select(where(\(x) !any(proportions(table(x)) > threshold)))
    #   Col1 Col2 Col3 Col4 Col5
    # 1  id1    A   BK   CA   Ao
    # 2  id2   Bc   AB   XB   Bu
    # 3  id3    A  BsC   CA   Ai
    # 4  id4   As   BX   SC  Ayy
    # 5  id5   As   BK   CA   Ao
    # 6  id6   Bs  AsB   CA  Byu
    # 7  id7    A   BC   CA  Aiy
    # 8  id8    A   BX   SC   Ay

huangapple
  • Published on June 29, 2023, 03:33:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76576203.html