Remove column(s) with overrepresented categorical values

Question

I have a dataset like the one below:

data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

data

  Col1 Col2 Col3 Col4 Col5 Col6
1  id1    A   BK   CA   Ao   Bc
2  id2   Bc   AB   XB   Bu   Bc
3  id3    A  BsC   CA   Ai   Bc
4  id4   As   BX   SC  Ayy   Bc
5  id5   As   BK   CA   Ao   Bc
6  id6   Bs  AsB   CA  Byu   Bc
7  id7    A   BC   CA  Aiy   Be
8  id8    A   BX   SC   Ay   Bd

If any category in a column is over-represented, that column needs to be dropped. For example, with a threshold of 0.74 (74%), the filtered data drops Col6 because category Bc is over-represented (6/8 = 75%). The filtered data would look like this:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay

Or, with a threshold of 60%, the filtered data drops Col4 and Col6, because category CA (in Col4) is over-represented (5/8 = 62.5%) and Bc (in Col6) is over-represented (6/8 = 75%). The filtered data would look like this:

  Col1 Col2 Col3 Col5
1  id1    A   BK   Ao
2  id2   Bc   AB   Bu
3  id3    A  BsC   Ai
4  id4   As   BX  Ayy
5  id5   As   BK   Ao
6  id6   Bs  AsB  Byu
7  id7    A   BC  Aiy
8  id8    A   BX   Ay
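
One way to see which columns a given threshold would catch is to compute, for each column, the share of rows taken by its most frequent value. The sketch below does that in base R for the sample data. The helper name `dominant_share` is made up for illustration, and the sketch assumes there are no missing values.

```r
# Sample data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: share of rows taken by the most frequent value
dominant_share <- function(x) max(table(x)) / length(x)

# Skip the id column (Col1) and compute the share for every other column
shares <- sapply(data[-1], dominant_share)
shares
#  Col2  Col3  Col4  Col5  Col6
# 0.500 0.250 0.625 0.250 0.750

# Columns that a 60% threshold would flag
names(shares)[shares > 0.6]
# "Col4" "Col6"
```

Only Col6 exceeds 0.74, while both Col4 and Col6 exceed 0.60, which matches the two expected results above.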

Answer 1

Score: 2

Here's a solution in `base`:

data[c(TRUE, apply(t(data[-1]), 1, function(x) max(table(x))) / nrow(data) < 0.6)]

#>   Col1 Col2 Col3 Col5
#> 1  id1    A   BK   Ao
#> 2  id2   Bc   AB   Bu
#> 3  id3    A  BsC   Ai
#> 4  id4   As   BX  Ayy
#> 5  id5   As   BK   Ao
#> 6  id6   Bs  AsB  Byu
#> 7  id7    A   BC  Aiy
#> 8  id8    A   BX   Ay



Answer 2

Score: 2

Loop through the columns, get each column's table of frequencies, and check whether the largest proportion is smaller than the threshold:

```R
x = 0.74
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
#   Col1 Col2 Col3 Col4 Col5
# 1  id1    A   BK   CA   Ao
# 2  id2   Bc   AB   XB   Bu
# 3  id3    A  BsC   CA   Ai
# 4  id4   As   BX   SC  Ayy
# 5  id5   As   BK   CA   Ao
# 6  id6   Bs  AsB   CA  Byu
# 7  id7    A   BC   CA  Aiy
# 8  id8    A   BX   SC   Ay
```

Answer 3

Score: 2

Here is a solution using apply, any, and proportions:

thresh <- 0.74

overrepcols <- apply(data, 2, function(x) any(proportions(table(x)) > thresh))

data[,!overrepcols]

Output:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay
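
The approaches above can be wrapped in a small reusable function. This is a minimal sketch, not taken from any of the answers: the name `drop_overrepresented` and its `keep` argument are hypothetical. It uses `drop = FALSE` so the result stays a data frame even if only one column survives.

```r
# Sample data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: drop every column whose most frequent value
# accounts for more than `threshold` of its non-missing entries.
# Columns named in `keep` (e.g. an id column) are never dropped.
drop_overrepresented <- function(df, threshold, keep = character()) {
  top_share <- vapply(df, function(col) {
    counts <- table(col)                # table() ignores NA by default
    if (length(counts) == 0) return(0)  # all-NA column: nothing to flag
    max(counts) / sum(counts)
  }, numeric(1))
  over <- top_share > threshold & !(names(df) %in% keep)
  df[, !over, drop = FALSE]
}

names(drop_overrepresented(data, 0.74, keep = "Col1"))
# "Col1" "Col2" "Col3" "Col4" "Col5"

names(drop_overrepresented(data, 0.60, keep = "Col1"))
# "Col1" "Col2" "Col3" "Col5"
```

Protecting the id column explicitly is a design choice made for this sketch. In the sample data Col1 would survive anyway, since every id is unique.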

Answer 4

Score: 2

Another answer, using dplyr and base R:

library(dplyr)

data %>%
  select_if(~max(table(.x)) / length(.x) < 0.6)

#    Col1 Col2 Col3 Col5
# 1  id1    A   BK   Ao
# 2  id2   Bc   AB   Bu
# 3  id3    A  BsC   Ai
# 4  id4   As   BX  Ayy
# 5  id5   As   BK   Ao
# 6  id6   Bs  AsB  Byu
# 7  id7    A   BC  Aiy
# 8  id8    A   BX   Ay

Answer 5

Score: 2

We can use the current dplyr syntax with select(where(condition)):

library(dplyr)

threshold <- 0.74
data |>
    select(where(\(x) !any(proportions(table(x)) > threshold)))

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay
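
The answers differ slightly in their denominator: `max(table(x)) / nrow(data)` divides by all rows, while `proportions(table(x))` divides by the number of non-missing values, because `table()` drops `NA` by default. The sample data has no missing values, so both give the same result there. The vector below is invented purely to illustrate the difference.

```r
# Made-up column with two missing values (not from the question's data)
x <- c("A", "A", "A", NA, NA, "B", "C", "D")

# Share of the top category over all rows: 3 / 8
share_all_rows <- max(table(x)) / length(x)

# Share of the top category over non-missing values: 3 / 6
# (prop.table() is the older name for proportions())
share_non_missing <- max(prop.table(table(x)))

share_all_rows     # 0.375
share_non_missing  # 0.5
```

With a threshold of 0.4, for instance, the first measure would keep this column and the second would drop it, so it is worth deciding which denominator is intended before choosing an approach.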




huangapple
  • Posted on June 29, 2023, 03:33:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76576203.html