June 29, 2023, 03:33:12

Remove column(s) with overrepresented categorical values

Question

I have a dataset like the one below:

data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)
data
  Col1 Col2 Col3 Col4 Col5 Col6
1  id1    A   BK   CA   Ao   Bc
2  id2   Bc   AB   XB   Bu   Bc
3  id3    A  BsC   CA   Ai   Bc
4  id4   As   BX   SC  Ayy   Bc
5  id5   As   BK   CA   Ao   Bc
6  id6   Bs  AsB   CA  Byu   Bc
7  id7    A   BC   CA  Aiy   Be
8  id8    A   BX   SC   Ay   Bd

If a category is over-represented in a column, that column needs to be omitted. For example, with a threshold of 0.74 (74%), Col6 is removed because category Bc is over-represented (6/8 = 75%). The filtered data would look like this:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay

Or, with a threshold of 60%, Col4 and Col6 are removed because category CA (in Col4) is over-represented (5/8 = 62.5%) and category Bc (in Col6) is over-represented (6/8 = 75%). The filtered data would look like this:

  Col1 Col2 Col3 Col5
1  id1    A   BK   Ao
2  id2   Bc   AB   Bu
3  id3    A  BsC   Ai
4  id4   As   BX  Ayy
5  id5   As   BK   Ao
6  id6   Bs  AsB  Byu
7  id7    A   BC  Aiy
8  id8    A   BX   Ay
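
The percentages quoted above can be checked by computing, for each column, the share of rows taken by its most frequent value. This is only an illustrative sketch; the name `top_share` is introduced here and is not part of the original question.

```R
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Share of the most frequent value in each column: highest count / number of rows.
top_share <- sapply(data, function(col) max(table(col)) / length(col))
top_share
#  Col1  Col2  Col3  Col4  Col5  Col6
# 0.125 0.500 0.250 0.625 0.250 0.750
```

Only Col6 exceeds 0.74, while both Col4 and Col6 exceed 0.6, which matches the two expected results.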

Answer 1

Score: 2

Here's a solution in `base`:

data[c(TRUE, apply(t(data[-1]), 1, function(x) max(table(x))) / nrow(data) < 0.6)]

#>   Col1 Col2 Col3 Col5
#> 1  id1    A   BK   Ao
#> 2  id2   Bc   AB   Bu
#> 3  id3    A  BsC   Ai
#> 4  id4   As   BX  Ayy
#> 5  id5   As   BK   Ao
#> 6  id6   Bs  AsB  Byu
#> 7  id7    A   BC  Aiy
#> 8  id8    A   BX   Ay
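
The one-liner keeps the first column unconditionally through the leading `TRUE`. If the threshold or the protected columns need to vary, the same idea could be wrapped in a small function. The sketch below is an assumption, not part of the original answer: `drop_overrepresented()`, `threshold`, and `keep` are hypothetical names, and it uses the same "keep if share < threshold" rule as the one-liner.

```R
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: keep a column when the share of its most frequent value
# is below `threshold`; columns named in `keep` are never dropped.
drop_overrepresented <- function(df, threshold, keep = character()) {
  top_share <- vapply(df, function(col) max(table(col)) / length(col), numeric(1))
  keep_col <- top_share < threshold | names(df) %in% keep
  df[keep_col]
}

names(drop_overrepresented(data, 0.6, keep = "Col1"))
# [1] "Col1" "Col2" "Col3" "Col5"
```

Naming the protected column avoids relying on its position, at the cost of a slightly longer call.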


Answer 2

Score: 2

Loop through the columns, get the table proportions for each one, and check whether the largest is smaller than the threshold:

```R
x = 0.74
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
#   Col1 Col2 Col3 Col4 Col5
# 1  id1    A   BK   CA   Ao
# 2  id2   Bc   AB   XB   Bu
# 3  id3    A  BsC   CA   Ai
# 4  id4   As   BX   SC  Ayy
# 5  id5   As   BK   CA   Ao
# 6  id6   Bs  AsB   CA  Byu
# 7  id7    A   BC   CA  Aiy
# 8  id8    A   BX   SC   Ay
```
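
One thing to be aware of: `table()` drops `NA` values by default, so `prop.table(table(i))` measures shares among the non-missing values only, whereas dividing by the column length counts every row. The example data has no missing values, so both give the same answer there. The vector below is made up purely to show the difference.

```R
# Hypothetical column with missing values (not from the question's data).
x <- c("A", "A", "A", NA, NA, "B")

# Share among non-missing values: 3 of 4.
share_non_missing <- max(prop.table(table(x)))

# Share among all rows: 3 of 6.
share_all_rows <- max(table(x)) / length(x)

# Treating NA as its own category also uses all 6 rows as the denominator.
share_with_na <- max(prop.table(table(x, useNA = "ifany")))

c(share_non_missing, share_all_rows, share_with_na)
# [1] 0.75 0.50 0.50
```

Which denominator is appropriate depends on whether missing values should count against a category being "over-represented".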

Answer 3

Score: 2

Here is a solution using apply, any, and proportions:

thresh <- 0.74
overrepcols <- apply(data, 2, function(x) any(proportions(table(x)) > thresh))
data[,!overrepcols]

Output:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay
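
A small caveat with the `data[, !overrepcols]` form: if only one column survives the filter, base R simplifies the result to a plain vector. Adding `drop = FALSE`, or using single-bracket list-style indexing, keeps it a data frame. The two-column data frame below is a made-up example to show this; it is not from the question.

```R
# Hypothetical data: `g` is dominated by "x" (3 of 4 rows), `id` is not.
df <- data.frame(id = c("a", "b", "c", "d"), g = c("x", "x", "x", "y"))
thresh <- 0.7

overrep <- apply(df, 2, function(x) any(proportions(table(x)) > thresh))

as_vector <- df[, !overrep]                 # only `id` remains, collapsed to a character vector
as_frame  <- df[, !overrep, drop = FALSE]   # stays a one-column data frame
as_frame2 <- df[!overrep]                   # list-style indexing also stays a data frame

class(as_vector)
class(as_frame)
```

Note that `proportions()` requires R 4.0.0 or later; `prop.table()` is the equivalent in older versions.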

Answer 4

Score: 2

And another answer using dplyr and base:

library(dplyr)
data %>%
  select_if(~max(table(.x)) / length(.x) < 0.6)
#    Col1 Col2 Col3 Col5
# 1  id1    A   BK   Ao
# 2  id2   Bc   AB   Bu
# 3  id3    A  BsC   Ai
# 4  id4   As   BX  Ayy
# 5  id5   As   BK   Ao
# 6  id6   Bs  AsB  Byu
# 7  id7    A   BC  Aiy
# 8  id8    A   BX   Ay

Answer 5

Score: 2

We can use the current dplyr syntax with select(where(condition)):

library(dplyr)
threshold <- 0.74
data |>
    select(where(\(x) !any(proportions(table(x)) > threshold)))
  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay
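
The answers on this page do not all treat a share exactly equal to the threshold the same way: the versions that keep a column when its top share is `< threshold` drop it at equality, while the versions that drop a column when any share is `> threshold` keep it. The base R sketch below uses the Col6 values with a threshold of exactly 0.75 to show the difference; the variable names are introduced here for illustration only.

```R
col6 <- c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
threshold <- 0.75

shares <- proportions(table(col6))   # Bc has a share of exactly 6/8 = 0.75

# Rule "keep if the top share is below the threshold": the column is dropped.
keep_strict <- max(shares) < threshold

# Rule "drop if any share is above the threshold": the column is kept.
keep_lenient <- !any(shares > threshold)

c(keep_strict, keep_lenient)
# [1] FALSE  TRUE
```

With the thresholds used in the question (0.74 and 0.6) no column sits exactly on the boundary, so both rules give the same result there.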

