Remove column(s) with overrepresented categorical values
Question
I have a dataset like the one below:
data <- data.frame(
Col1 = c("id1", "id2", "id3", "id4","id5", "id6", "id7", "id8"),
Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)
data
Col1 Col2 Col3 Col4 Col5 Col6
1 id1 A BK CA Ao Bc
2 id2 Bc AB XB Bu Bc
3 id3 A BsC CA Ai Bc
4 id4 As BX SC Ayy Bc
5 id5 As BK CA Ao Bc
6 id6 Bs AsB CA Byu Bc
7 id7 A BC CA Aiy Be
8 id8 A BX SC Ay Bd
If a category is over-represented in a column, that column needs to be omitted. For example, if the threshold is 0.74 (74%), the filtered data drops `Col6`, because category `Bc` is over-represented (6/8 = 75%). The filtered data will look like this:
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay
Or, if the threshold is 60%, the filtered data drops `Col4` and `Col6`, because category `CA` (in `Col4`) is over-represented (5/8 = 62.5%) and so is `Bc` (in `Col6`, 6/8 = 75%). The filtered data will look like this:
Col1 Col2 Col3 Col5
1 id1 A BK Ao
2 id2 Bc AB Bu
3 id3 A BsC Ai
4 id4 As BX Ayy
5 id5 As BK Ao
6 id6 Bs AsB Byu
7 id7 A BC Aiy
8 id8 A BX Ay
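One way to make the rule concrete is to compute, for each column, the share of rows taken by its most frequent value and compare that share with the threshold. The following is a minimal base R sketch on the data from the question; the name `top_share` is introduced here for illustration only.

```r
# Example data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Share of rows taken by the most frequent value in each column
top_share <- vapply(data, function(x) max(table(x)) / length(x), numeric(1))
top_share
# Col1 = 0.125, Col2 = 0.5, Col3 = 0.25, Col4 = 0.625, Col5 = 0.25, Col6 = 0.75

# Columns that exceed each threshold from the question
over_74 <- names(top_share)[top_share > 0.74]  # "Col6"
over_60 <- names(top_share)[top_share > 0.60]  # "Col4" "Col6"
```

Only `Col4` and `Col6` have a dominant value covering more than 60% of the rows, which matches the two expected outputs above.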
Answer 1
Score: 2
Here's a solution in `base`:
data[c(TRUE, apply(t(data[-1]), 1, function(x) max(table(x))) / nrow(data) < 0.6)]
#> Col1 Col2 Col3 Col5
#> 1 id1 A BK Ao
#> 2 id2 Bc AB Bu
#> 3 id3 A BsC Ai
#> 4 id4 As BX Ayy
#> 5 id5 As BK Ao
#> 6 id6 Bs AsB Byu
#> 7 id7 A BC Aiy
#> 8 id8 A BX Ay
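The one-liner hard-codes both the threshold and the assumption that the first column is an ID. If that is too rigid, the same idea could be wrapped in a small function. This is only a sketch: `drop_overrepresented`, its `keep` argument, and the toy `df` are hypothetical names, not part of the answer. It also treats a share exactly equal to the threshold as acceptable, whereas a strict `<` comparison would drop such a column.

```r
# Hypothetical helper: drop columns whose most frequent value covers more
# than `threshold` of the rows. Columns named in `keep` are never dropped.
drop_overrepresented <- function(df, threshold, keep = character(0)) {
  share <- vapply(df, function(x) max(table(x)) / length(x), numeric(1))
  keep_col <- share <= threshold | names(df) %in% keep
  df[, keep_col, drop = FALSE]  # drop = FALSE: always return a data frame
}

# Toy data: top shares are id = 0.25, x = 0.5, y = 0.75
df <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c("p", "p", "q", "r"),
  y  = c("p", "p", "p", "q")
)

res_half <- drop_overrepresented(df, 0.5)               # keeps id and x
res_keep <- drop_overrepresented(df, 0.2, keep = "id")  # keeps only id
```

With a threshold of 0.5, column `x` sits exactly on the boundary and is kept here; which side of the boundary should count as over-represented is a choice the question does not settle.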
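Answer 2
Score: 2
Loop through the columns, compute each column's table of relative frequencies, and keep the columns whose largest proportion is smaller than the threshold:
```R
x <- 0.74  # threshold
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
# Col1 Col2 Col3 Col4 Col5
# 1 id1 A BK CA Ao
# 2 id2 Bc AB XB Bu
# 3 id3 A BsC CA Ai
# 4 id4 As BX SC Ayy
# 5 id5 As BK CA Ao
# 6 id6 Bs AsB CA Byu
# 7 id7 A BC CA Aiy
# 8 id8 A BX SC Ay
```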
Answer 3
Score: 2
Here is a solution using `apply`, `any`, and `proportions`:
thresh <- 0.74
overrepcols <- apply(data, 2, function(x) any(proportions(table(x)) > thresh))
data[,!overrepcols]
Output:
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay
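One caveat with matrix-style indexing such as `data[, !overrepcols]`: if only one column survives the filter, R returns a plain vector instead of a data frame. The sketch below shows this on a toy data frame (`df` is a hypothetical example, not the data from the question) and two ways to avoid it. It uses `prop.table()` so that it also runs on R versions older than 4.0.1, where `proportions()` is not available.

```r
# Toy data: `x` is dominated by "p" (3/4 = 75%), so only `id` survives
df <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c("p", "p", "p", "q"),
  stringsAsFactors = FALSE
)

thresh <- 0.74
overrepcols <- vapply(
  df,
  function(x) any(prop.table(table(x)) > thresh),
  logical(1)
)

as_vector <- df[, !overrepcols]                # character vector, not a data frame
as_frame  <- df[, !overrepcols, drop = FALSE]  # one-column data frame
as_list_i <- df[!overrepcols]                  # list-style indexing keeps a data frame
```

Either `drop = FALSE` or list-style indexing keeps the result a data frame regardless of how many columns remain.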
Answer 4
Score: 2
And another answer using `dplyr` and `base`:
library(dplyr)

data %>%
  select_if(~ max(table(.x)) / length(.x) < 0.6)
# Col1 Col2 Col3 Col5
# 1 id1 A BK Ao
# 2 id2 Bc AB Bu
# 3 id3 A BsC Ai
# 4 id4 As BX Ayy
# 5 id5 As BK Ao
# 6 id6 Bs AsB Byu
# 7 id7 A BC Aiy
# 8 id8 A BX Ay
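The example data has no missing values, but if a column does contain `NA`, the two ways of computing the share used in these answers can disagree. `table()` ignores `NA` by default, so `max(table(x)) / length(x)` still counts the missing rows in the denominator while `prop.table(table(x))` does not. A small sketch with a made-up vector:

```r
# Made-up column with two missing values
x <- c("A", "A", NA, NA)

share_len  <- max(table(x)) / length(x)                   # 0.5: NAs stay in the denominator
share_prop <- max(prop.table(table(x)))                   # 1:   NAs are excluded entirely
share_na   <- max(prop.table(table(x, useNA = "ifany")))  # 0.5: NA counted as its own category
```

Which behaviour is appropriate depends on whether missing values should count toward the column total; the question does not say.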
Answer 5
Score: 2
We can use the current `dplyr` syntax, `select(where(condition))`:
library(dplyr)

threshold <- 0.74

# The native pipe `|>` and the `\(x)` lambda shorthand need R >= 4.1
data |>
  select(where(\(x) !any(proportions(table(x)) > threshold)))
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay