How to automatically create columns to identify outliers for each numeric variable
Question
I want to automatically create an outlier column for each numeric variable. The column that identifies the outliers of a variable must be placed immediately after that variable, and its values must be either "yes" or "no".
Is it possible to automate this?
ID <- 1:10
Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
Sex <- c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F")
Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
City <- head(LETTERS, 10)
Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
mydata2 <- data.frame(ID, Weight, Sex, Height, City, Income)
I use the Outlier() function from the DescTools package to identify the outliers:
library(DescTools)
Outlier(mydata2$Weight)
[1] 22 150
Outlier(mydata2$Height)
[1] 1.30 1.10 2.65
Outlier(mydata2$Income)
[1] 12000 800
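For reference, the values reported above are consistent with the boxplot rule, which flags points lying more than 1.5 × IQR below the first quartile or above the third quartile. As far as I understand the DescTools documentation, this is the default method of Outlier(). Below is a minimal base R sketch of that rule. The helper name iqr_outliers is hypothetical and not part of DescTools, and the sketch assumes the default quantile() definition.

```r
# Hypothetical helper (not part of DescTools): boxplot-rule outliers.
# Assumes quartiles are computed with quantile()'s default method (type 7).
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])  # k times the interquartile range
  # Keep only non-missing values that fall outside the two fences
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
iqr_outliers(Weight)
# [1]  22 150
```

For Weight the quartiles are 60.675 and 71.8, so the fences sit at roughly 43.99 and 88.49, which is why 45 is kept and 22 and 150 are flagged.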
This is the expected dataset: Weight_outlier comes just after Weight, Height_outlier comes just after Height, and so on.
I have about a dozen numeric variables in my real dataset.
ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1 1 65.1 no M 1.30 yes A 1200 no
2 2 70.3 no F 1.65 no B 2000 no
3 3 22.0 yes F 1.75 no C 2100 no
4 4 45.0 no F 1.86 no D 2550 no
5 5 150.0 yes M 1.79 no E 12000 yes
6 6 68.5 no F 1.76 no F 800 yes
7 7 87.2 no M 1.10 yes G 3000 no
8 8 66.4 no M 2.65 yes H 2400 no
9 9 59.2 no F 1.75 no I 1895 no
10 10 72.3 no F 1.65 no J 2300 no
Answer 1
Score: 1
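mutate(across(...)) flags the values that match an outlier in each column, and relocate() then moves each new column to just after its source column. reduce2() from purrr applies that relocation to every pair of columns in turn. Note that the resulting flag columns are logical (TRUE/FALSE) rather than "yes"/"no".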
library(tidyverse)

# Names of the columns you wish to test
outlier_vars <- mydata2 %>% dplyr::select(Weight, Height, Income) %>% names()

mydata2 %>%
  # Flag the values that match an outlier in each column; the new columns get the suffix "_outlier"
  mutate(across(all_of(outlier_vars), ~ .x %in% DescTools::Outlier(.x), .names = "{.col}_outlier")) %>%
  # Walk over the (variable, flag column) pairs and move each flag column after its variable
  reduce2(
    .x = outlier_vars,
    .y = paste0(outlier_vars, "_outlier"),
    .f = ~ relocate(..1, all_of(..3), .after = all_of(..2)),
    .init = .
  )
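Since the question asks for "yes"/"no" labels, the logical flags produced by this approach still need a conversion step. Below is a small base R sketch of that step on a toy data frame. The toy values and the name flagged are illustrative assumptions, not output from the answer's code.

```r
# Toy data standing in for the result of the pipeline (illustrative values only)
flagged <- data.frame(
  Weight = c(65.1, 22, 150),
  Weight_outlier = c(FALSE, TRUE, TRUE)
)

# Find every column whose name ends in "_outlier" and relabel TRUE/FALSE
flag_cols <- grep("_outlier$", names(flagged), value = TRUE)
flagged[flag_cols] <- lapply(flagged[flag_cols], function(f) ifelse(f, "yes", "no"))

flagged
```

Wrapping the test inside across() as ifelse(.x %in% DescTools::Outlier(.x), "yes", "no") should have the same effect without a separate step.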
Answer 2
Score: 0
You can use a for loop to build a copy of your data frame column by column, inserting an _outlier column after each original column that is numeric.
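library(DescTools)

# Start the copy with the ID column, which is not tested for outliers
mydata3 <- mydata2[, "ID", drop = FALSE]

for (cname in names(mydata2)[-1]) {
  # Copy the original column
  mydata3[[cname]] <- mydata2[[cname]]
  if (is.numeric(mydata2[[cname]])) {
    outliers <- rep("no", nrow(mydata2))
    # value = FALSE makes Outlier() return positions instead of values
    outliers[Outlier(mydata2[[cname]], value = FALSE)] <- "yes"
    # Keep missing values missing instead of labelling them "no"
    outliers[is.na(mydata2[[cname]])] <- NA
    mydata3[[paste0(cname, "_outlier")]] <- outliers
  }
}

mydata3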
ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1 1 65.1 no M 1.30 yes A 1200 no
2 2 70.3 no F 1.65 no B 2000 no
3 3 22.0 yes F 1.75 no C 2100 no
4 4 45.0 no F 1.86 no D 2550 no
5 5 150.0 yes M 1.79 no E 12000 yes
6 6 68.5 no F 1.76 no F 800 yes
7 7 87.2 no M 1.10 yes G 3000 no
8 8 66.4 no M 2.65 yes H 2400 no
9 9 59.2 no F 1.75 no I 1895 no
10 10 72.3 no F 1.65 no J 2300 no
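With a dozen numeric variables it may be convenient to wrap the loop in a function. The sketch below is one possible way to do that in base R. The names add_outlier_flags, boxplot_idx, skip and detect are hypothetical, and the default detector is a plain 1.5 × IQR rule used as a stand-in so the example runs without DescTools.

```r
# Stand-in detector (assumption): positions of values beyond 1.5 * IQR of the quartiles
boxplot_idx <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)  # which() drops NA comparisons
}

# Hypothetical helper: insert a "<name>_outlier" column after each numeric column.
# `skip` lists numeric columns (such as identifiers) that should not be tested.
add_outlier_flags <- function(df, skip = character(), detect = boxplot_idx) {
  pieces <- lapply(names(df), function(cname) {
    col <- df[cname]  # one-column data frame, keeps the column name
    if (!is.numeric(df[[cname]]) || cname %in% skip) return(col)
    flag <- rep("no", nrow(df))
    flag[detect(df[[cname]])] <- "yes"
    flag[is.na(df[[cname]])] <- NA  # missing values stay missing
    col[[paste0(cname, "_outlier")]] <- flag
    col
  })
  do.call(cbind, pieces)  # glue the pieces back together in order
}

toy <- data.frame(ID = 1:6, Score = c(10, 11, 12, 11, 10, 50), Group = letters[1:6])
out <- add_outlier_flags(toy, skip = "ID")
out
```

To reproduce the behaviour of the loop above, the detector can be swapped for DescTools by passing detect = function(x) DescTools::Outlier(x, value = FALSE).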