How to automatically create columns to identify outliers for each numeric variable

Question

I want to automatically create an outlier column for each numeric variable. The column that flags a variable's outliers must sit immediately after that variable, and its values must be either yes or no.
Is it possible to automate this?

    ID <- 1:10
    Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
    Sex <- c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F")
    Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
    City <- head(LETTERS, 10)
    Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)

    mydata2 <- data.frame(ID, Weight, Sex, Height, City, Income)

I use the Outlier function from the DescTools package to identify the outliers:

    Outlier(mydata2$Weight)
[1]  22 150

    Outlier(mydata2$Height)
[1] 1.30 1.10 2.65

    Outlier(mydata2$Income)
[1] 12000   800
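
For readers without DescTools, the values above can be reproduced with a small base R sketch. It assumes Outlier is applying the usual boxplot rule (values more than 1.5 times the interquartile range beyond the quartiles); the helper name is_outlier_iqr is hypothetical and not part of any package.

```r
# Hypothetical helper (an assumption, not the DescTools implementation):
# TRUE where a value lies outside the quartiles by more than k * IQR.
is_outlier_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * diff(q)          # diff(q) is the interquartile range
  x < q[1] - fence | x > q[2] + fence
}

Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)

Weight[is_outlier_iqr(Weight)]   # 22 150
Height[is_outlier_iqr(Height)]   # 1.30 1.10 2.65
Income[is_outlier_iqr(Income)]   # 12000 800
```

Because the helper returns a logical vector of the same length as its input, it can be turned directly into a yes/no column with ifelse().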

This is the expected dataset:

Weight_outlier comes just after Weight, Height_outlier just after Height, and so on.

I have a dozen numeric variables in my real dataset.

   ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1   1   65.1             no   M   1.30            yes    A   1200             no
2   2   70.3             no   F   1.65             no    B   2000             no
3   3   22.0            yes   F   1.75             no    C   2100             no
4   4   45.0             no   F   1.86             no    D   2550             no
5   5  150.0            yes   M   1.79             no    E  12000            yes
6   6   68.5             no   F   1.76             no    F    800            yes
7   7   87.2             no   M   1.10            yes    G   3000             no
8   8   66.4             no   M   2.65            yes    H   2400             no
9   9   59.2             no   F   1.75             no    I   1895             no
10 10   72.3             no   F   1.65             no    J   2300             no

Answer 1

Score: 1

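mutate(across(...)) flags the values that match an outlier in each column, and relocate(), applied through reduce2() from purrr, moves each new column to just after its source column. Note that the new columns hold TRUE/FALSE rather than yes/no.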

library(tidyverse)

# Get the names of the columns you wish to test
outlier_vars <- mydata2 %>% dplyr::select(Weight, Height, Income) %>% names()

mydata2 %>%
  # Flag values that match an outlier in each column, naming the result <col>_outlier
  mutate(across(all_of(outlier_vars), ~ .x %in% DescTools::Outlier(.x), .names = "{.col}_outlier")) %>%
  # Use reduce2 with relocate: ..1 is the data frame so far,
  # ..2 the source column name, ..3 the matching _outlier column name
  reduce2(
    .x = outlier_vars,
    .y = paste0(outlier_vars, "_outlier"),
    .f = ~ relocate(..1, ..3, .after = ..2),
    .init = .
  )
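
If the yes/no labels from the question are required, one possible variant is sketched below. It assumes dplyr and DescTools are installed, and it swaps the reduce2/relocate step for a plain column-order vector; col_order, flagged and result are illustrative names, not part of the answer above.

```r
library(dplyr)

mydata2 <- data.frame(
  ID     = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3),
  Sex    = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F"),
  Height = c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65),
  City   = head(LETTERS, 10),
  Income = c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
)

outlier_vars <- c("Weight", "Height", "Income")

# Add one "yes"/"no" column per tested variable (appended at the end for now)
flagged <- mydata2 %>%
  mutate(across(
    all_of(outlier_vars),
    ~ if_else(.x %in% DescTools::Outlier(.x), "yes", "no"),
    .names = "{.col}_outlier"
  ))

# Build the final column order: each tested variable followed by its flag column
col_order <- unlist(lapply(names(mydata2), function(nm) {
  if (nm %in% outlier_vars) c(nm, paste0(nm, "_outlier")) else nm
}))

result <- flagged[, col_order]
result
```

Reordering by name avoids one relocate() call per variable, which may be easier to follow when there are a dozen or more numeric columns.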

Answer 2

Score: 0

You can use a for loop to build a copy of your dataframe column by column, inserting _outlier columns if the original column is numeric.

library(DescTools)

# Start the copy with the ID column only
mydata3 <- mydata2[, "ID", drop = FALSE]

for (cname in names(mydata2[, -1])) {
  # Copy the original column, then add its flag column if it is numeric
  mydata3[[cname]] <- mydata2[[cname]]
  if (is.numeric(mydata2[[cname]])) {
    outliers <- rep("no", nrow(mydata2))
    # value = FALSE returns the positions of the outliers instead of their values
    outliers[Outlier(mydata2[[cname]], value = FALSE)] <- "yes"
    outliers[is.na(mydata2[[cname]])] <- NA
    mydata3[[paste0(cname, "_outlier")]] <- outliers
  }
}

mydata3

   ID Weight Weight_outlier Sex Height Height_outlier City Income
1   1   65.1             no   M   1.30            yes    A   1200
2   2   70.3             no   F   1.65             no    B   2000
3   3   22.0            yes   F   1.75             no    C   2100
4   4   45.0             no   F   1.86             no    D   2550
5   5  150.0            yes   M   1.79             no    E  12000
6   6   68.5             no   F   1.76             no    F    800
7   7   87.2             no   M   1.10            yes    G   3000
8   8   66.4             no   M   2.65            yes    H   2400
9   9   59.2             no   F   1.75             no    I   1895
10 10   72.3             no   F   1.65             no    J   2300
   Income_outlier
1              no
2              no
3              no
4              no
5             yes
6             yes
7              no
8              no
9              no
10             no
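
The same loop can be wrapped in a function so it is reusable across datasets. The following is only a sketch: add_outlier_flags and its skip argument are hypothetical names introduced here, and missing values are excluded before calling Outlier so that they end up flagged as NA.

```r
library(DescTools)

# Hypothetical wrapper: returns df with a <name>_outlier column inserted
# right after every numeric column that is not listed in `skip`.
add_outlier_flags <- function(df, skip = character(0)) {
  pieces <- lapply(names(df), function(cname) {
    col <- df[[cname]]
    out <- df[cname]                      # one-column data frame
    if (is.numeric(col) && !(cname %in% skip)) {
      flag <- rep("no", length(col))
      ok <- which(!is.na(col))            # positions of non-missing values
      flag[ok[Outlier(col[ok], value = FALSE)]] <- "yes"
      flag[is.na(col)] <- NA
      out[[paste0(cname, "_outlier")]] <- flag
    }
    out
  })
  do.call(cbind, pieces)
}

mydata2 <- data.frame(
  ID     = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3),
  Sex    = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F"),
  Height = c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65),
  City   = head(LETTERS, 10),
  Income = c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
)

mydata3 <- add_outlier_flags(mydata2, skip = "ID")
mydata3
```

Passing skip = "ID" keeps identifier columns from being tested, which the loop above achieves by starting from the second column.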

huangapple
  • Posted on January 6, 2023 at 12:47:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75027023.html