How to automatically create columns to identify outliers for each numeric variable

Question

I want to automatically create an outlier column for each numeric variable. The column that flags a variable's outliers must sit immediately after that variable, and its values must be either yes or no.
Is it possible to automate this?

    ID <- 1:10
    Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
    Sex <- c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F")
    Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
    City <- head(LETTERS, 10)
    Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
    mydata2 <- data.frame(ID, Weight, Sex, Height, City, Income)

I use the Outlier() function from the DescTools package to identify the outliers:

    library(DescTools)

    Outlier(mydata2$Weight)
    #> [1] 22 150
    Outlier(mydata2$Height)
    #> [1] 1.30 1.10 2.65
    Outlier(mydata2$Income)
    #> [1] 12000 800
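
For readers without DescTools installed, the flags can be sketched in base R. This assumes Outlier() is applying the usual boxplot rule (values more than 1.5 × IQR beyond the quartiles), which I believe is its default method, so check ?Outlier before relying on it. iqr_outlier_flags is a hypothetical helper name, not a DescTools function.

```r
# Hypothetical helper (not part of DescTools): TRUE for values outside
# [Q1 - coef * IQR, Q3 + coef * IQR], using R's default quantile type.
iqr_outlier_flags <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- coef * diff(q)
  x < q[1] - fence | x > q[2] + fence
}

# The three numeric vectors from the question
Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)

flags <- iqr_outlier_flags(Weight)
Weight[flags]               # the flagged values
ifelse(flags, "yes", "no")  # labels in the yes/no form the question asks for
```

On the three vectors from the question this rule selects the same values that Outlier() reports above. Returning a logical vector rather than the values themselves makes the yes/no labelling a one-liner.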

This is the expected dataset:

Weight_outlier comes just after Weight, Height_outlier just after Height, and so on.

I have a dozen numeric variables in my real dataset.

       ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
    1   1   65.1             no   M   1.30            yes    A   1200             no
    2   2   70.3             no   F   1.65             no    B   2000             no
    3   3   22.0            yes   F   1.75             no    C   2100             no
    4   4   45.0             no   F   1.86             no    D   2550             no
    5   5  150.0            yes   M   1.79             no    E  12000            yes
    6   6   68.5             no   F   1.76             no    F    800            yes
    7   7   87.2             no   M   1.10            yes    G   3000             no
    8   8   66.4             no   M   2.65            yes    H   2400             no
    9   9   59.2             no   F   1.75             no    I   1895             no
    10 10   72.3             no   F   1.65             no    J   2300             no

Answer 1

Score: 1

mutate(across(...)) flags the values that match each column's outliers, and relocate() then moves each new column to sit right after its source column; reduce2() from purrr repeats the relocation for every variable.

    library(tidyverse)

    # Names of the columns you wish to test
    outlier_vars <- mydata2 %>% dplyr::select(Weight, Height, Income) %>% names()

    mydata2 %>%
      # Flag values that match the outliers in each column, as yes/no,
      # naming each new column "<column>_outlier"
      mutate(across(all_of(outlier_vars),
                    ~ ifelse(.x %in% DescTools::Outlier(.x), "yes", "no"),
                    .names = "{.col}_outlier")) %>%
      # Use reduce2 with relocate to move each flag column after its source column
      reduce2(
        .x = outlier_vars,
        .y = paste0(outlier_vars, "_outlier"),
        .f = function(data, var, flag) relocate(data, all_of(flag), .after = all_of(var)),
        .init = .
      )
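
If you would rather not pull in purrr just for the reordering, the target column order can be computed directly. This is a minimal base R sketch, assuming the flag columns have already been appended at the end of the data frame and follow the <name>_outlier naming used above. The three rows are rows 1, 3 and 5 of the expected dataset in the question.

```r
# Rows 1, 3 and 5 of the expected dataset, with the flag columns still at the end
df <- data.frame(
  ID = c(1L, 3L, 5L),
  Weight = c(65.1, 22, 150),
  Sex = c("M", "F", "M"),
  Height = c(1.3, 1.75, 1.79),
  Weight_outlier = c("no", "yes", "yes"),
  Height_outlier = c("yes", "no", "no")
)

flagged_vars <- c("Weight", "Height")
flag_cols <- paste0(flagged_vars, "_outlier")

# Original columns, in their original order
base_cols <- setdiff(names(df), flag_cols)

# Emit each original column, followed by its flag column when it has one
new_order <- unlist(lapply(base_cols, function(col) {
  if (col %in% flagged_vars) c(col, paste0(col, "_outlier")) else col
}))

df_ordered <- df[, new_order]
names(df_ordered)
```

Because the order is built from the names alone, the same few lines work unchanged however many variables are flagged.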

Answer 2

Score: 0

You can use a for loop to build a copy of your dataframe column by column, inserting _outlier columns if the original column is numeric.

    library(DescTools)

    # Start from the ID column, then copy the remaining columns one at a time
    mydata3 <- mydata2[, "ID", drop = FALSE]
    for (cname in names(mydata2[, -1])) {
      mydata3[[cname]] <- mydata2[[cname]]
      if (is.numeric(mydata2[[cname]])) {
        outliers <- rep("no", nrow(mydata2))
        # value = FALSE returns the positions of the outliers rather than their values
        outliers[Outlier(mydata2[[cname]], value = FALSE)] <- "yes"
        # Missing values are neither "yes" nor "no"
        outliers[is.na(mydata2[[cname]])] <- NA
        mydata3[[paste0(cname, "_outlier")]] <- outliers
      }
    }
    mydata3
    #>    ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
    #> 1   1   65.1             no   M   1.30            yes    A   1200             no
    #> 2   2   70.3             no   F   1.65             no    B   2000             no
    #> 3   3   22.0            yes   F   1.75             no    C   2100             no
    #> 4   4   45.0             no   F   1.86             no    D   2550             no
    #> 5   5  150.0            yes   M   1.79             no    E  12000            yes
    #> 6   6   68.5             no   F   1.76             no    F    800            yes
    #> 7   7   87.2             no   M   1.10            yes    G   3000             no
    #> 8   8   66.4             no   M   2.65            yes    H   2400             no
    #> 9   9   59.2             no   F   1.75             no    I   1895             no
    #> 10 10   72.3             no   F   1.65             no    J   2300             no
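
The same loop can be wrapped in a function so it works on any data frame. This is only a sketch: add_outlier_columns and iqr_flags are hypothetical names, and the default detector is a base R 1.5 × IQR rule standing in for Outlier(), so results may differ from DescTools if you use another method. The last Weight value is replaced by NA purely to show how missing values are handled.

```r
# Stand-in detector (an assumption, not DescTools): TRUE outside
# Q1 - 1.5 * IQR .. Q3 + 1.5 * IQR, and NA wherever the input is NA.
iqr_flags <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  x < q[1] - 1.5 * diff(q) | x > q[2] + 1.5 * diff(q)
}

# Hypothetical helper: copy the columns in order, adding "<name>_outlier"
# right after every numeric column that is not listed in `skip`.
add_outlier_columns <- function(df, skip = "ID", detect = iqr_flags) {
  out <- list()
  for (cname in names(df)) {
    out[[cname]] <- df[[cname]]
    if (is.numeric(df[[cname]]) && !(cname %in% skip)) {
      # ifelse() keeps NA where the detector returns NA
      out[[paste0(cname, "_outlier")]] <- ifelse(detect(df[[cname]]), "yes", "no")
    }
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Question data, with the last Weight set to NA for illustration
demo <- data.frame(
  ID = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, NA),
  Sex = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F")
)

result <- add_outlier_columns(demo)
names(result)
result$Weight_outlier
```

Passing the detector as an argument keeps the column bookkeeping separate from the outlier rule, so a wrapper around DescTools::Outlier() that returns a logical vector could be swapped in without touching the loop.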

huangapple
  • Posted on January 6, 2023 at 12:47:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75027023.html