January 6, 2023, 12:47:14

How to automatically create columns to identify outliers for each numeric variable

Question

I want to automatically create an outlier column for each numeric variable. The column that flags a variable's outliers must come immediately after that variable, and its value must be either yes or no.
Is it possible to automate this?

    ID <- 1:10
    Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
    Sex <- c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F")
    Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
    City <- head(LETTERS, 10)
    Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)

    mydata2 <- data.frame(ID, Weight, Sex, Height, City, Income)

I use the Outlier() function from the DescTools package to identify the outliers:

    library(DescTools)

    Outlier(mydata2$Weight)
[1]  22 150
    Outlier(mydata2$Height)
[1] 1.30 1.10 2.65
    Outlier(mydata2$Income)
[1] 12000   800
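
For readers without DescTools installed, the flagged values above can be reproduced in base R. This is only a minimal sketch: it assumes Outlier() is being called with its default boxplot rule, under which a value counts as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third. The helper name iqr_outliers is hypothetical and is not part of DescTools.

```r
# Hypothetical base-R helper (not part of DescTools): returns the values of x
# lying more than 1.5 * IQR below the first or above the third quartile.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * diff(q)
  x[x < q[1] - fence | x > q[2] + fence]
}

# Sample data copied from the question
Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)

print(iqr_outliers(Weight))
print(iqr_outliers(Height))
print(iqr_outliers(Income))
```

On this sample data the sketch picks out the same values as the three Outlier() calls, which suggests the assumption about the rule holds here.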

This is the expected dataset:

Weight_outlier comes just after Weight, Height_outlier just after Height, and so on.

I have a dozen numeric variables in my real dataset.

   ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1   1   65.1             no   M   1.30            yes    A   1200             no
2   2   70.3             no   F   1.65             no    B   2000             no
3   3   22.0            yes   F   1.75             no    C   2100             no
4   4   45.0             no   F   1.86             no    D   2550             no
5   5  150.0            yes   M   1.79             no    E  12000            yes
6   6   68.5             no   F   1.76             no    F    800            yes
7   7   87.2             no   M   1.10            yes    G   3000             no
8   8   66.4             no   M   2.65            yes    H   2400             no
9   9   59.2             no   F   1.75             no    I   1895             no
10 10   72.3             no   F   1.65             no    J   2300             no

Answer 1

Score: 1

mutate(across(...)) flags the values that match the outliers in each column; relocate(), applied to each column pair with reduce2() from purrr, then moves every new column to just after its source column.

library(tidyverse)
# Get the names of the columns you wish to test
outlier_vars <- mydata2 %>% dplyr::select(Weight, Height, Income) %>% names()
mydata2 %>%
  # Flag the values that match the outliers in each column as "yes"/"no"
  # (the question asks for yes/no), naming each new column "<column>_outlier"
  mutate(across(all_of(outlier_vars),
                ~ if_else(.x %in% DescTools::Outlier(.x), "yes", "no"),
                .names = "{.col}_outlier")) %>%
  # Use reduce2() to relocate each new column to just after its source column
  reduce2(
    .x = outlier_vars,
    .y = paste0(outlier_vars, "_outlier"),
    .f = function(data, var, flag) relocate(data, all_of(flag), .after = all_of(var)),
    .init = .
  )
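
The same add-then-reorder idea can also be sketched without the tidyverse. The following is a minimal base-R illustration, not the answer author's code. It swaps DescTools::Outlier() for a hypothetical helper, is_iqr_outlier(), which assumes the 1.5 times IQR boxplot rule, and it builds the interleaved column order in a single step instead of relocating columns one at a time.

```r
# Hypothetical helper (not part of DescTools): TRUE where a value lies more
# than 1.5 * IQR outside the quartiles, assuming the usual boxplot rule.
is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * diff(q)
  x < q[1] - fence | x > q[2] + fence
}

# Sample data copied from the question
mydata2 <- data.frame(
  ID     = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3),
  Sex    = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F"),
  Height = c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65),
  City   = head(LETTERS, 10),
  Income = c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
)
outlier_vars <- c("Weight", "Height", "Income")

# Step 1: compute one yes/no flag column per variable and append them all
flags <- lapply(mydata2[outlier_vars],
                function(x) ifelse(is_iqr_outlier(x), "yes", "no"))
names(flags) <- paste0(outlier_vars, "_outlier")
out <- cbind(mydata2, as.data.frame(flags, stringsAsFactors = FALSE))

# Step 2: build the interleaved column order and reorder once
new_order <- unlist(lapply(names(mydata2), function(nm) {
  if (nm %in% outlier_vars) c(nm, paste0(nm, "_outlier")) else nm
}))
out <- out[, new_order]
print(out)
```

Reordering by a precomputed name vector keeps the cost independent of how many variables are flagged, which may matter little for a dozen columns but keeps the logic in one place.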

Answer 2

Score: 0

You can use a for loop to build a copy of your data frame column by column, inserting an _outlier column right after each numeric column.

library(DescTools)
# Start the copy with the ID column only, so ID itself is never tested
mydata3 <- mydata2[, "ID", drop = FALSE]
for (cname in names(mydata2[, -1])) {
  # Copy the original column
  mydata3[[cname]] <- mydata2[[cname]]
  if (is.numeric(mydata2[[cname]])) {
    # Default to "no", then mark the positions reported by Outlier() as "yes"
    outliers <- rep("no", nrow(mydata2))
    outliers[Outlier(mydata2[[cname]], value = FALSE)] <- "yes"
    # Keep missing values missing rather than labelling them "no"
    outliers[is.na(mydata2[[cname]])] <- NA
    mydata3[[paste0(cname, "_outlier")]] <- outliers
  }
}
mydata3


   ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1   1   65.1             no   M   1.30            yes    A   1200             no
2   2   70.3             no   F   1.65             no    B   2000             no
3   3   22.0            yes   F   1.75             no    C   2100             no
4   4   45.0             no   F   1.86             no    D   2550             no
5   5  150.0            yes   M   1.79             no    E  12000            yes
6   6   68.5             no   F   1.76             no    F    800            yes
7   7   87.2             no   M   1.10            yes    G   3000             no
8   8   66.4             no   M   2.65            yes    H   2400             no
9   9   59.2             no   F   1.75             no    I   1895             no
10 10   72.3             no   F   1.65             no    J   2300             no
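
For a real dataset with many numeric variables, the loop can be wrapped in a reusable function. The sketch below is one possible way to do that, under stated assumptions: add_outlier_columns() and is_iqr_outlier() are hypothetical names, the toy data frame is invented for illustration, and the outlier test is a base-R stand-in that assumes the 1.5 times IQR boxplot rule rather than calling DescTools.

```r
# Hypothetical helper: logical vector marking values more than 1.5 * IQR
# outside the quartiles; NA inputs stay NA.
is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * diff(q)
  x < q[1] - fence | x > q[2] + fence
}

# Copy df column by column, adding "<name>_outlier" right after each numeric
# column that is not listed in `skip`.
add_outlier_columns <- function(df, skip = character(), flag = is_iqr_outlier) {
  out <- df[, 0, drop = FALSE]  # same rows, no columns yet
  for (cname in names(df)) {
    out[[cname]] <- df[[cname]]
    if (is.numeric(df[[cname]]) && !(cname %in% skip)) {
      out[[paste0(cname, "_outlier")]] <- ifelse(flag(df[[cname]]), "yes", "no")
    }
  }
  out
}

# Invented toy data with one missing value and one extreme value
toy <- data.frame(
  ID    = 1:6,
  Score = c(10, 11, NA, 12, 11, 50),
  Group = c("a", "b", "a", "b", "a", "b")
)
result <- add_outlier_columns(toy, skip = "ID")
print(result)
```

Passing the test as the flag argument means a different rule can be substituted without touching the loop, and the skip argument plays the role that starting the copy from the ID column plays in the loop above.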
