Is there a way to pull outliers into a separate df?
Question
I have a data frame with 150 columns and 200 rows. I want to go through each column and pull out any data points that are more than 3 standard deviations from that column's mean.
Sample | G-198804 | G-198712 | G-228253 | G-198899 |
---|---|---|---|---|
X1027 | 15.100481 | 15.949672 | 13.783062 | 17.106806 |
X1104 | 14.905931 | 15.766908 | 13.885380 | 17.134476 |
X5010 | 15.268376 | 16.457303 | 13.447923 | 17.345957 |
X5023 | 15.513746 | 16.457871 | 13.848918 | 17.634144 |
X5425 | 15.093679 | 16.085498 | 13.253646 | 17.066823 |
X7CUH | 15.471564 | 16.417165 | 13.764880 | 17.365255 |
X8VHB | 15.222530 | 16.440389 | 13.146401 | 17.158754 |
VWU2 | 14.999256 | 16.121702 | 13.261694 | 17.193140 |
CUKX | 14.795677 | 16.076999 | 13.325234 | 17.145046 |
I used the following code to replace the outliers with NA, but then realized I need the outliers in a separate data frame. Is there a way to modify it so that it returns only the row and column names of the cells that are outliers?
newtpose = tpose_genexp %>%
  mutate_at(.vars = vars(contains("G")),
            .funs = ~ifelse(abs(.) > mean(.) + 3 * sd(.), NA, .))
Ideally, the new data frame would look like this:
Sample | Gene |
---|---|
X1027 | G-198712 |
X7CUH | G-228253 |
Answer 1
Score: 0
You could define a new data frame, say df_out, in which all values that are not outliers are set to NA:
library(dplyr)
# Keep values more than 3 SD from the column mean; set the rest to NA
df_out <- df %>%
  mutate(across(starts_with("G"),
                ~ifelse(abs(. - mean(.)) > 3 * sd(.), ., NA)))
If you want the outliers in a two-column data frame, you can add pivot_longer:
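library(dplyr); library(tidyr); library(tibble)
df %>%
  rownames_to_column("Sample") %>%  # assumes sample IDs are stored as row names
  mutate(across(starts_with("G"),
                ~ifelse(abs(. - mean(.)) > 3 * sd(.), ., NA))) %>%  # non-outliers become NA
  pivot_longer(starts_with("G"), names_to = "Gene",
               values_drop_na = TRUE) %>%  # one row per outlier cell
  select(Sample, Gene)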
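For comparison, the same Sample/Gene lookup can be sketched in base R with which(arr.ind = TRUE). This is only an illustration under stated assumptions: the toy data frame, its gene names (G-1 to G-4), its sample names (S1 to S200), and the two planted outliers are all made up, and the sketch assumes sample IDs are stored as row names and every column is numeric.

```r
# Toy data standing in for tpose_genexp: rows are samples, columns are genes.
# All names and values here are invented for illustration.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(200 * 4, mean = 15, sd = 0.3), nrow = 200))
names(toy) <- c("G-1", "G-2", "G-3", "G-4")
rownames(toy) <- paste0("S", seq_len(nrow(toy)))

# Plant two obvious outliers so the result is guaranteed to be non-empty.
toy["S7", "G-2"] <- 30
toy["S42", "G-3"] <- 0

# Column-wise z-scores: scale() subtracts each column mean and divides by its SD.
z <- scale(as.matrix(toy))

# Row/column positions of every cell more than 3 SD from its column mean.
idx <- which(abs(z) > 3, arr.ind = TRUE)

# Translate positions back into sample and gene names.
outliers <- data.frame(
  Sample = rownames(toy)[idx[, "row"]],
  Gene   = colnames(toy)[idx[, "col"]]
)
outliers
```

Because the test is applied to z-scores, it flags values on both sides of the mean. The random draws may also produce a few genuine 3 SD points besides the two planted ones, so the exact number of rows is not fixed.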