Weird plot of LASSO regression, what's the problem and how to fix it?


Question

I did LASSO regression on my data. However, the two plots (the coefficient plot and the cross-validation plot) do not look right.

The problem with the coefficient plot: some coefficients grow and then drop as lambda changes. In published papers, the coefficients only shrink as lambda increases; they do not grow first.
[coefficient plot]
The problem with the cross-validation plot: part of the red line is not continuous with the rest.
[cross-validation plot]
My data: https://raw.githubusercontent.com/onkaparinga/default/main/train3.csv
My code:

library(readr)
library(glmnet)

train3 <- read_csv('train3.csv')

x <- as.matrix(train3[,-1])  # predictor matrix: everything except the label column
y <- train3$CustomLabel      # binary outcome
cvlasso <- cv.glmnet(x, y, alpha = 1, family = 'binomial')
plot(cvlasso)                # cross-validation curve
plot(cvlasso$glmnet.fit)     # coefficient paths
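
One thing worth checking on the cross-validation side: cv.glmnet assigns observations to folds at random, so the red curve can change shape from run to run. Below is a minimal sketch, assuming the same x and y as above; the seed and the fold count are arbitrary choices:

set.seed(42)                                      # make the fold assignment reproducible
foldid <- sample(rep(1:10, length.out = nrow(x)))
cvlasso <- cv.glmnet(x, y, alpha = 1, family = 'binomial',
                     foldid = foldid, type.measure = 'deviance')
plot(cvlasso)
plot(cvlasso$glmnet.fit, xvar = 'lambda', label = TRUE)  # coefficient paths vs log(lambda)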

Before the LASSO regression, I actually ran a correlation analysis to remove variables that are highly correlated (>0.9) with others.

# read the dataset (the pipeline below also needs dplyr)
library(dplyr)
train2 <- read_csv('train2.csv')

# get the variables that are not normally distributed (Shapiro-Wilk p < 0.05)
non_norm_vars <- train2 %>%
  summarise(across(1:(ncol(train2)-1),~shapiro.test(.x)$p.value)) %>%
  t() %>%
  as.data.frame() %>%
  filter(V1<0.05) %>%
  rownames()

# get the normally distributed variables (head(-1) drops the label column)
norm_vars <- colnames(train2)[!colnames(train2) %in% non_norm_vars] %>%
  head(-1)

# rearrange 'train2' so the normally distributed variables come first, which makes the block replacement below easier
df_nonnorm <- train2[,non_norm_vars]
df_norm <- train2[,norm_vars]
train3 <- bind_cols(df_norm,df_nonnorm)

# correlation matrices: Pearson for the normal variables, Spearman for all variables
cor_norm <- cor(df_norm,method = 'pearson')
cor_all <- cor(train3,method = 'spearman')

# overwrite the normal-variable block of the Spearman matrix with the Pearson correlations
num_norm <- dim(cor_norm)[1]
cor_all[1:num_norm,1:num_norm] <- cor_norm

# count how many 'high correlations' (>0.9) each variable has, and order descending
var_seq <- cor_all %>%
  as_tibble() %>%
  reframe(across(everything(),~sum(abs(.x)>0.9))) %>%
  t() %>%
  as.data.frame() %>%
  arrange(desc(V1)) %>%
  rownames()

# slice_seq: positions of var_seq within colnames(cor_all)
slice_seq <- match(var_seq,colnames(cor_all))

# reorder the matrix: variables with the most 'high correlations' first, fewest last
cor_all <- cor_all %>%
  as_tibble() %>%
  select(all_of(var_seq)) %>%
  slice(slice_seq) %>%
  as.matrix()
rownames(cor_all) <- colnames(cor_all)

# keep only the lower triangle (zero the upper triangle and the diagonal)
cor_all[upper.tri(cor_all)] <- 0
diag(cor_all) <- 0

# keep variables with zero remaining 'high correlations'
cor_vars <- cor_all %>%
  as_tibble() %>%
  summarise(across(everything(),~any(abs(.x)>0.9))) %>%
  t() %>%
  as.data.frame() %>%
  filter(V1 == FALSE) %>%
  rownames()

# train3 now holds the label plus the retained variables
train3 <- train2 %>%
  select(CustomLabel, all_of(cor_vars))
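
For comparison, the same pruning can be sketched much more compactly with caret::findCorrelation. This is an assumed alternative (it works from a single Spearman matrix rather than the mixed Pearson/Spearman matrix built above), not the method that produced the linked train3:

library(caret)

cor_mat <- cor(train2 %>% select(-CustomLabel), method = 'spearman')
drop_vars <- findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)  # columns flagged for removal
train3 <- train2 %>% select(-all_of(drop_vars))                    # CustomLabel is kept automatically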

Hope my poor English did not confuse you...

train2: https://raw.githubusercontent.com/onkaparinga/default/main/train2.csv


Answer 1

Score: 0

Your dataset train3 has substantial collinearity, i.e. 4,244 bivariate combinations have an r² above 0.9:

corr_mat <- cor(train3[-1])

expand.grid(A = dimnames(corr_mat)[[1]],
            B = dimnames(corr_mat)[[2]]
            ) |>
  cbind(r2 = as.vector(corr_mat)^2) |>
  as_tibble() |>
  filter(as.vector(upper.tri(corr_mat)),
         r2 > 0.9,
         r2 < 1
         ) |>
  print(n = 3)
# A tibble: 4,244 x 3
  A     B        r2
  <fct> <fct> <dbl>
1 A693  A1228 0.996
2 A693  A1597 0.988
3 A1228 A1597 0.992
# i 4,241 more rows
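
This also bears directly on the coefficient plot: when predictors are near-duplicates, the lasso can shift weight between them along the lambda path, so individual coefficients may grow and then shrink again instead of decaying monotonically.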

Moreover, 31 of your features have more than 3 extreme outliers, each off the mean by four standard deviations or more:

train3 |>
  summarise(across(where(is.numeric),
                   ~ sum(abs(mean(.x) - .x) > 4 * sd(.x))  # count points beyond 4 SD
                   )
            ) |>
  t() |>
  as.data.frame() |>
  filter(V1 > 3) |>
  nrow()

## [1] 31

Both the multicollinearity and the outliers can severely distort your regression. It would thus be good to put some dimensionality reduction and outlier management at the front of your modelling chain, before fitting the model.
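
A minimal sketch of such a chain, assuming the x and y from the question; the 4-SD cap and the 95% variance cutoff are arbitrary illustrative choices, not tuned values:

# cap each predictor at mean +/- 4 SD, then reduce dimensionality with PCA
x_capped <- apply(x, 2, function(v) {
  lo <- mean(v) - 4 * sd(v)
  hi <- mean(v) + 4 * sd(v)
  pmin(pmax(v, lo), hi)
})
pca <- prcomp(x_capped, center = TRUE, scale. = TRUE)
n_pc <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.95)[1]  # components covering ~95% variance
cvlasso_pc <- cv.glmnet(pca$x[, 1:n_pc, drop = FALSE], y, alpha = 1, family = 'binomial')
plot(cvlasso_pc)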

