
Weird plot of LASSO regression, what's the problem and how to fix it?

Question

I did LASSO regression for my data.
However, the two plots (the coefficient path plot and the cross-validation plot) don't look right.

The problem with the coefficient plot: some coefficients grow and then drop as lambda changes. In published papers, coefficients shrink towards zero as lambda increases; they don't grow first.

[coefficient path plot]

The problem with the cross-validation plot: part of the red line is not continuous with the rest.

[cross-validation plot]

My data: https://raw.githubusercontent.com/onkaparinga/default/main/train3.csv

My code:

library(readr)
library(glmnet)

train3 <- read_csv('train3.csv')

x <- as.matrix(train3[,-1])   # predictors: everything except the first column
y <- train3$CustomLabel       # binary response
cvlasso <- cv.glmnet(x, y, alpha = 1, family = 'binomial')
plot(cvlasso)                 # cross-validation curve
plot(cvlasso$glmnet.fit)      # coefficient paths
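
As a side note, plot.glmnet() draws the paths against the L1 norm by default, while published figures are usually drawn against log(lambda). A minimal sketch of that variant, reusing the fitted object from above:

# coefficient paths against log(lambda); label = TRUE marks each path
plot(cvlasso$glmnet.fit, xvar = 'lambda', label = TRUE)
# dashed line at the lambda selected by cross-validation
abline(v = log(cvlasso$lambda.min), lty = 2)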

Before the LASSO regression, I actually did a correlation analysis to remove variables that are highly correlated (|r| > 0.9) with others.

# read dataset
library(dplyr)   # needed for the pipeline below

train2 <- read_csv('train2.csv')

# get the non-normally distributed variables (Shapiro-Wilk p < 0.05)
non_norm_vars <- train2 %>%
  summarise(across(1:(ncol(train2)-1),~shapiro.test(.x)$p.value)) %>%
  t() %>%
  as.data.frame() %>%
  filter(V1<0.05) %>%
  rownames()

# get the normally distributed variables (drop the response, which is last)
norm_vars <- colnames(train2)[!colnames(train2) %in% non_norm_vars] %>%
  head(-1)

# rearrange 'train2' so the normally distributed variables come first,
# which makes the block replacement below easier
df_nonnorm <- train2[,non_norm_vars]
df_norm <- train2[,norm_vars]
train3 <- bind_cols(df_norm,df_nonnorm)

# correlation matrices: Pearson among the normal variables, Spearman overall
cor_norm <- cor(df_norm,method = 'pearson')
cor_all <- cor(train3,method = 'spearman')

# overwrite the normal-vs-normal block with the Pearson correlations
num_norm <- dim(cor_norm)[1]
cor_all[1:num_norm,1:num_norm] <- cor_norm

# count how many 'high correlations' (|r| > 0.9) each variable has,
# and order the variables by that count, descending
var_seq <- cor_all %>%
  as_tibble() %>%
  reframe(across(everything(),~sum(abs(.x)>0.9))) %>%
  t() %>%
  as.data.frame() %>%
  arrange(desc(V1)) %>%
  rownames()

# slice_seq: positions of var_seq in colnames(cor_all)
slice_seq <- match(var_seq,colnames(cor_all))

# rebuild the matrix so the variables with the most 'high correlations' come first
cor_all <- cor_all %>%
  as_tibble() %>%
  select(all_of(var_seq)) %>%
  slice(slice_seq) %>%
  as.matrix()
rownames(cor_all) <- colnames(cor_all)

# keep only the lower triangle (zero the upper triangle and the diagonal)
cor_all[upper.tri(cor_all)] <- 0
diag(cor_all) <- 0

# keep the variables with zero remaining 'high correlations'
cor_vars <- cor_all %>%
  as_tibble() %>%
  summarise(across(everything(),~any(abs(.x)>0.9))) %>%
  t() %>%
  as.data.frame() %>%
  filter(V1 == FALSE) %>%
  rownames()

# final train3: the response plus the retained variables
train3 <- train2 %>%
  select(CustomLabel, all_of(cor_vars))
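
For comparison, findCorrelation() from the caret package implements a similar greedy pruning: for each pair above the cutoff it flags the variable with the larger mean absolute correlation. A sketch, assuming caret is installed (train3_alt is a hypothetical name, not from the original code):

library(caret)

# predictors only: CustomLabel is the last column of train2
predictors <- train2 %>% select(-CustomLabel)

# greedily flag columns to drop so no pairwise |r| exceeds 0.9
cor_mat <- cor(predictors, method = 'spearman')
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)

train3_alt <- bind_cols(CustomLabel = train2$CustomLabel,
                        predictors[, -drop_idx])

This skips the Pearson/Spearman patchwork above, so the retained set may differ slightly from the manual procedure.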

Hope my poor English did not confuse you ...

train2: https://raw.githubusercontent.com/onkaparinga/default/main/train2.csv


Answer 1

Score: 0

Your data set train3 has substantial collinearity, i.e. 4,244 bivariate combinations have an r² of at least 0.9:

corr_mat <- cor(train3[-1])

expand.grid(A = dimnames(corr_mat)[[1]],
            B = dimnames(corr_mat)[[2]]
            ) |>
  cbind(r2 = as.vector(corr_mat)^2) |>
  as_tibble() |>
  filter(as.vector(upper.tri(corr_mat)),
         r2 > 0.9,
         r2 < 1
         ) |>
  print(n = 3)
# A tibble: 4,244 x 3
  A     B        r2
  <fct> <fct> <dbl>
1 A693  A1228 0.996
2 A693  A1597 0.988
3 A1228 A1597 0.992
# i 4,241 more rows

Moreover, 31 of your features have more than three extreme outliers, i.e. values off the mean by four standard deviations or more:

train3 |>
  summarise(across(where(is.numeric),
                   ~ sum(abs(mean(.x) - .x) > 4 * sd(.x))
                   )
            ) |>
  t() |>
  as.data.frame() |>
  filter(V1 > 3) |>
  nrow()

## [1] 31

Both multicollinearity and outliers can severely topple your regression. It would therefore be good to put dimensionality reduction and some outlier management up front in your modelling chain.
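
A minimal sketch of what that could look like, assuming train3 as loaded in the question (the 4-SD cap and the 30-component cut-off are arbitrary illustration choices, not recommendations):

library(dplyr)
library(glmnet)

# winsorize: clamp every predictor to mean +/- 4 standard deviations
train3_capped <- train3 %>%
  mutate(across(-CustomLabel, function(v) {
    lim <- mean(v) + c(-4, 4) * sd(v)
    pmin(pmax(v, lim[1]), lim[2])
  }))

# reduce dimensionality with PCA, then refit the LASSO on the scores
pc <- prcomp(train3_capped[-1], center = TRUE, scale. = TRUE)
x_pca <- pc$x[, 1:30]    # keep the first 30 components (arbitrary)
cvlasso_pca <- cv.glmnet(x_pca, train3_capped$CustomLabel,
                         alpha = 1, family = 'binomial')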

