Problems drawing an MSE plot for a ridge model in R

Question

I built the optimal ridge model:

library(glmnet)

data(Hitters, package = "ISLR")

x <- Hitters[, c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks", "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI", "CWalks", "PutOuts", "Assists", "Errors")]
y <- Hitters$Salary

x <- scale(x)

lambda_seq <- 10^seq(10, -2, length = 100)

ridge_model <- glmnet(x, y, alpha = 0, lambda = lambda_seq)

cv_ridge <- cv.glmnet(x, y, alpha = 0)

lambda_optimal <- cv_ridge$lambda.min

ridge_model_optimal <- glmnet(x, y, alpha = 0, lambda = lambda_optimal)

summary(ridge_model_optimal)

I would like to draw a bar graph showing the MSE. I tried to compute it like this:

x_train <- model.matrix(Salary ~ ., data = train)[,-1]
y_train <- train$Salary
x_valid <- model.matrix(Salary ~ ., data = valid)[,-1]
y_valid <- valid$Salary

mse_ridge <- caret::RMSE(predict(ridge_model_optimal, newx = x_valid), y_valid)

but I receive this error:

> Error in predict.glmnet(ridge_model_optimal, newx = x_valid) :
> The number of variables in newx must be 16

Do you know how I can fix it?

Answer 1

Score: 1

You haven't shown in the code how you created the train and valid data sets, yet I suspect this is exactly where your problem lies.
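One way to check that suspicion is to compare the columns of the matrix passed as newx with the columns the model was fitted on. The sketch below uses a small made-up data frame (toy, not Hitters) to show how model.matrix(Salary ~ ., ...) expands a factor column into a dummy variable, so the result can have more columns than a hand-picked set of numeric predictors. If train and valid were taken straight from Hitters, its factor columns would plausibly cause the same mismatch, but that is an assumption since the original split is not shown.

```r
# Toy data: two numeric predictors, one factor, and the response
toy <- data.frame(
  Salary = c(100, 200, 150, 300),
  AtBat  = c(300, 400, 350, 500),
  Hits   = c(80, 120, 90, 150),
  League = factor(c("A", "N", "A", "N"))
)

# Predictors chosen by hand, as in the model fit
x_fit <- as.matrix(toy[, c("AtBat", "Hits")])

# Predictors built from every column, as in the validation step
x_new <- model.matrix(Salary ~ ., data = toy)[, -1]

ncol(x_fit)   # number of columns the model would expect
ncol(x_new)   # number of columns actually supplied

# Columns present in newx but unknown to the model
extra_cols <- setdiff(colnames(x_new), colnames(x_fit))
extra_cols

# Keep only the columns the model was fitted on, in the same order
x_new_fixed <- x_new[, colnames(x_fit), drop = FALSE]
```

Subsetting by colnames(x_fit) also guards against the columns arriving in a different order.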

First let us load the data and limit ourselves to complete cases:

library(glmnet)

data(Hitters, package = "ISLR")

Hitters <- Hitters[complete.cases(Hitters), ]

Now we can create our x and y data:

x <- Hitters[,c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks",
               "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI",
               "CWalks", "PutOuts", "Assists", "Errors")]

x <- scale(x)

y <- Hitters$Salary
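A side note on scale(): it stores the column means and standard deviations it used as attributes. If new data ever has to be scaled separately from the data the model was fitted on, those stored values can be reused so both sets share one scale. A minimal sketch on simulated data (x_raw, train_rows and the other names are made up for illustration):

```r
set.seed(42)

# Simulated predictors: 20 rows, 3 columns
x_raw <- matrix(rnorm(60, mean = 50, sd = 10), ncol = 3,
                dimnames = list(NULL, c("a", "b", "c")))
train_rows <- 1:14

# Scale the training rows; scale() records the centre and spread it used
x_train_scaled <- scale(x_raw[train_rows, ])
ctr <- attr(x_train_scaled, "scaled:center")
scl <- attr(x_train_scaled, "scaled:scale")

# Apply the *training* centre and spread to the held-out rows
x_valid_scaled <- scale(x_raw[-train_rows, ], center = ctr, scale = scl)
```

The held-out columns will generally not have mean 0 and standard deviation 1 after this step, which is expected: they are expressed in the training set's units.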

We can create our model like this:

lambda_seq <- 10^seq(10, -2, length = 100)

ridge_model <- glmnet(x, y, alpha = 0, lambda = lambda_seq)

cv_ridge <- cv.glmnet(x, y, alpha = 0)

lambda_optimal <- cv_ridge$lambda.min

ridge_model_optimal <- glmnet(x, y, alpha = 0, lambda = lambda_optimal)

Now let us sample the rows of Hitters at random to create training and validation subsets, with a 2:1 split:

set.seed(1)
train_test <- sample(1:2, nrow(x), TRUE, prob = 2:1)

train <- as.data.frame(cbind(Salary = y[train_test == 1], x[train_test == 1,]))
valid <- as.data.frame(cbind(Salary = y[train_test == 2], x[train_test == 2,]))
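Note that the model here was fitted on all rows before the split, so the validation rows were already seen during fitting and the validation error is likely to be optimistic. A minimal sketch of the alternative, fitting on the training rows only, is shown on simulated data (x_all, y_all, in_train and cv_fit are made-up names, not part of the original code). It relies on predict() for a cv.glmnet object accepting s = "lambda.min", which avoids a separate refit at the chosen lambda.

```r
library(glmnet)

set.seed(1)

# Simulated stand-in for the 16 scaled predictors and the response
n <- 200
p <- 16
x_all <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("V", seq_len(p))))
y_all <- drop(x_all %*% rnorm(p)) + rnorm(n)

# Roughly 2:1 random split into training and validation rows
in_train <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(2, 1))
x_tr <- x_all[in_train, ]
y_tr <- y_all[in_train]
x_va <- x_all[!in_train, ]
y_va <- y_all[!in_train]

# Cross-validate the ridge penalty (alpha = 0) on the training rows only
cv_fit <- cv.glmnet(x_tr, y_tr, alpha = 0)

# Predict the held-out rows at the lambda with the lowest CV error
pred <- predict(cv_fit, newx = x_va, s = "lambda.min")
mse_valid <- mean((drop(pred) - y_va)^2)
mse_valid
```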

And we can get the x and y values for train and valid like so:

x_train <- model.matrix(Salary ~ ., data = train)[,-1]
y_train <- train$Salary
x_valid <- model.matrix(Salary ~ ., data = valid)[,-1]
y_valid <- valid$Salary

Now we can get the RMSE however we like, be it via caret or a simple manual calculation:

caret::RMSE(predict(ridge_model_optimal, newx = x_valid), y_valid)
#> [1] 389.107

sqrt(mean((predict(ridge_model_optimal, newx = x_valid) - y_valid)^2))
#> [1] 389.107
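Since the question asks for the MSE while the code computes the RMSE, it may help to spell out the relationship: the RMSE is the square root of the MSE, so either can be recovered from the other. A small worked example with made-up numbers (actual and predicted are illustrative, not taken from Hitters):

```r
# Made-up salaries and predictions, purely for illustration
actual    <- c(500, 750, 320, 1100)
predicted <- c(450, 800, 400, 1000)

# Errors are -50, 50, 80, -100; their squares sum to 21400
mse  <- mean((predicted - actual)^2)   # 21400 / 4 = 5350
rmse <- sqrt(mse)

mse
rmse
```

Squaring the value returned by caret::RMSE() therefore gives the MSE on the same data.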

You say that you want to plot the RMSE, but it is not clear what you mean by that, since the RMSE here is a single value. Perhaps you want a histogram of the residuals (predicted minus actual)?

hist(predict(ridge_model_optimal, newx = x_valid) - y_valid,
     main = "Residual histogram", xlab = "Predicted - Actual")

Or maybe show the individual errors?

plot(x_valid[,"AtBat"], y_valid, xlab = "At Bat (normalized)",
     ylab = "Salary", main = "Actual (black) versus predicted (red)")

points(x_valid[,"AtBat"], predict(ridge_model_optimal, newx = x_valid),
       col = "red")

segments(x_valid[,"AtBat"], y_valid, col = "red",
         y1 = predict(ridge_model_optimal, newx = x_valid))
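If a bar graph of the MSE is really what is wanted, one reading is one bar per penalty value. The sketch below illustrates that idea in base R on simulated data. It uses a closed-form ridge solution rather than glmnet, so its lambda values are not on glmnet's scale, and every name in it (ridge_mse, lambdas, mse_by_lambda) is made up for illustration.

```r
set.seed(1)

# Simulated, standardised predictors and a response
n <- 120
p <- 5
x <- scale(matrix(rnorm(n * p), n, p))
y <- drop(x %*% c(3, -2, 0, 1, 0)) + rnorm(n)

# Hold out 40 rows for validation
train_rows <- sample(n, 80)
x_tr <- x[train_rows, ]
y_tr <- y[train_rows]
x_va <- x[-train_rows, ]
y_va <- y[-train_rows]

# Closed-form ridge on centred training data:
# solve (X'X + lambda * I) beta = X'y, then score the held-out rows
ridge_mse <- function(lambda) {
  y_mean <- mean(y_tr)
  x_mean <- colMeans(x_tr)
  xc <- sweep(x_tr, 2, x_mean)
  beta <- solve(crossprod(xc) + lambda * diag(p),
                crossprod(xc, y_tr - y_mean))
  pred <- y_mean + sweep(x_va, 2, x_mean) %*% beta
  mean((drop(pred) - y_va)^2)
}

lambdas <- c(0.01, 0.1, 1, 10, 100, 1000)
mse_by_lambda <- vapply(lambdas, ridge_mse, numeric(1))
names(mse_by_lambda) <- lambdas

# One bar per penalty value
barplot(mse_by_lambda, xlab = "lambda", ylab = "Validation MSE",
        main = "Ridge validation MSE by lambda")
```

With glmnet itself, calling plot() on the cv.glmnet object (cv_ridge in the code here) draws the cross-validated MSE against log(lambda), which may already be the graph being asked for.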

huangapple
  • Published on February 26, 2023, 20:01:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75571834.html