Posted: 2023-02-26 20:01:43

Problems with drawing graph of MSE error for ridge model in R

Question

I built the optimal ridge model:

library(glmnet)
data(Hitters, package = "ISLR")
x <- Hitters[, c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks", "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI", "CWalks", "PutOuts", "Assists", "Errors")]
y <- Hitters$Salary
x <- scale(x)
lambda_seq <- 10^seq(10, -2, length = 100)
ridge_model <- glmnet(x, y, alpha = 0, lambda = lambda_seq)
cv_ridge <- cv.glmnet(x, y, alpha = 0)
lambda_optimal <- cv_ridge$lambda.min
ridge_model_optimal <- glmnet(x, y, alpha = 0, lambda = lambda_optimal)
summary(ridge_model_optimal)

I would like to draw a bar graph showing the MSE. I tried to do it with the following code:

x_train <- model.matrix(Salary ~ ., data = train)[,-1]
y_train <- train$Salary
x_valid <- model.matrix(Salary ~ ., data = valid)[,-1]
y_valid <- valid$Salary
mse_ridge <- caret::RMSE(predict(ridge_model_optimal, newx = x_valid), y_valid)

but I receive this error:

Error in predict.glmnet(ridge_model_optimal, newx = x_valid) :
  The number of variables in newx must be 16

Do you know how I can fix it?

Answer 1

Score: 1

You haven't shown in the code how you created the train and valid data sets, yet I suspect this is exactly where your problem lies.
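
One likely cause, assuming train and valid were taken from the full Hitters data frame: model.matrix(Salary ~ ., ...) turns every column other than Salary into a predictor, including the factors League, Division and NewLeague, so newx would end up with 19 columns rather than the 16 the model was fitted on. The sketch below reproduces that kind of mismatch on a toy data frame (made-up values and a hypothetical two-predictor list, not the Hitters data) and shows one way to check for it:

```r
# Toy stand-in for a validation set built from the full data frame
# (hypothetical values; not the Hitters data).
valid_toy <- data.frame(
  Salary = c(100, 200, 300, 400),
  AtBat  = c(10, 20, 30, 40),
  Hits   = c(1, 2, 3, 4),
  League = factor(c("A", "N", "A", "N"))
)

# Columns the (hypothetical) model was fitted on.
predictors <- c("AtBat", "Hits")

# The formula Salary ~ . expands every other column, so the factor
# League becomes an extra dummy column.
x_all <- model.matrix(Salary ~ ., data = valid_toy)[, -1]
extra_cols <- setdiff(colnames(x_all), predictors)
print(colnames(x_all))
print(extra_cols)

# Selecting the fitted predictors explicitly keeps the column count in line.
x_ok <- as.matrix(valid_toy[, predictors])
print(ncol(x_ok) == length(predictors))
```

With the real objects, comparing ncol(x_valid) and colnames(x_valid) against the columns of the x passed to glmnet should show whether this is what happened.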

First let us load the data and limit ourselves to complete cases:

library(glmnet)
data(Hitters, package = "ISLR")
Hitters <- Hitters[complete.cases(Hitters), ]

Now we can create our x and y data:

x <- Hitters[,c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks",
               "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI",
               "CWalks", "PutOuts", "Assists", "Errors")]
x <- scale(x)
y <- Hitters$Salary

We can create our model like this:

lambda_seq <- 10^seq(10, -2, length = 100)
ridge_model <- glmnet(x, y, alpha = 0, lambda = lambda_seq)
cv_ridge <- cv.glmnet(x, y, alpha = 0)
lambda_optimal <- cv_ridge$lambda.min
ridge_model_optimal <- glmnet(x, y, alpha = 0, lambda = lambda_optimal)

Now let us sample the rows of Hitters at random to create training and validation subsets, with an approximate 2:1 split:

set.seed(1)
train_test <- sample(1:2, nrow(x), TRUE, prob = 2:1)
train <- as.data.frame(cbind(Salary = y[train_test == 1], x[train_test == 1,]))
valid <- as.data.frame(cbind(Salary = y[train_test == 2], x[train_test == 2,]))

And we can get the x and y values for train and valid like so:

x_train <- model.matrix(Salary ~ ., data = train)[,-1]
y_train <- train$Salary
x_valid <- model.matrix(Salary ~ ., data = valid)[,-1]
y_valid <- valid$Salary

Now we can get the RMSE however we like, be it via caret or a simple manual calculation:

caret::RMSE(predict(ridge_model_optimal, newx = x_valid), y_valid)
#> [1] 389.107
sqrt(mean((predict(ridge_model_optimal, newx = x_valid) - y_valid)^2))
#> [1] 389.107
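
Note that caret::RMSE returns the square root of the mean squared error, so a value stored under a name like mse_ridge is an RMSE, not an MSE. A minimal base-R sketch of the two quantities, using illustrative helper names (mse and rmse are not part of any package here) and made-up numbers:

```r
# Illustrative helpers; as.numeric() also accepts the one-column matrix
# that predict() returns for a glmnet fit.
mse  <- function(pred, actual) mean((as.numeric(pred) - actual)^2)
rmse <- function(pred, actual) sqrt(mse(pred, actual))

# Made-up predictions and observations: every error is +/- 10.
pred   <- c(110, 190, 310)
actual <- c(100, 200, 300)

print(mse(pred, actual))   # 100
print(rmse(pred, actual))  # 10
```

With the objects above, the MSE would be mse(predict(ridge_model_optimal, newx = x_valid), y_valid).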

You say that you want to plot the RMSE, but it's not clear what you mean by that, since there is only a single value for RMSE. Perhaps you want a histogram of the residuals (predicted minus actual)?

hist(predict(ridge_model_optimal, newx = x_valid) - y_valid,
     main = "Residual histogram", xlab = "Predicted - Actual")

Or maybe show the individual errors?

plot(x_valid[,"AtBat"], y_valid, xlab = "At Bat (normalized)",
     ylab = "Salary", main = "Actual (black) versus predicted (red)")
points(x_valid[,"AtBat"], predict(ridge_model_optimal, newx = x_valid),
       col = "red")
segments(x_valid[,"AtBat"], y_valid, col = "red",
         y1 = predict(ridge_model_optimal, newx = x_valid))
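
If a bar graph of MSE is really what is wanted, one possible reading is one bar per data set (training versus validation). The sketch below is only an illustration of that idea: to stay self-contained it swaps in the built-in mtcars data and a plain lm() fit as stand-ins for the Hitters ridge model, and the split size and variable names are assumptions.

```r
# Stand-in data and model: built-in mtcars and lm(), NOT the Hitters ridge fit.
set.seed(1)
idx         <- sample(nrow(mtcars), size = 21)  # roughly a 2:1 split of 32 rows
train_demo  <- mtcars[idx, ]
valid_demo  <- mtcars[-idx, ]
fit_demo    <- lm(mpg ~ wt + hp, data = train_demo)

# Mean squared error on each subset.
mse_values <- c(
  Train      = mean((predict(fit_demo, newdata = train_demo) - train_demo$mpg)^2),
  Validation = mean((predict(fit_demo, newdata = valid_demo) - valid_demo$mpg)^2)
)

# One bar per data set.
barplot(mse_values, ylab = "MSE", main = "MSE by data set")
```

For the ridge model, the two predict() calls would be replaced by predict(ridge_model_optimal, newx = x_train) and predict(ridge_model_optimal, newx = x_valid), compared against y_train and y_valid.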
