Problems with drawing graph of MSE error for ridge model in R
Question
I built the optimal ridge model:
library(glmnet)
data(Hitters, package = "ISLR")
x <- Hitters[, c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks", "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI", "CWalks", "PutOuts", "Assists", "Errors")]
y <- Hitters$Salary
x <- scale(x)
lambda_seq <- 10^seq(10, -2, length = 100)
ridge_model <- glmnet(x, y, alpha = 0, lambda = lambda_seq)
cv_ridge <- cv.glmnet(x, y, alpha = 0)
lambda_optimal <- cv_ridge$lambda.min
ridge_model_optimal <- glmnet(x, y, alpha = 0, lambda = lambda_optimal)
summary(ridge_model_optimal)
I would like to draw a bar graph showing the MSE. I tried to do it with the following code:
x_train <- model.matrix(Salary ~ ., data = train)[,-1]
y_train <- train$Salary
x_valid <- model.matrix(Salary ~ ., data = valid)[,-1]
y_valid <- valid$Salary
mse_ridge <- caret::RMSE(predict(ridge_model_optimal, newx = x_valid), y_valid)
but I receive this error:
>Error in predict.glmnet(ridge_model_optimal, newx = x_valid) :
The number of variables in newx must be 16
Do you know how I can fix it?
Answer 1
Score: 1
You haven't shown in the code how you created the train and valid data sets, yet I suspect this is exactly where your problem lies.
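One plausible cause, stated as an assumption because the question does not show how train and valid were built: if they are subsets of the full Hitters data frame, model.matrix(Salary ~ ., ...) expands every predictor, factor columns included, so newx ends up with more columns than the 16 the model was fitted on. The toy sketch below uses base R only, with made-up data, to show the mismatch and one way to restrict the matrix to the fitted columns.

```r
# Toy data standing in for Hitters: two numeric predictors, one factor, one response.
# The values are made up and only illustrate the column-count mismatch.
toy <- data.frame(
  Salary = c(100, 200, 150, 300),
  AtBat  = c(300, 450, 380, 520),
  Hits   = c(80, 120, 95, 150),
  League = factor(c("A", "N", "A", "N"))
)

# Suppose the model was fitted on these columns only.
model_vars <- c("AtBat", "Hits")

# model.matrix() on the full data frame also expands the factor into a dummy column.
x_all <- model.matrix(Salary ~ ., data = toy)[, -1]
ncol(x_all)   # 3: AtBat, Hits, LeagueN

# Keep only the columns the model knows about, in the same order.
x_new <- x_all[, model_vars, drop = FALSE]
ncol(x_new)   # 2, matching the fitted model
```

Note that the model in the question was fitted on scaled predictors, so any matrix passed as newx would also need the same scaling for the predictions to be meaningful.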
First let us load the data and limit ourselves to complete cases:
library(glmnet)
data(Hitters, package = "ISLR")
Hitters <- Hitters[complete.cases(Hitters), ]
Now we can create our x and y data:
x <- Hitters[, c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks",
                 "Years", "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI",
                 "CWalks", "PutOuts", "Assists", "Errors")]
x <- scale(x)
y <- Hitters$Salary
We can create our model like this:
lambda_seq <- 10^seq(10, -2, length = 100)
ridge_model <- glmnet(x, y, alpha = 0, lambda = lambda_seq)
cv_ridge <- cv.glmnet(x, y, alpha = 0)
lambda_optimal <- cv_ridge$lambda.min
ridge_model_optimal <- glmnet(x, y, alpha = 0, lambda = lambda_optimal)
Now let us sample Hitters at random to create training and validation subsets, with a 2:1 split:
set.seed(1)
train_test <- sample(1:2, nrow(x), TRUE, prob = 2:1)
train <- as.data.frame(cbind(Salary = y[train_test == 1], x[train_test == 1,]))
valid <- as.data.frame(cbind(Salary = y[train_test == 2], x[train_test == 2,]))
And we can get the x and y values for train and valid like so:
x_train <- model.matrix(Salary ~ ., data = train)[,-1]
y_train <- train$Salary
x_valid <- model.matrix(Salary ~ ., data = valid)[,-1]
y_valid <- valid$Salary
Now we can get the RMSE however we like, be it via caret or a simple manual calculation:
caret::RMSE(predict(ridge_model_optimal, newx = x_valid), y_valid)
#> [1] 389.107
sqrt(mean((predict(ridge_model_optimal, newx = x_valid) - y_valid)^2))
#> [1] 389.107
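Since the question asks for the MSE while caret::RMSE returns its square root, it may help to see both written out. The helpers below are a minimal base R sketch with made-up toy vectors; the names mse and rmse are hypothetical and not part of glmnet or caret. The as.numeric() call is there because predict() on a glmnet model returns a one-column matrix rather than a plain vector.

```r
# Mean squared error and its square root, written out in base R.
# as.numeric() flattens a one-column prediction matrix into a vector.
mse  <- function(pred, actual) mean((as.numeric(pred) - actual)^2)
rmse <- function(pred, actual) sqrt(mse(pred, actual))

# Toy vectors (made up for illustration).
pred   <- c(1, 2, 3)
actual <- c(1, 2, 5)

mse(pred, actual)    # 4/3
rmse(pred, actual)   # sqrt(4/3)
```

With the objects from this answer, squaring the RMSE shown above would give the corresponding MSE.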
You say that you want to plot the RMSE, but it's not clear what you mean by that, since the RMSE is a single value. Perhaps you want a histogram of the residuals (predicted minus actual)?
hist(predict(ridge_model_optimal, newx = x_valid) - y_valid,
main = "Residual histogram", xlab = "Predicted - Actual")
Or maybe show the individual errors?
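# Actual salaries against the (scaled) AtBat predictor
plot(x_valid[, "AtBat"], y_valid, xlab = "At Bat (normalized)",
     ylab = "Salary", main = "Actual (black) versus predicted (red)")
# Predicted salaries for the same observations
points(x_valid[, "AtBat"], predict(ridge_model_optimal, newx = x_valid),
       col = "red")
# Vertical segments joining each actual value to its prediction
segments(x_valid[, "AtBat"], y_valid, col = "red",
         y1 = predict(ridge_model_optimal, newx = x_valid))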
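If a bar graph of MSE is really what is wanted, one reading (an assumption about the question's intent) is to compare the MSE of several data sets or models side by side, one bar each. The sketch below does this in base R on simulated data, with lm() as a stand-in for the ridge model; all object names are hypothetical.

```r
set.seed(1)

# Simulated data; lm() is only a stand-in for the ridge model in the answer.
n   <- 90
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Random split with roughly 2:1 proportions, as in the answer.
split     <- sample(1:2, n, replace = TRUE, prob = 2:1)
toy_train <- dat[split == 1, ]
toy_valid <- dat[split == 2, ]

fit <- lm(y ~ x1 + x2, data = toy_train)

mse <- function(pred, actual) mean((pred - actual)^2)

# One MSE value per data set, collected in a named vector.
mse_values <- c(
  Train      = mse(predict(fit, newdata = toy_train), toy_train$y),
  Validation = mse(predict(fit, newdata = toy_valid), toy_valid$y)
)

# One bar per MSE value.
barplot(mse_values, ylab = "MSE", main = "MSE by data set")
```

The same named-vector pattern would work for comparing several models on one validation set. For the cross-validated MSE across the lambda path, plot(cv_ridge) already draws a curve with error bars.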