Grid search using a pipeline
Question
I am not sure I am using the hyperparameter search functions in scikit-learn correctly.
Please consider this code:

    from sklearn import datasets
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    scalers = [
        StandardScaler(),
        # MinMaxScaler(feature_range=(0,1)),
        MinMaxScaler(feature_range=(-1,1)),
        # PowerTransformer(),
        # RobustScaler(unit_variance=True)
    ]

    svm_param = {'scaler': scalers,
                 'learner': [LinearSVC()],
                 # 'learner__dual': [True, False],  # with True svm selects random features
                 'learner__C': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],  # uniform(loc=1e-5, scale=1e+5), # [0.5, 1.0, 2],
                 'learner__tol': [1e-4],  # svm_learner_tol, # [1e-5, 1e-4, 1e-3],
                 'learner__random_state': [22],
                 'learner__max_iter': [1000]}

    pipe = Pipeline([
        ("scaler", None),
        ("learner", None)
    ])

    # Not used as-is: the search object is re-created inside the loop below.
    grid = GridSearchCV(
        pipe, param_grid=svm_param,
        scoring="accuracy",
        verbose=2,
        refit=True,
        cv=5, return_train_score=True)

    n_features = [X.shape[1], 20]
    for nf in n_features:
        # Keep the first nf features (iris has only 4, so nf=20 keeps all of them).
        X = iris.data[:, :nf]
        print("n_features", nf)
        grid = GridSearchCV(
            pipe, param_grid=svm_param,
            scoring="accuracy",
            verbose=2,
            refit=True,
            cv=5, return_train_score=True)
        grid.fit(X, y)

Basically, I would like to perform a grid search over two scalers and some parameters of a classifier, which in this case is an SVM.
However, this is the output I get. In the first iteration over the number of features, I see:
(150, 4)
n_features 4
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
...
That seems fine. However, in the second iteration of the for loop, I get:
n_features 20
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time= 0.0s
...
During the first iteration, the learner is printed out as:
learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, ...
i.e., the default constructor followed by the parameters.
During the second iteration, the constructor does not correspond to the default one:
learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, ...
In this case, the constructor shows C=100.0, but learner__C=0.01.
Is this normal, or am I doing something wrong?
Answer 1
Score: 1
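With refit=True, GridSearchCV fits the estimator once more with the best parameters after the grid search is complete. Because svm_param holds a single LinearSVC() instance that both searches share, that refit sets the best parameters on this instance. This is why, after the first grid search, LinearSVC() is printed with C=100.0.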
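The mechanism can be illustrated without running a search at all. The following is a minimal sketch, not GridSearchCV's actual internals: it only shows that setting a nested parameter through a pipeline changes the estimator object that was passed in. The name shared_learner is hypothetical and stands in for the LinearSVC() stored in svm_param.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the single LinearSVC() instance kept in svm_param.
shared_learner = LinearSVC()
print(shared_learner.C)  # default C, before any set_params call

# Placeholder steps; they are replaced or reconfigured by set_params below.
pipe = Pipeline([("scaler", StandardScaler()), ("learner", LinearSVC())])

# Plug in the shared instance and set a nested parameter in one call,
# roughly what a refit with the best parameters amounts to.
pipe.set_params(learner=shared_learner, learner__C=100.0)

# The pipeline step is the very same object, so the change is visible
# on shared_learner itself, outside the pipeline.
print(pipe.named_steps["learner"] is shared_learner)
print(shared_learner.C)
```

Whether a refit leaves this trace on the object in the parameter grid may depend on the installed scikit-learn version, so treat the sketch as an explanation of the printed output rather than a guarantee of behavior.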
This has no impact on your second grid search: C is still overwritten by the learner__C value of each candidate during the search.
But if you do not want GridSearchCV to change the LinearSVC instance, just use refit=False.
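If the refitted model is still needed, another option is to build a new parameter grid for every search, so that no estimator instance is shared between runs. This is a sketch under assumptions: make_param_grid and best_c_per_run are hypothetical names, the C list is shortened from the question, and the pipeline uses real placeholder steps instead of None.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVC


def make_param_grid():
    """Return a new grid with newly constructed estimators on every call."""
    return {
        "scaler": [StandardScaler(), MinMaxScaler(feature_range=(-1, 1))],
        "learner": [LinearSVC()],
        "learner__C": [0.01, 1.0, 100.0],  # shortened list, for illustration
        "learner__random_state": [22],
        "learner__max_iter": [1000],
    }


X, y = datasets.load_iris(return_X_y=True)

# Placeholder steps; the grid replaces both of them for every candidate.
pipe = Pipeline([("scaler", StandardScaler()), ("learner", LinearSVC())])

best_c_per_run = []
for run in range(2):
    grid = GridSearchCV(
        pipe,
        param_grid=make_param_grid(),  # fresh LinearSVC() for each search
        scoring="accuracy",
        refit=True,
        cv=5,
    )
    grid.fit(X, y)
    best_c_per_run.append(grid.best_params_["learner__C"])

print(best_c_per_run)
```

With refit=False, by contrast, the search still reports best_params_ and cv_results_, but best_estimator_ is not available and the search object cannot be used to predict directly.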