Grid search using a pipeline

Question

I am not sure I am correctly using the hyperparameter search functions in scikit-learn.

Please consider this code:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

scalers = [
            StandardScaler(),
            # MinMaxScaler(feature_range=(0,1)), 
            MinMaxScaler(feature_range=(-1,1)), 
            # PowerTransformer(),
            # RobustScaler(unit_variance=True)
        ]

svm_param = {'scaler': scalers, 
        'learner': [LinearSVC()],
        # 'learner__dual': [True, False],  # with True svm selects random features
        'learner__C': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],  # uniform(loc=1e-5, scale=1e+5),  # [0.5, 1.0, 2],
        'learner__tol': [1e-4],  # svm_learner_tol,  # [1e-5, 1e-4, 1e-3],
        'learner__random_state': [22],
        'learner__max_iter': [1000]}

pipe = Pipeline([
        ("scaler", None),
        ("learner", None)
    ])

grid = GridSearchCV(
    pipe, param_grid=svm_param, 
    scoring="accuracy",
    verbose=2,
    refit=True, 
    cv = 5, return_train_score=True)

n_features = [X.shape[1], 20]

for nf in n_features:
    X = iris.data[:, :nf]  # keep the first nf features (iris has only 4, so nf=20 still keeps all 4)
    print("n_features", nf)

    grid = GridSearchCV(
        pipe, param_grid=svm_param, 
        scoring="accuracy",
        verbose=2,
        refit=True, 
        cv = 5, return_train_score=True)

    grid.fit(X, y)

Basically, I would like to perform a grid search over two scalers and some parameters of a classifier, which in this case is a linear SVM (LinearSVC).

However, this is the output I get. In the first iteration over the number of features, I see:

(150, 4)
n_features 4
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
...

That seems fine. However, in the second iteration of the for loop, I get:

n_features 20
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
...

During the first iteration, the learner is printed out as:

learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, ...

i.e., the default constructor followed by the parameters.

During the second iteration, the constructor does not correspond to the default one:

learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, ...

In this case, the constructor shows C=100.0, but learner__C=0.01.

Is this normal, or am I doing something wrong?

Answer 1

Score: 1

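With refit=True, GridSearchCV refits the pipeline with the best parameters once the grid search is complete. The LinearSVC() instance in your parameter grid is itself one of those parameters, so its hyperparameters get set during the refit. This is why, after the first grid search, LinearSVC() shows up with C=100.0.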
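
The mechanism can be reproduced without running a search at all. The sketch below is only an illustration, not GridSearchCV's internal code: it assumes the refit step boils down to calling set_params on the pipeline with the best parameters, and the best_params dict is made up for the example. Newer scikit-learn releases may clone estimators taken from the grid before refitting, in which case the search itself would no longer modify the original instance.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# The single instance that would sit in the parameter grid under 'learner'.
svc = LinearSVC()
pipe = Pipeline([("scaler", None), ("learner", None)])

# Hypothetical "best parameters", shaped like the dict a search reports.
best_params = {
    "scaler": StandardScaler(),
    "learner": svc,
    "learner__C": 100.0,
    "learner__random_state": 22,
}

# Refit-style step: plug the instance into the pipeline, then set its
# hyperparameters through the 'learner__' prefix.
pipe.set_params(**best_params)

# The pipeline holds the very same object, so the original was modified.
same_object = pipe.named_steps["learner"] is svc
print(same_object, svc.C, svc.random_state)
print(svc)  # the repr lists only parameters that differ from the defaults
```

The last line also explains why the log shows only C and random_state inside the constructor: tol=1e-4 and max_iter=1000 are the LinearSVC defaults, so the repr leaves them out.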

This has no impact on your second grid search: C is still overridden by learner__C for every candidate during the search.
But if you do not want GridSearchCV to change the LinearSVC instance, just use refit=False.
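
If you would rather keep refit=True (with refit=False there is no best_estimator_ to use afterwards), another option is to build a new parameter grid, with new estimator instances, for every search. This is a minimal sketch under that assumption; make_param_grid is a hypothetical helper, not part of scikit-learn, and the grid is a trimmed version of the one in the question.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVC


def make_param_grid():
    """Return a new grid holding new estimator instances (hypothetical helper)."""
    return {
        "scaler": [StandardScaler(), MinMaxScaler(feature_range=(-1, 1))],
        "learner": [LinearSVC()],
        "learner__C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
        "learner__random_state": [22],
    }


X, y = datasets.load_iris(return_X_y=True)
pipe = Pipeline([("scaler", None), ("learner", None)])

c_before = []  # value of C on the grid's LinearSVC before each search
for run in range(2):
    param_grid = make_param_grid()
    c_before.append(param_grid["learner"][0].C)
    grid = GridSearchCV(pipe, param_grid=param_grid,
                        scoring="accuracy", refit=True, cv=5)
    grid.fit(X, y)
    print("run", run, "best C:", grid.best_params_["learner__C"])

print(c_before)  # each search started from a default LinearSVC
```

Every search then starts from a default LinearSVC(), so the verbose log looks the same in each iteration, whatever the previous refit did to the old instance.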

huangapple
  • Posted on June 29, 2023, 22:55:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76582236.html