2023-06-29 22:55:12

grid search using a pipeline

Question

I am not sure I am correctly using the hyperparameter search functions in scikit-learn.

Please consider this code:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

scalers = [
            StandardScaler(),
            # MinMaxScaler(feature_range=(0,1)), 
            MinMaxScaler(feature_range=(-1,1)), 
            # PowerTransformer(),
            # RobustScaler(unit_variance=True)
        ]

svm_param = {'scaler': scalers,
        'learner': [LinearSVC()],
        # 'learner__dual': [True, False],  # with True svm selects random features
        'learner__C': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],  # uniform(loc=1e-5, scale=1e+5),  # [0.5, 1.0, 2],
        'learner__tol': [1e-4],  # svm_learner_tol,  # [1e-5, 1e-4, 1e-3],
        'learner__random_state': [22],
        'learner__max_iter': [1000]}

pipe = Pipeline([
        ("scaler", None),
        ("learner", None)
    ])

grid = GridSearchCV(
    pipe, param_grid=svm_param,
    scoring="accuracy",
    verbose=2,
    refit=True,
    cv=5, return_train_score=True)

n_features = [X.shape[1], 20]

for nf in n_features:
    X = iris.data[:, :nf]  # keep the first nf features (iris has only 4, so nf=20 still yields all 4)
    print("n_features", nf)

    grid = GridSearchCV(
        pipe, param_grid=svm_param,
        scoring="accuracy",
        verbose=2,
        refit=True,
        cv=5, return_train_score=True)

    grid.fit(X, y)

Basically, I would like to perform a grid search over two scalers and some parameters of a classifier, which in this case is a linear SVM (LinearSVC).

However, this is the output I get. In the first iteration over the number of features, I read:

(150, 4)
n_features 4
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
...

That seems fine. However, in the second iteration of the for loop, I get:

n_features 20
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
[CV] END learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, learner__random_state=22, learner__tol=0.0001, scaler=StandardScaler(); total time=   0.0s
...

During the first iteration, the learner is printed out as:

learner=LinearSVC(), learner__C=0.01, learner__max_iter=1000, ...

i.e., default constructor followed by the parameters.

During the second iteration, the constructor does not correspond to the default one:

learner=LinearSVC(C=100.0, random_state=22), learner__C=0.01, learner__max_iter=1000, ...

In this case, the constructor shows C=100.0, but learner__C=0.01.

Is this normal, or am I doing something wrong?

Answer 1

Score: 1

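With refit=True, GridSearchCV refits the estimator with the best parameters once the grid search is complete. Because your parameter grid holds a single LinearSVC() instance, the best parameters end up set on that very object, which is why it is printed as LinearSVC(C=100.0, random_state=22) after the first grid search.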
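
To illustrate the mechanism, here is a minimal sketch that uses only Pipeline.set_params. It is not GridSearchCV's internal code, and whether a search really leaves the grid's estimator modified may depend on your scikit-learn version. The name shared_learner is made up for this example, and the pipeline starts with real placeholder steps instead of None, which is a choice of this sketch.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# One LinearSVC object, like the single instance stored in svm_param["learner"].
shared_learner = LinearSVC()
repr_before = repr(shared_learner)

pipe = Pipeline([("scaler", StandardScaler()), ("learner", LinearSVC())])

# Plug the shared object into the pipeline and set nested parameters on it.
# set_params does not copy the estimator, so shared_learner itself is modified.
pipe.set_params(learner=shared_learner, learner__C=100.0, learner__random_state=22)
repr_after_first = repr(shared_learner)

# A later candidate sets learner__C again, overriding the stored C=100.0.
pipe.set_params(learner=shared_learner, learner__C=0.01)
c_after_second = shared_learner.C

print(repr_before)
print(repr_after_first)
print(c_after_second)
```

The step is replaced first and the nested learner__C is applied afterwards, so the value given in the grid always wins over whatever the instance already carried.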

This has no impact on your second grid search: learner__C is still set explicitly for every candidate during the search, overriding whatever value the instance carries.
But if you do not want GridSearchCV to change the LinearSVC instance, just use refit=False.
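
If you want to be sure nothing is shared between searches, one option is to combine refit=False with a parameter grid that is rebuilt for every search. The sketch below assumes a hypothetical helper make_param_grid (not part of scikit-learn), uses a shortened C grid to keep it quick, and gives the pipeline real placeholder steps instead of None.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVC


def make_param_grid():
    """Hypothetical helper: return a fresh grid so no estimator is shared."""
    return {
        "scaler": [StandardScaler(), MinMaxScaler(feature_range=(-1, 1))],
        "learner": [LinearSVC()],
        "learner__C": [0.1, 1.0, 10.0],  # shortened grid, for illustration only
        "learner__random_state": [22],
        "learner__max_iter": [1000],
    }


X, y = datasets.load_iris(return_X_y=True)

# Placeholder steps; every candidate replaces both of them.
pipe = Pipeline([("scaler", StandardScaler()), ("learner", LinearSVC())])

results = []
for run in range(2):
    param_grid = make_param_grid()  # new estimator objects for each search
    grid = GridSearchCV(pipe, param_grid=param_grid,
                        scoring="accuracy", cv=5, refit=False)
    grid.fit(X, y)
    results.append((grid.best_params_["learner__C"], grid.best_score_))
    # The estimator stored in the grid still has its default C.
    learner_c_after_search = param_grid["learner"][0].C

print(results)
print(learner_c_after_search)
```

Keep in mind that with refit=False there is no best_estimator_ to predict with, so a final model has to be fitted separately using best_params_.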
