Sklearn SequentialFeatureSelector "Pipeline should either be a classifier" when using a classifier

Question

I get this error when using a classifier wrapped in SequentialFeatureSelector (SFS) as part of a scikit-learn pipeline:

Traceback (most recent call last):
  File "main.py", line 45, in <module>
    rs.fit(X_train, y_train)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 898, in fit
    self._run_search(evaluate_candidates)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 1419, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 845, in evaluate_candidates
    out = parallel(
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 65, in __call__
    return super().__call__(iterable_with_config)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1855, in __call__
    return output if self.return_generator else list(output)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1784, in _get_sequential_output
    res = func(*args, **kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 127, in __call__
    return self.function(*args, **kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 754, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 813, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 266, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 459, in _score
    y_pred = method_caller(clf, "decision_function", X, pos_label=pos_label)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/utils/_response.py", line 103, in _get_response_values
    raise ValueError(
ValueError: Pipeline should either be a classifier to be used with response_method=decision_function or the response_method should be 'predict'. Got a regressor with response_method=decision_function instead.

Code to reproduce (replit):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()
cv = StratifiedKFold(n_splits=2)
# SFS wraps the classifier and selects features using its cross-validated accuracy
sfs = SFS(clf, n_features_to_select=1, scoring='accuracy', cv=cv, n_jobs=-1)
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
lr_param_grid = {
  'sequentialfeatureselector__estimator__class_weight': ['balanced', None]
}
# The pipeline ends with SFS; no classifier follows it
pipe = make_pipeline(imputer, sfs)
rs = GridSearchCV(estimator=pipe,
                  param_grid=lr_param_grid,
                  cv=cv,
                  scoring="roc_auc",
                  error_score="raise")

# Generate random data for binary classification
X, y = make_classification(
  n_samples=10,  # Number of samples
  n_features=3,  # Number of features
  n_informative=2,  # Number of informative features
  n_redundant=1,  # Number of redundant features
  n_clusters_per_class=1,  # Number of clusters per class
  random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Raises the ValueError shown above
rs.fit(X_train, y_train)

I get the same error when using other classifiers, other performance metrics, and the mlxtend version of SFS.

Versions of packages:

  • python = 3.10.8
  • scikit-learn = 1.3.0

Answer 1

Score: 1

The issue you're encountering stems from an interaction between the Sequential Feature Selector, GridSearchCV, and the scoring method being used.

GridSearchCV validates each candidate with cross-validation internally. Some scoring methods, such as 'roc_auc', need continuous scores or class probabilities from the model rather than hard labels. These are typically obtained via the classifier's decision_function() or predict_proba() methods.
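
As a small illustration of what a 'roc_auc'-style score consumes, the sketch below fits a plain LogisticRegression on synthetic data and computes the AUC from both decision_function() and predict_proba(). The dataset size and variable names are assumptions chosen for the example, not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem (sizes are illustrative assumptions)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Continuous scores: signed distance to the decision boundary
scores_from_decision = clf.decision_function(X)
# Probability of the positive class (second column)
scores_from_proba = clf.predict_proba(X)[:, 1]

# ROC AUC depends only on how the samples are ranked by the score
auc_decision = roc_auc_score(y, scores_from_decision)
auc_proba = roc_auc_score(y, scores_from_proba)
print(auc_decision, auc_proba)
```

Either method works for a classifier, but an estimator that offers neither cannot be scored this way.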

However, SFS is a transformer and does not expose these methods from the classifier it encapsulates. Because SFS is the final step of your pipeline, the pipeline itself is not a classifier, so when GridSearchCV attempts to apply the 'roc_auc' scoring function, it raises the error because it cannot access the required scores.

Similarly, if you change the scoring function to 'accuracy' or another one that relies on the predict() method, you may hit a different error, since SFS does not expose the predict() method of the encapsulated classifier either.
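
A quick way to check this yourself is to inspect which methods each object offers. The snippet below is a minimal sketch using hasattr() and sklearn.base.is_classifier; the names sfs_methods, clf_methods and pipe_is_classifier are made up for the example.

```python
from sklearn.base import is_classifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()
sfs = SequentialFeatureSelector(clf, n_features_to_select=1)
# Same shape as the pipeline in the question: imputer followed by SFS
pipe = make_pipeline(SimpleImputer(strategy="median"), sfs)

methods = ("predict", "predict_proba", "decision_function", "transform")
# Which of these methods does the selector offer, compared with the classifier?
sfs_methods = {name: hasattr(sfs, name) for name in methods}
clf_methods = {name: hasattr(clf, name) for name in methods}
# Whether scikit-learn treats the whole pipeline as a classifier
pipe_is_classifier = is_classifier(pipe)

print("SFS:", sfs_methods)
print("LogisticRegression:", clf_methods)
print("pipeline is a classifier:", pipe_is_classifier)
```

The selector offers transform() but none of the prediction methods, which is the opposite of the classifier it wraps.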

This is the root cause of the error message you're seeing - a lack of access to required methods due to the encapsulation of the classifier within the SFS.
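
One possible way around this, offered as a sketch rather than a definitive fix, is to append a classifier as the final pipeline step so that the pipeline gains predict(), predict_proba() and decision_function(). The second LogisticRegression and the larger sample size (200 instead of 10, to keep the nested cross-validation folds populated) are assumptions not present in the question; n_jobs is left at its default here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline

# Assumption: more samples than in the question so every CV fold has both classes
X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=1, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

cv = StratifiedKFold(n_splits=2)
sfs = SFS(LogisticRegression(), n_features_to_select=1,
          scoring="accuracy", cv=cv)
imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# Assumption: a second classifier is added after SFS as the final step,
# so the pipeline as a whole behaves like a classifier
pipe = make_pipeline(imputer, sfs, LogisticRegression())

param_grid = {
    # Still tunes the classifier wrapped inside SFS, as in the question
    "sequentialfeatureselector__estimator__class_weight": ["balanced", None],
}
rs = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=cv,
                  scoring="roc_auc", error_score="raise")
rs.fit(X_train, y_train)
print(rs.best_params_, rs.best_score_)
```

With this layout the estimator inside SFS is used only for choosing features, while the final step is the one that gets scored; its own parameters could be tuned through keys prefixed with logisticregression__.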

As for mlxtend, it seems probable that you're encountering the same issue. If mlxtend's Sequential Feature Selector also does not expose the predict_proba(), decision_function(), or predict() methods from the encapsulated classifier, you would face a similar problem.

huangapple
  • Published on July 3, 2023 at 21:53:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76605421.html