Sklearn SequentialFeatureSelector "Pipeline should either be a classifier" when using a classifier
Question
I get this error when using a classifier and SFS as part of an sklearn pipeline:
Traceback (most recent call last):
  File "main.py", line 45, in <module>
    rs.fit(X_train, y_train)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 898, in fit
    self._run_search(evaluate_candidates)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 1419, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 845, in evaluate_candidates
    out = parallel(
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 65, in __call__
    return super().__call__(iterable_with_config)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1855, in __call__
    return output if self.return_generator else list(output)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1784, in _get_sequential_output
    res = func(*args, **kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 127, in __call__
    return self.function(*args, **kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 754, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 813, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 266, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 459, in _score
    y_pred = method_caller(clf, "decision_function", X, pos_label=pos_label)
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
  File "/home/runner/SFSpredictproba/venv/lib/python3.10/site-packages/sklearn/utils/_response.py", line 103, in _get_response_values
    raise ValueError(
ValueError: Pipeline should either be a classifier to be used with response_method=decision_function or the response_method should be 'predict'. Got a regressor with response_method=decision_function instead.
Code to reproduce (replit):
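import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()
cv = StratifiedKFold(n_splits=2)
sfs = SFS(clf, n_features_to_select=1, scoring='accuracy', cv=cv, n_jobs=-1)
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Grid over a parameter of the classifier wrapped inside SFS
lr_param_grid = {
    'sequentialfeatureselector__estimator__class_weight': ['balanced', None]
}

# Pipeline: impute missing values, then select features
pipe = make_pipeline(imputer, sfs)
rs = GridSearchCV(estimator=pipe,
                  param_grid=lr_param_grid,
                  cv=cv,
                  scoring="roc_auc",
                  error_score="raise")

# Generate random data for binary classification
X, y = make_classification(
    n_samples=10,  # Number of samples
    n_features=3,  # Number of features
    n_informative=2,  # Number of informative features
    n_redundant=1,  # Number of redundant features
    n_clusters_per_class=1,  # Number of clusters per class
    random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rs.fit(X_train, y_train)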
I get the same error when using other classifiers, other performance metrics, and the mlxtend version of SFS.
Versions of packages:
- python = 3.10.8
- scikit-learn = 1.3.0
Answer 1
Score: 1
The issue you're encountering stems from an interaction between the SequentialFeatureSelector (SFS), GridSearchCV, and the scoring method being used.
GridSearchCV validates your model using cross-validation internally. Some scoring methods, such as 'roc_auc', need continuous scores from the model (class probabilities or decision values) rather than hard class labels. These scores are typically obtained via the classifier's predict_proba() or decision_function() methods.
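However, SFS is a feature selector (a transformer) and does not expose these methods from the classifier it encapsulates. Because SFS is the last step of your pipeline, the pipeline itself has no predict_proba() or decision_function() either. As a result, when GridSearchCV attempts to apply the 'roc_auc' scoring function, it raises an error because it cannot access the required scores.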
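One way to check this is to ask each object which prediction methods it exposes. The following is a minimal sketch, assuming scikit-learn 1.3 or a similar version; it rebuilds a pipeline shaped like the one in the question, and the `available` dictionary is a name introduced here purely for illustration.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()
sfs = SequentialFeatureSelector(clf, n_features_to_select=1)
# Same shape as the pipeline in the question: the selector is the last step.
pipe = make_pipeline(SimpleImputer(strategy="median"), sfs)

# For each method name, record whether the bare classifier, the selector,
# and the pipeline expose it (illustrative helper structure).
available = {
    name: (hasattr(clf, name), hasattr(sfs, name), hasattr(pipe, name))
    for name in ("predict", "predict_proba", "decision_function")
}

for name, (on_clf, on_sfs, on_pipe) in available.items():
    print(f"{name}: classifier={on_clf}, sfs={on_sfs}, pipeline={on_pipe}")
```

The bare LogisticRegression should report all three methods, while the selector and the pipeline that ends in it should report none, which matches the failure in the traceback.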
Similarly, if you change the scoring function to 'accuracy' or another metric that relies on the predict() method, you can expect a similar failure, since SFS does not expose the predict() method of the encapsulated classifier either.
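This is the root cause of the error message you're seeing: the required methods are not accessible because the classifier is wrapped inside the SFS. The "Got a regressor" wording in the message is somewhat misleading; in this scikit-learn version it appears to be what is reported for any estimator that is not recognized as a classifier.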
As for mlxtend, it seems probable that you're encountering the same issue. If mlxtend's SequentialFeatureSelector also does not expose the predict_proba(), decision_function(), or predict() methods of the encapsulated classifier, you would face a similar problem.
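A common way to make such a pipeline scorable is to add a classifier as the final step, after the selector, so that the pipeline gains predict(), predict_proba() and decision_function(). The sketch below is not part of the original answer: it assumes scikit-learn 1.3 or similar, uses a larger synthetic data set (200 samples) than the question so the nested cross-validation has enough rows per fold, and the extra grid key `logisticregression__class_weight` is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline

cv = StratifiedKFold(n_splits=2)

# The selector still wraps its own classifier to rank features...
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=1, scoring="accuracy", cv=cv
)
imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# ...but the pipeline now ends with a classifier, which is what the scorer calls.
pipe = make_pipeline(imputer, sfs, LogisticRegression())

param_grid = {
    # classifier used inside the selector
    "sequentialfeatureselector__estimator__class_weight": ["balanced", None],
    # final classifier that produces the scores for roc_auc
    "logisticregression__class_weight": ["balanced", None],
}

rs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=cv,
    scoring="roc_auc",
    error_score="raise",
)

# Synthetic binary classification data (sizes chosen for illustration)
X, y = make_classification(
    n_samples=200,
    n_features=3,
    n_informative=2,
    n_redundant=1,
    n_clusters_per_class=1,
    random_state=42,
)

rs.fit(X, y)
print(rs.best_params_)
```

With this layout the selector only transforms the data, and GridSearchCV scores the final LogisticRegression on the selected feature.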