Sklearn BaggingClassifier doesn't work with a pipeline(preprocessor, KNeighborsClassifier)
Question
Using sklearn, I have a pipeline that works perfectly and basically looks like this:
model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier())
model_1_KNeighborsClassifier.fit(X_train, y_train)
But if I do bagging with this pipeline as the base estimator:
model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)
model_bagging.fit(X_train,y_train)
it no longer works:
File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\__init__.py:423, in _get_column_indices(X, key)
422 try:
--> 423 all_columns = X.columns
424 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[9], line 6
1 from sklearn.ensemble import BaggingClassifier
2 model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)
----> 6 model_bagging.fit(X_train,y_train)
7 #model_bagging.score(X_test,y_test)
File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:1151, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1144 estimator._validate_params()
1146 with config_context(
1147 skip_parameter_validation=(
1148 prefer_skip_nested_validation or global_skip_validation
1149 )
1150 ):
...
428 )
429 if isinstance(key, str):
430 columns = [key]
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
It is as if bagging cannot pass the data through the pipeline's preprocessing step.
The entire code is the following:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import make_column_transformer
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
titanic = sns.load_dataset('titanic')
y = titanic['survived']
X = titanic.drop('survived', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
numerical_features = [ 'age', 'fare']
categorical_features = ['sex', 'deck', 'alone']
other_features=['pclass']
numerical_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
other_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'))
preprocessor = make_column_transformer(
    (numerical_pipeline, numerical_features),
    (categorical_pipeline, categorical_features),
    (other_pipeline, other_features),
)
processed_data=preprocessor.fit_transform(titanic)
model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier(algorithm='ball_tree',metric='manhattan',n_neighbors=11))
model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)
# These two lines work:
# model_1_KNeighborsClassifier.fit(X_train,y_train)
# print(model_1_KNeighborsClassifier.score(X_test,y_test))
model_bagging.fit(X_train,y_train)
print(model_bagging.score(X_test,y_test))
Any idea what's wrong?
Again, the pipeline itself works.
Answer 1
Score: 0
The error tells you what is wrong: "Specifying the columns using strings is only supported for pandas DataFrames".
I believe this is because estimator classes like BaggingClassifier are subclasses of BaseEstimator and validate their inputs when fit is called. Part of this validation casts X and y to NumPy arrays using sklearn.utils.check_array(). You can run this function on your own DataFrame to see that it returns a plain array without column names.
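As a minimal sketch of that check, the snippet below runs check_array on a tiny made-up DataFrame (the column names and values are illustrative stand-ins, not the Titanic data). It only shows what check_array returns; that BaggingClassifier applies the same conversion internally is the assumption made in this answer.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hypothetical toy frame standing in for X_train (numeric columns only).
df = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]})

# Input validation hands back a plain NumPy array.
arr = check_array(df)

print(type(arr).__name__)       # ndarray
print(hasattr(arr, "columns"))  # False: the column names are gone
```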
The net result is that when you pass a DataFrame into a pipeline with the preprocessor as the first step, the column transformer can see your feature names. But when you wrap everything in the bagging classifier, the names are stripped by its validation step before the data reaches the pipeline.
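The effect on the preprocessor can be reproduced without any bagging. The sketch below, again on a made-up two-column frame, fits a column transformer that selects columns by name, first on a DataFrame and then on the same values as a NumPy array. The second call is expected to raise the ValueError from the question, although the exact wording of the message may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; only the column names matter here.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]})


def make_by_name():
    # Columns are selected with strings, as in the question's preprocessor.
    return make_column_transformer((StandardScaler(), ["age", "fare"]))


# With a DataFrame the string selectors can be resolved.
scaled = make_by_name().fit_transform(df)

# With a bare array there are no column names to match against.
try:
    make_by_name().fit_transform(df.to_numpy())
    failed = False
except ValueError as exc:
    failed = True
    print(f"ValueError: {exc}")
```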
I think using positional indices instead of column names will work, but there are probably other ways. For example, you could give the KNN directly to the bagging classifier, then put the result in the pipeline:
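knn = KNeighborsClassifier(algorithm='ball_tree',
                           metric='manhattan',
                           n_neighbors=11)
# In scikit-learn 1.2 and later, base_estimator is named estimator.
classifier = BaggingClassifier(base_estimator=knn, n_estimators=10)
# The preprocessor sees the DataFrame first; bagging only sees its array output.
pipeline = make_pipeline(preprocessor, classifier)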
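For a version that runs end to end without downloading the Titanic data, here is a sketch of the same arrangement on a synthetic DataFrame. The data, the column subset, and the variable names are made up for illustration, and the KNN is passed to BaggingClassifier positionally so the code does not depend on whether the keyword is base_estimator or estimator in your scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic frame (hypothetical values).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(1, 80, n)
age[::10] = np.nan  # some missing ages, as in the real data
X = pd.DataFrame({
    "age": age,
    "fare": rng.uniform(5, 100, n),
    "sex": rng.choice(["male", "female"], n),
    "pclass": rng.integers(1, 4, n),
})
y = (X["sex"] == "female").astype(int)  # toy target derived from one feature

# Columns are selected by name, so this step must receive a DataFrame.
preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["age", "fare"]),
    (OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    (SimpleImputer(strategy="most_frequent"), ["pclass"]),
)

# Bagging wraps only the KNN and receives the preprocessor's array output.
knn = KNeighborsClassifier(n_neighbors=11, metric="manhattan")
bagged_knn = BaggingClassifier(knn, n_estimators=10, random_state=0)
model = make_pipeline(preprocessor, bagged_knn)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

One design consequence to be aware of: in this layout the preprocessor is fitted once on the full training set, whereas bagging the whole pipeline would refit the preprocessing on every bootstrap sample.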