Sklearn BaggingClassifier doesn't work with a pipeline(preprocessor, KNeighborsClassifier)

Question

Using sklearn, I have a pipeline that works perfectly and basically looks and works like this:

model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier())

model_1_KNeighborsClassifier.fit(X_train, y_train)

But if I do bagging using this pipeline:

model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)

model_bagging.fit(X_train,y_train)

It doesn't work anymore:

File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\__init__.py:423, in _get_column_indices(X, key)
    422 try:
--> 423     all_columns = X.columns
    424 except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[9], line 6
      1 from sklearn.ensemble import BaggingClassifier
      2 model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)
----> 6 model_bagging.fit(X_train,y_train)
      7 #model_bagging.score(X_test,y_test)

File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:1151, in _fit_context..decorator..wrapper(estimator, *args, **kwargs)
   1144     estimator._validate_params()
   1146 with config_context(
   1147     skip_parameter_validation=(
   1148         prefer_skip_nested_validation or global_skip_validation
   1149     )
   1150 ):
...
    428     )
    429 if isinstance(key, str):
    430     columns = [key]

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

It's as if the bagging wrapper cannot pass the data through the pipeline's preprocessor.

The entire code is the following:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import BaggingClassifier


titanic = sns.load_dataset('titanic')

y = titanic['survived']
X = titanic.drop('survived', axis=1) 

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

numerical_features = [ 'age', 'fare'] 
categorical_features = ['sex', 'deck', 'alone'] 
other_features=['pclass']


numerical_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
other_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent')) 


preprocessor = make_column_transformer((numerical_pipeline, numerical_features),
                                   (categorical_pipeline, categorical_features),
                                   (other_pipeline, other_features),)  

processed_data=preprocessor.fit_transform(titanic)


model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier(algorithm='ball_tree',metric='manhattan',n_neighbors=11))


model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)


""" here those 2 lines work :
model_1_KNeighborsClassifier.fit(X_train,y_train)
print(model_1_KNeighborsClassifier.score(X_test,y_test)) """

model_bagging.fit(X_train,y_train)
print(model_bagging.score(X_test,y_test)) 

Any idea what's wrong?

Again, the pipeline itself works.

Answer 1

Score: 0

The error tells you what is wrong: Specifying the columns using strings is only supported for pandas DataFrames.

I believe this is because estimator classes (like BaggingClassifier) are subclasses of BaseEstimator, which performs validation on its inputs. Part of this process casts X and y to NumPy arrays using sklearn.utils.check_array(). You can try running this function on your own DataFrame to see that it produces arrays.
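
For instance, a quick way to see this conversion for yourself (a minimal sketch; the toy DataFrame and its column names are just for illustration):

import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'age': [22.0, 38.0], 'fare': [7.25, 71.28]})
arr = check_array(df)            # converts the DataFrame to a plain NumPy array
print(type(arr))                 # <class 'numpy.ndarray'>
print(hasattr(arr, 'columns'))   # False -- exactly what the traceback complains about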

The net result is that when you pass a DataFrame into a pipeline whose first step is the preprocessor, the ColumnTransformer can see your feature names. But when you wrap everything in the bagging classifier, the names are stripped by its validation step before the data ever reaches the pipeline.

I think using positional indices instead of string names will work (a rough sketch follows the code below), but there are probably other ways. For example, you could give the KNN directly to the bagging classifier, then put the result in the pipeline:

knn = KNeighborsClassifier(algorithm='ball_tree',
                           metric='manhattan',
                           n_neighbors=11)
classifier = BaggingClassifier(base_estimator=knn, n_estimators=10)
pipeline = make_pipeline(preprocessor, classifier)
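
The wrapped pipeline is then trained on the raw DataFrame exactly like the original model:

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))

The advantage of this ordering is that the preprocessor is fitted once on the DataFrame (so it can keep using column names), and bagging only resamples the already-encoded numeric matrix.

As for the positional-index route mentioned above, here is a rough, untested sketch. The integer positions are assumptions based on the column order of seaborn's titanic dataset after dropping 'survived', it reuses numerical_pipeline and other_pipeline from the question, and handle_unknown='ignore' is added because each bagged estimator fits its own encoder on a bootstrap sample:

# Column positions in X = titanic.drop('survived', axis=1):
# 0 pclass, 1 sex, 2 age, 5 fare, 10 deck, 13 alone
numerical_features = [2, 5]
categorical_features = [1, 10, 13]
other_features = [0]

categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                     OneHotEncoder(handle_unknown='ignore'))

preprocessor = make_column_transformer((numerical_pipeline, numerical_features),
                                       (categorical_pipeline, categorical_features),
                                       (other_pipeline, other_features))

model = make_pipeline(preprocessor, KNeighborsClassifier(algorithm='ball_tree',
                                                         metric='manhattan',
                                                         n_neighbors=11))
model_bagging = BaggingClassifier(base_estimator=model, n_estimators=10)
model_bagging.fit(X_train, y_train)   # the ColumnTransformer now selects columns by position,
                                      # so the NumPy array produced by BaggingClassifier is fine
print(model_bagging.score(X_test, y_test))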

huangapple
  • Posted on 2023-08-08 22:39:52
  • If reposting, please keep the original link: https://go.coder-hub.com/76860621.html