Sklearn BaggingClassifier doesn't work with a pipeline(preprocessor, KNeighborsClassifier)

Question

Using sklearn, I have a pipeline that works perfectly and basically looks and works like this:

model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier())

model_1_KNeighborsClassifier.fit(X_train, y_train)

But if I do bagging using this pipeline:

model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)

model_bagging.fit(X_train,y_train)

It doesn't work anymore:

File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\__init__.py:423, in _get_column_indices(X, key)
    422 try:
--> 423     all_columns = X.columns
    424 except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[9], line 6
      1 from sklearn.ensemble import BaggingClassifier
      2 model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)
----> 6 model_bagging.fit(X_train,y_train)
      7 #model_bagging.score(X_test,y_test)

File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:1151, in _fit_context..decorator..wrapper(estimator, *args, **kwargs)
   1144     estimator._validate_params()
   1146 with config_context(
   1147     skip_parameter_validation=(
   1148         prefer_skip_nested_validation or global_skip_validation
   1149     )
   1150 ):
...
    428     )
    429 if isinstance(key, str):
    430     columns = [key]

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

It's as if the bagging wrapper cannot pass the data through the pipeline's preprocessor.

The entire code is the following:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import BaggingClassifier


titanic = sns.load_dataset('titanic')

y = titanic['survived']
X = titanic.drop('survived', axis=1) 

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

numerical_features = [ 'age', 'fare'] 
categorical_features = ['sex', 'deck', 'alone'] 
other_features=['pclass']


numerical_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
other_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent')) 


preprocessor = make_column_transformer((numerical_pipeline, numerical_features),
                                   (categorical_pipeline, categorical_features),
                                   (other_pipeline, other_features),)  

processed_data=preprocessor.fit_transform(titanic)


model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier(algorithm='ball_tree',metric='manhattan',n_neighbors=11))


model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)


""" here those 2 lines work :
model_1_KNeighborsClassifier.fit(X_train,y_train)
print(model_1_KNeighborsClassifier.score(X_test,y_test)) """

model_bagging.fit(X_train,y_train)
print(model_bagging.score(X_test,y_test)) 

Any idea what's wrong?

Again, the pipeline itself works.

Answer 1

Score: 0

The error tells you what is wrong: Specifying the columns using strings is only supported for pandas DataFrames.

I believe this is because estimator classes (like BaggingClassifier) are subclasses of BaseEstimator, which performs validation on its inputs. Part of this process casts X and y to NumPy arrays using sklearn.utils.check_array(). You can try running this function on your own DataFrame to see that it produces arrays.
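
For instance, a quick way to see this conversion for yourself (a minimal sketch; the toy DataFrame and its column names are just for illustration):

import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'age': [22.0, 38.0], 'fare': [7.25, 71.28]})
arr = check_array(df)            # converts the DataFrame to a plain NumPy array
print(type(arr))                 # <class 'numpy.ndarray'>
print(hasattr(arr, 'columns'))   # False -- exactly what the traceback complains about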

The net result is that when you pass a DataFrame into a pipeline whose first step is the preprocessor, the ColumnTransformer can see your feature names. But when you wrap everything in the bagging classifier, the names are stripped by its validation step before the data ever reaches the pipeline.

I think using positional indices instead of string names will work (a rough sketch follows the code below), but there are probably other ways. For example, you could give the KNN directly to the bagging classifier, then put the result in the pipeline:

knn = KNeighborsClassifier(algorithm='ball_tree',
                           metric='manhattan',
                           n_neighbors=11)
classifier = BaggingClassifier(base_estimator=knn, n_estimators=10)
pipeline = make_pipeline(preprocessor, classifier)
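
The wrapped pipeline is then trained on the raw DataFrame exactly like the original model:

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))

The advantage of this ordering is that the preprocessor is fitted once on the DataFrame (so it can keep using column names), and bagging only resamples the already-encoded numeric matrix.

As for the positional-index route mentioned above, here is a rough, untested sketch. The integer positions are assumptions based on the column order of seaborn's titanic dataset after dropping 'survived', it reuses numerical_pipeline and other_pipeline from the question, and handle_unknown='ignore' is added because each bagged estimator fits its own encoder on a bootstrap sample:

# Column positions in X = titanic.drop('survived', axis=1):
# 0 pclass, 1 sex, 2 age, 5 fare, 10 deck, 13 alone
numerical_features = [2, 5]
categorical_features = [1, 10, 13]
other_features = [0]

categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                     OneHotEncoder(handle_unknown='ignore'))

preprocessor = make_column_transformer((numerical_pipeline, numerical_features),
                                       (categorical_pipeline, categorical_features),
                                       (other_pipeline, other_features))

model = make_pipeline(preprocessor, KNeighborsClassifier(algorithm='ball_tree',
                                                         metric='manhattan',
                                                         n_neighbors=11))
model_bagging = BaggingClassifier(base_estimator=model, n_estimators=10)
model_bagging.fit(X_train, y_train)   # the ColumnTransformer now selects columns by position,
                                      # so the NumPy array produced by BaggingClassifier is fine
print(model_bagging.score(X_test, y_test))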

huangapple
  • Posted on 2023-08-08 22:39:52
  • If reposting, please keep the original link: https://go.coder-hub.com/76860621.html