Feature importance scores with GridSearchCV


Question

I am trying to get the feature importance scores of my variables. Besides the raw values, I want to link each score to its column name and build a dataframe.

This is how I am using the GridSearchCV function:

grid_search = GridSearchCV(model, parameters, cv=10) #tuning
grid_search.fit(X_train2, y_train2.values.ravel())

Then I run this line, which outputs an array of importance scores:

grid_search.best_estimator_.named_steps["regressor"].feature_importances_

However, this array contains more scores than columns in my dataset, so I don't know how to match them up. Is there any way to directly output the importance scores with the assigned column names from the GridSearchCV function?

Also, this is what my model's pipeline looks like, for reference:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier

# Define the pipeline
numeric_transformer = Pipeline(steps=[
    #('imputer', KNNImputer()), # for missing values
    ('scaler', StandardScaler()), # standardizing
    #('scaler', MinMaxScaler()), # normalizing
    #('to_df', FunctionTransformer(lambda x: pd.DataFrame(x, columns=X_train2.select_dtypes(include=['int64', 'float64']).columns)))
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, X_train2.select_dtypes(include=['int64', 'float64']).columns.tolist()),
        ('cat', categorical_transformer, categorical_cols),
        # ("pca", PCA(random_state=548, n_components=25), indices_pca),
        # ('smotenc', SmoteNCWrapper(categorical_features=[1, 2], random_state=548)),
    ])

model = Pipeline(steps=[
    # ('over', SMOTE(random_state=548)),
    # ('smotenc', SmoteNCWrapper(categorical_features=[1, 2], random_state=548)),
    ('preprocessor', preprocessor),
    ('regressor', XGBClassifier(random_state=548))
])

parameters = {
    'regressor__n_estimators': [1000],
    'regressor__max_depth': [5],
    'regressor__learning_rate': [0.01],
    # 'regressor__num_leaves': [31],
    # 'regressor__min_child_samples': [20],
    'regressor__reg_alpha': [0.1],
    'regressor__reg_lambda': [0.1]
}
Answer 1

Score: 1


You are getting "more" features because the one-hot encoder expands each categorical column into several dummy features. You can recover the expanded names with ".get_feature_names_out":

# Get feature names after one-hot encoding.
# Use the fitted preprocessor inside best_estimator_; the preprocessor you
# defined yourself is never fitted, because GridSearchCV clones the pipeline.
fitted_preprocessor = grid_search.best_estimator_.named_steps['preprocessor']
feature_names = fitted_preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_cols)
numeric_cols = X_train2.select_dtypes(include=['int64', 'float64']).columns.tolist()
# get_feature_names_out returns a numpy array, so convert it before concatenating
all_feature_names = numeric_cols + feature_names.tolist()
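
From there, pairing the names with the scores is straightforward. A minimal sketch, reusing grid_search and all_feature_names from above; the order lines up because the ColumnTransformer outputs the 'num' block first and the 'cat' block second, exactly as the list was built:

import pandas as pd

# Match each transformed column name to its importance score
importances = grid_search.best_estimator_.named_steps['regressor'].feature_importances_
importance_df = (
    pd.DataFrame({'feature': all_feature_names, 'importance': importances})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
print(importance_df.head(10))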

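On a recent scikit-learn (1.1 or later, since the column groups here are wrapped in Pipelines), you can also skip the manual bookkeeping and ask the fitted ColumnTransformer for all names in one call; note that they come back prefixed with the transformer name:

# Alternative: one call on the fitted preprocessor (scikit-learn >= 1.1).
# Returns names like 'num__age' and 'cat__color_red', in transformed-column order.
all_feature_names = grid_search.best_estimator_.named_steps['preprocessor'].get_feature_names_out()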