Feature importance scores with GridSearchCV

Question

I am trying to get the feature importance scores of my variables. In addition to the actual values, I want to link them to the column names and build a DataFrame.

This is how I am using the GridSearchCV function:

grid_search = GridSearchCV(model, parameters, cv=10) #tuning
grid_search.fit(X_train2, y_train2.values.ravel())

Then I run this line, which outputs an array of importance scores:

grid_search.best_estimator_.named_steps["regressor"].feature_importances_

However, this array contains more scores than there are columns in my dataset, so I don't know how to match them up. Is there any way to directly output the importance scores together with their column names from the GridSearchCV results?

Also, this is what my model's pipeline looks like, for reference:

# Define the pipeline
numeric_transformer = Pipeline(steps=[
    #('imputer', KNNImputer()), # for missing values
    ('scaler', StandardScaler()), # standardizing
    #('scaler', MinMaxScaler()), # normalizing
    #('to_df', FunctionTransformer(lambda x: pd.DataFrame(x, columns=X_train2.select_dtypes(include=['int64', 'float64']).columns)))
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, X_train2.select_dtypes(include=['int64', 'float64']).columns.tolist()),
        ('cat', categorical_transformer, categorical_cols),
        # ("pca", PCA(random_state=548, n_components=25), indices_pca),
        # ('smotenc', SmoteNCWrapper(categorical_features=[1, 2], random_state=548)),
    ])

model = Pipeline(steps=[
    # ('over', SMOTE(random_state=548)),
    # ('smotenc', SmoteNCWrapper(categorical_features=[1, 2], random_state=548)),
    ('preprocessor', preprocessor),
    ('regressor', XGBClassifier(random_state=548))
])

parameters = { 'regressor__n_estimators': [1000],
    'regressor__max_depth': [5],
    'regressor__learning_rate': [0.01],
    # 'regressor__num_leaves': [31],
    # 'regressor__min_child_samples': [20],
    'regressor__reg_alpha': [0.1], 
    'regressor__reg_lambda': [0.1]
}

Answer 1

Score: 1

You are getting "more" features because the one-hot encoding creates a dummy feature for each category. You can recover their names using .get_feature_names_out():

# Use the fitted preprocessor from the best estimator; the standalone
# 'preprocessor' object was never fitted, so named_transformers_ is unavailable on it
fitted_preprocessor = grid_search.best_estimator_.named_steps['preprocessor']

# Get feature names after one-hot encoding
feature_names = fitted_preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_cols)

# 'num' comes before 'cat' in the ColumnTransformer, so numeric names go first;
# .tolist() is needed because get_feature_names_out returns a numpy array
numeric_cols = X_train2.select_dtypes(include=['int64', 'float64']).columns.tolist()
all_feature_names = numeric_cols + feature_names.tolist()
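
If the goal is a single DataFrame of importances matched to names, here is a minimal sketch, assuming scikit-learn >= 1.0, where the fitted ColumnTransformer exposes get_feature_names_out() directly and returns columns in the same 'num'-then-'cat' order the regressor saw:

import pandas as pd

best = grid_search.best_estimator_

# All output column names, prefixed by transformer name (e.g. 'num__age', 'cat__color_red')
names = best.named_steps['preprocessor'].get_feature_names_out()
importances = best.named_steps['regressor'].feature_importances_

importance_df = (pd.DataFrame({'feature': names, 'importance': importances})
                 .sort_values('importance', ascending=False)
                 .reset_index(drop=True))
print(importance_df.head(10))

This avoids concatenating the name lists by hand, so the ordering cannot drift out of sync with feature_importances_.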
