Feature importance scores with GridSearchCV

Question
I am trying to get the feature importance scores of my variables. In addition to the raw values, I want to link them to the column names and build a dataframe.
This is how I am using the GridSearchCV function:
grid_search = GridSearchCV(model, parameters, cv=10) #tuning
grid_search.fit(X_train2, y_train2.values.ravel())
Then I access this attribute, which returns an array of importance scores:
grid_search.best_estimator_.named_steps["regressor"].feature_importances_
However, this array contains more scores than columns in my dataset, so I don't know how to match them up. Is there any way to directly output the importance scores with the assigned column names from the GridSearchCV function?
Also, this is what my model's pipeline looks like, for reference:
# Define the pipeline
numeric_transformer = Pipeline(steps=[
    #('imputer', KNNImputer()),  # for missing values
    ('scaler', StandardScaler()),  # standardizing
    #('scaler', MinMaxScaler()),  # normalizing
    #('to_df', FunctionTransformer(lambda x: pd.DataFrame(x, columns=X_train2.select_dtypes(include=['int64', 'float64']).columns)))
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, X_train2.select_dtypes(include=['int64', 'float64']).columns.tolist()),
        ('cat', categorical_transformer, categorical_cols),
        # ("pca", PCA(random_state=548, n_components=25), indices_pca),
        # ('smotenc', SmoteNCWrapper(categorical_features=[1, 2], random_state=548)),
    ])

model = Pipeline(steps=[
    # ('over', SMOTE(random_state=548)),
    # ('smotenc', SmoteNCWrapper(categorical_features=[1, 2], random_state=548)),
    ('preprocessor', preprocessor),
    ('regressor', XGBClassifier(random_state=548))
])

parameters = {
    'regressor__n_estimators': [1000],
    'regressor__max_depth': [5],
    'regressor__learning_rate': [0.01],
    # 'regressor__num_leaves': [31],
    # 'regressor__min_child_samples': [20],
    'regressor__reg_alpha': [0.1],
    'regressor__reg_lambda': [0.1]
}
Answer 1
Score: 1
You are getting "more" features because your one-hot encoding creates dummy features, one per category. You can recover the expanded names with ".get_feature_names_out":
# Get feature names after one-hot encoding. Use the *fitted* preprocessor
# inside best_estimator_: GridSearchCV clones the pipeline before fitting,
# so the original `preprocessor` object is never fitted itself.
fitted_preprocessor = grid_search.best_estimator_.named_steps['preprocessor']
feature_names = fitted_preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_cols)
numeric_cols = X_train2.select_dtypes(include=['int64', 'float64']).columns.tolist()
# get_feature_names_out returns a numpy array, so convert before concatenating
all_feature_names = numeric_cols + feature_names.tolist()