Sklearn Random Forest: determine the name of features ascertained by parameter grid for model fit and prediction
Question
New to ML here and trying my hand at fitting a model using Random Forest. Here is my simplified code:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=42)

model = RandomForestRegressor()
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7],
    'max_features': [3, 5, 7],
    'random_state': [42]
}
Next, I perform a grid search for the best parameters:
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
This yields the output:
{'max_depth': 7, 'max_features': 3, 'n_estimators': 500, 'random_state': 42}
Next, I run predictions with the best model. I get R2 = 0.998 for both the train and test data:
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
Question:
The above code did ascertain 'max_features' to be 3.
- I suppose those 3 features were used to fit the model and then calculate R2. Is that right?
- If #1 is correct, then how do I print the 3 features which were used for the best prediction and obtain an R2 of 0.998?
Answer 1
Score: 0
See this thread: https://stackoverflow.com/questions/23939750/understanding-max-features-parameter-in-randomforestregressor
At each split, you select max_features features from all the features at random, without replacement.
In the end, you use all of your features.
Your #1 point implicitly suggests that you do feature selection to remove the least important features, which is not the case.
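To see this concretely, here is a minimal sketch on synthetic data (the dataset, shapes, and parameter values are illustrative assumptions, not taken from the question). Even with max_features=3, the splits across all the trees typically end up touching every feature:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 7 features, mirroring the grid-search result max_features=3
X, y = make_regression(n_samples=200, n_features=7, random_state=42)
forest = RandomForestRegressor(n_estimators=100, max_features=3, random_state=42)
forest.fit(X, y)

# tree_.feature holds the feature index split on at each internal node
# (negative values mark leaves, so they are filtered out)
used = set()
for tree in forest.estimators_:
    used.update(f for f in tree.tree_.feature if f >= 0)
print(sorted(used))  # typically all 7 feature indices appear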
Answer 2
Score: 0
"RandomForestRegressor
"中的"max_features
"参数并不是指模型使用的前3个最重要的特征,而是确定在寻找最佳分裂时要考虑的特征数量。
具体来说:
- 如果"
max_features
"是一个整数,那么在每次分裂时考虑那么多的特征。 - 如果"
max_features
"是一个浮点数,它表示一个百分比,每次分裂时考虑"int(max_features * n_features)
"个特征。 - 如果"
max_features
"是"auto
",那么"max_features=sqrt(n_features)
"。 - 如果"
max_features
"是"sqrt
",那么"max_features=sqrt(n_features)
"(与"auto
"相同)。 - 如果"
max_features
"是"log2
",那么"max_features=log2(n_features)
"。 - 如果"
max_features
"是"None
",那么"max_features=n_features
"。
因此,当你在最佳参数中找到"max_features: 3
"时,意味着随机森林在每个节点上使用3个特征来进行最佳分裂,但不一定是每次都使用相同的3个特征。在你的随机森林中,每棵树和每次分裂的特征可能会有所变化。
在随机森林的上下文中,你可以获取特征重要性,这会给你每个特征在进行预测时的重要性分数。以下是如何操作的示例:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_names = features.columns
important_features_dict = {}
for x in range(len(feature_names)):
important_features_dict[feature_names[x]] = feature_importances[x]
important_features_list = sorted(important_features_dict,
key=important_features_dict.get,
reverse=True)
print('Most important features: %s' %important_features_list[:3])
这将为你提供随机森林中所有树中最重要的前3个特征,而不是在模型中的每个单独分裂中使用的特定3个特征。你应该将其解释为模型总体上认为哪些特征重要,而不是特定指示模型在任何特定决策中使用了哪3个特征。
英文:
The max_features parameter in RandomForestRegressor does not refer to the top 3 most important features used by the model; rather, it determines the number of features to consider when looking for the best split.
Specifically (a short numeric illustration follows this list):
- If max_features is an integer, then that many features are considered at each split.
- If max_features is a float, it is treated as a fraction, and int(max_features * n_features) features are considered at each split.
- If max_features is 'sqrt', then max_features=sqrt(n_features).
- If max_features is 'log2', then max_features=log2(n_features).
- If max_features is None, then max_features=n_features.
- The legacy 'auto' option meant sqrt(n_features) for classifiers but n_features for regressors, and has since been removed from sklearn.
So, when you find 'max_features': 3 in your best parameters, it means that the random forest considers 3 randomly chosen candidate features when looking for the best split at each node, not necessarily the same 3 features each time. The candidate features can change for every tree and every split in your random forest.
In the context of random forests, you can get the feature importances, which give you an importance score for each feature in making predictions. Here is how you can do it:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_names = features.columns

# Map each feature name to its importance score
important_features_dict = {}
for x in range(len(feature_names)):
    important_features_dict[feature_names[x]] = feature_importances[x]

# Sort feature names by importance, highest first
important_features_list = sorted(important_features_dict,
                                 key=important_features_dict.get,
                                 reverse=True)

print('Most important features: %s' % important_features_list[:3])
This gives you the top 3 features that are most important across all trees in the forest, not necessarily the ones used at each individual split. You should interpret this as a general measure of which features the model considers important overall, rather than a specific indication of which 3 features were used in any particular decision within the model.
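As a side note, the same top-3 ranking can be written more compactly with pandas (assuming features is the DataFrame and grid_search the fitted search object from the question):
import pandas as pd

# Series indexed by feature name, so sorting keeps names and scores together
importances = pd.Series(grid_search.best_estimator_.feature_importances_, index=features.columns)
print(importances.nlargest(3))  # the three largest importance scores with their feature names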