

Sklearn Random Forest: determine the name of features ascertained by parameter grid for model fit and prediction


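  1. For the code above, the grid search did select 'max_features' = 3.

  2. That setting does not single out three specific features, but you can print the three features with the highest importance in the best model, along with the R2 values (0.998 here), like this:

    from sklearn.metrics import r2_score

    # Get the best model found by the grid search
    best_model = grid_search.best_estimator_
    # Get the names of the feature columns used for training
    feature_names = list(features.columns)
    # Get the indices of the three highest importances, largest first
    best_feature_indices = best_model.feature_importances_.argsort()[-3:][::-1]
    # Look up the names of those features
    best_features = [feature_names[i] for i in best_feature_indices]
    # Print each of those features with its importance
    for feature, importance in zip(best_features, best_model.feature_importances_[best_feature_indices]):
        print(f"Feature: {feature}, Importance: {importance}")
    # Compute and print the R2 values
    y_train_pred = best_model.predict(X_train)
    y_test_pred = best_model.predict(X_test)
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    print(f"Train R2: {train_r2}, Test R2: {test_r2}")

This code prints the names and importances of the three highest-ranked features, followed by the R2 values for the training and test sets.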


New to ML here and trying my hand at fitting a model with Random Forest. Here is my simplified code:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split

    # `features` (a DataFrame) and `target` are defined earlier
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=42)
    model = RandomForestRegressor()
    param_grid = {
        'n_estimators': [100, 200, 500],
        'max_depth': [3, 5, 7],
        'max_features': [3, 5, 7],
        'random_state': [42]
    }

Next, I perform grid search for the best parameters:

    grid_search = GridSearchCV(model, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print(grid_search.best_params_)

This yields the output:

    {'max_depth': 7, 'max_features': 3, 'n_estimators': 500, 'random_state': 42}

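Next, I predict with the best model. I get R2 = 0.998 for both the test and the training data:

    from sklearn.metrics import r2_score

    # Best model from the grid search (refit on the whole training set by default)
    best_model = grid_search.best_estimator_
    y_train_pred = best_model.predict(X_train)
    y_test_pred = best_model.predict(X_test)
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)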


The grid search above settled on 'max_features' = 3.

  1. I suppose those 3 features were used to fit the model and make the predictions from which R2 was calculated. Is that right?
  2. If #1 is correct, how do I print the 3 features that were used for the best prediction and gave an R2 of 0.998?


Score: 0





See this thread:

At each split, max_features features are drawn at random, without replacement, from all the features, and only those are considered as split candidates.

In the end, the forest uses all of your features.

Your point #1 implicitly assumes that feature selection takes place and removes the least important features, which is not the case.
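To make the sampling concrete, here is a minimal sketch using only the standard library. It is a simplification and not scikit-learn's actual implementation: `simulate_split_candidates` is a hypothetical helper that draws a fresh subset of `max_features` indices for every split and records which features were ever offered as candidates.

```python
import random


def simulate_split_candidates(n_features, max_features, n_splits, seed=0):
    """Return the set of feature indices offered as candidates over n_splits splits."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_splits):
        # Each split draws max_features distinct indices (no replacement within a split).
        candidates = rng.sample(range(n_features), max_features)
        seen.update(candidates)
    return seen


one_split = simulate_split_candidates(n_features=10, max_features=3, n_splits=1)
many_splits = simulate_split_candidates(n_features=10, max_features=3, n_splits=200)
print(f"Features offered in a single split: {len(one_split)}")
print(f"Distinct features offered over 200 splits: {len(many_splits)}")
```

A single split only ever sees 3 candidates, but over the many splits of a forest every feature gets its turn, which is why no feature is dropped from the model.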


Score: 0






The 'max_features' parameter in the RandomForestRegressor does not refer to the top 3 most important features used by the model but rather determines the number of features to consider when looking for the best split.


  • If max_features is an integer, that many features are considered at each split.
  • If max_features is a float, it is a fraction, and max(1, int(max_features * n_features)) features are considered at each split.
  • If max_features is sqrt, then max_features=sqrt(n_features).
  • If max_features is log2, then max_features=log2(n_features).
  • If max_features is None, then max_features=n_features.
  • The older auto option meant max_features=n_features for RandomForestRegressor (it meant sqrt(n_features) only for the classifier); it was deprecated in scikit-learn 1.1 and removed in 1.3.
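The rules in this list can be written out as a small function. This is only an illustrative sketch: `resolve_max_features` is a hypothetical helper, not part of scikit-learn, and the `max(1, ...)` floor reflects my understanding of how recent scikit-learn releases behave. The auto option is left out because it no longer exists in recent releases.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: how many candidate features are considered per split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # A float is read as a fraction of the total number of features.
        return max(1, int(max_features * n_features))
    raise ValueError(f"Unsupported max_features: {max_features!r}")


for value in (3, 0.5, "sqrt", "log2", None):
    print(repr(value), "->", resolve_max_features(value, n_features=16))
```

In every case the result is a count of candidates per split, never a list of particular features.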

So, when you find 'max_features': 3 in your best parameters, it means that the random forest considers 3 randomly chosen candidate features when searching for the best split at each node, not necessarily the same 3 features each time. The candidates change from split to split and from tree to tree in your random forest.
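You can check this on a fitted forest by looking at which feature each node actually splits on. The sketch below assumes synthetic data from `make_regression` as a stand-in for your `features` and `target`, and smaller settings than in your grid so that it runs quickly. It reads the `tree_.feature` array of each fitted tree, where negative entries mark leaf nodes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 8 features (an assumption, not the question's data).
X, y = make_regression(n_samples=300, n_features=8, n_informative=8,
                       noise=0.1, random_state=42)

forest = RandomForestRegressor(n_estimators=20, max_depth=7,
                               max_features=3, random_state=42)
forest.fit(X, y)

# Features used for splits in the first few trees.
for i, tree in enumerate(forest.estimators_[:3]):
    used = sorted({int(f) for f in tree.tree_.feature if f >= 0})
    print(f"Tree {i} splits on features: {used}")

# Features used for splits anywhere in the forest.
used_anywhere = set()
for tree in forest.estimators_:
    used_anywhere.update(int(f) for f in tree.tree_.feature if f >= 0)
print(f"Distinct features used across the forest: {len(used_anywhere)}")
```

Even with max_features=3, the forest as a whole splits on more than three different features.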

With a random forest you can get the feature importances, which give you a score for how much each feature contributes to the model's predictions. Here is how you can do it:

    feature_importances = grid_search.best_estimator_.feature_importances_
    feature_names = features.columns
    # Map each feature name to its importance score
    important_features_dict = dict(zip(feature_names, feature_importances))
    # Sort the feature names by importance, highest first
    important_features_list = sorted(important_features_dict,
                                     key=important_features_dict.get,
                                     reverse=True)
    print('Most important features: %s' % important_features_list[:3])

This gives you the top 3 features that are most important across all trees in the forest, not necessarily the ones used at each individual split. You should interpret this as a general measure of which features the model considers important overall, rather than a specific indication of which 3 features were used in any particular decision within the model.
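To see why the top 3 is only a ranking cut-off, it can help to print the whole ranking. The sketch below again assumes synthetic data and made-up column names (`f0` to `f7`) in place of `features.columns`. It relies on the importances of a fitted forest being normalized so that they sum to 1.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 8 features, of which 4 carry signal (an assumption).
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=0.1, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical column names

forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

# Rank every feature, not only the first three.
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")

print(f"Sum of importances: {importances.sum():.3f}")
print(f"Features with non-zero importance: {int(np.count_nonzero(importances))}")
```

Features outside the top 3 still carry non-zero importance, so they still take part in the predictions.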

  • Published on July 6, 2023, 12:45:41
