How to get the list of features which are actually used by XGBoost


Question

If the input data has 300 features, and I set up XGBoost with 3 trees and a depth of 3 levels for each tree, then surely not all 300 features can be used. But when I call model.get_booster().feature_names, all 300 features are returned.

My guess is that model.get_booster().feature_names returns all the features in the training data, not the features actually used by the XGBoost model.

Is there a way to check which variables are actually used by the model? Thank you very much in advance!

Answer 1

Score: 2

The Booster.feature_names attribute is a description of the training dataset: which features, and in which order.

In principle, you could query feature importances as model.feature_importances_. The thinking goes that "used" features have non-zero importance values, and "unused" features have zero values.
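
For example, a minimal sketch using the scikit-learn wrapper API (assuming a fitted estimator named `model`; the name is hypothetical, not from the question):

```python
# Assumes `model` is a fitted xgboost.XGBClassifier / XGBRegressor.
importances = model.feature_importances_  # note the trailing underscore

# feature_names is None when the model was fit on a plain numpy array;
# fall back to the positional "f0", "f1", ... names in that case.
names = model.get_booster().feature_names
if names is None:
    names = [f"f{i}" for i in range(len(importances))]

used = [name for name, imp in zip(names, importances) if imp > 0]
print(used)
```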

The correct approach would be to traverse the XGBoost tree data structure and collect the node split indices (which correspond to column indices in your training dataset). If your model config is n_estimators = 3 and max_depth = 3, then, by definition, there can be at most 3 * (2^3 - 1) = 21 unique "used" features, since a binary tree of depth 3 contains at most 7 split nodes.

It's probably hard to implement such a tree traversal in Python code, because the internal state of the Booster object is not exposed via public (Python-facing) APIs. But you could dump the Booster object in JSON data format and parse/traverse that.
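
A minimal sketch of that approach (again assuming a fitted model named `model`): Booster.get_dump(dump_format="json") returns one JSON string per tree, where split nodes carry a "split" key and leaf nodes carry a "leaf" key.

```python
import json

booster = model.get_booster()

def collect_split_features(node, used):
    # Split nodes have a "split" key holding the feature name
    # ("f0", "f1", ... when no explicit names were set);
    # leaf nodes have a "leaf" key and no children.
    if "split" in node:
        used.add(node["split"])
        for child in node.get("children", []):
            collect_split_features(child, used)

used_features = set()
for tree_json in booster.get_dump(dump_format="json"):
    collect_split_features(json.loads(tree_json), used_features)

print(sorted(used_features))
```

Recent XGBoost versions also provide booster.trees_to_dataframe(), whose "Feature" column lists the split feature of every node ("Leaf" for leaves), which yields the same information without a manual traversal.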

For reference purposes, you could export your XGBoost model into a PMML document using the JPMML-XGBoost library (or its SkLearn2PMML package front-end), and open the resulting PMML XML file in a text editor. The conversion to the PMML representation only retains "used" features.
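
A minimal export sketch with the SkLearn2PMML package (assuming the same hypothetical fitted `model`; make_pmml_pipeline wraps an already-fitted estimator for conversion):

```python
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

# Wrap the fitted estimator and write the PMML document to disk;
# open model.pmml in a text editor to see which features survived.
pipeline = make_pmml_pipeline(model)
sklearn2pmml(pipeline, "model.pmml")
```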

