How to get the list of features which are actually used by XGBoost


Question

If the input data has 300 features, and I set up XGBoost with 3 trees and a depth of 3 levels for each tree, then surely not all 300 features can be used. But when I call model.get_booster().feature_names, all 300 features are returned.

My guess is that model.get_booster().feature_names returns all the features in the training data, not the features actually used by the XGBoost model.

Is there a way to check which variables are actually used by the model? Thank you very much in advance!

Answer 1

Score: 2

The Booster.feature_names attribute is a description of the training dataset: which features, and in which order.

In principle, you could query feature importances as model.feature_importances_. The thinking goes that "used" features have non-zero importance values, and "unused" features have zero values.
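
For example, a minimal sketch using the scikit-learn wrapper API (assuming a fitted estimator named `model`; the name is hypothetical, not from the question):

```python
# Assumes `model` is a fitted xgboost.XGBClassifier / XGBRegressor.
importances = model.feature_importances_  # note the trailing underscore

# feature_names is None when the model was fit on a plain numpy array;
# fall back to the positional "f0", "f1", ... names in that case.
names = model.get_booster().feature_names
if names is None:
    names = [f"f{i}" for i in range(len(importances))]

used = [name for name, imp in zip(names, importances) if imp > 0]
print(used)
```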

The correct approach would be to traverse the XGBoost tree data structure and collect the node split indices (which correspond to column indices in your training dataset). If your model config is n_estimators = 3 and max_depth = 3, then, by definition, there can be at most 3 * (2^3 - 1) = 21 unique "used" features, since a binary tree of depth 3 contains at most 7 split nodes.

It's probably hard to implement such a tree traversal in Python code, because the internal state of the Booster object is not exposed via public (Python-facing) APIs. But you could dump the Booster object in JSON data format and parse/traverse that.
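
A minimal sketch of that approach (again assuming a fitted model named `model`): Booster.get_dump(dump_format="json") returns one JSON string per tree, where split nodes carry a "split" key and leaf nodes carry a "leaf" key.

```python
import json

booster = model.get_booster()

def collect_split_features(node, used):
    # Split nodes have a "split" key holding the feature name
    # ("f0", "f1", ... when no explicit names were set);
    # leaf nodes have a "leaf" key and no children.
    if "split" in node:
        used.add(node["split"])
        for child in node.get("children", []):
            collect_split_features(child, used)

used_features = set()
for tree_json in booster.get_dump(dump_format="json"):
    collect_split_features(json.loads(tree_json), used_features)

print(sorted(used_features))
```

Recent XGBoost versions also provide booster.trees_to_dataframe(), whose "Feature" column lists the split feature of every node ("Leaf" for leaves), which yields the same information without a manual traversal.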

For reference purposes, you could export your XGBoost model into a PMML document using the JPMML-XGBoost library (or its SkLearn2PMML package front-end), and open the resulting PMML XML file in a text editor. The conversion to the PMML representation only retains "used" features.
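
A minimal export sketch with the SkLearn2PMML package (assuming the same hypothetical fitted `model`; make_pmml_pipeline wraps an already-fitted estimator for conversion):

```python
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

# Wrap the fitted estimator and write the PMML document to disk;
# open model.pmml in a text editor to see which features survived.
pipeline = make_pmml_pipeline(model)
sklearn2pmml(pipeline, "model.pmml")
```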

