How to get the list of features which are actually used by XGBoost
Question
If the input data has 300 features, and I set up XGBoost with 3 trees and a depth of 3 for each tree, not all 300 features can possibly be used. But when I call `model.get_booster().feature_names`, all 300 features are returned.
My guess is that `model.get_booster().feature_names` returns all the features in the training data, not the features actually used by the XGBoost model.
Is there a way to check which variables are actually used by the model? Thank you very much in advance!
Answer 1
Score: 2
The `Booster.feature_names` attribute describes the training dataset - which features, in which order.
In principle, you could query feature importances as `model.feature_importances_`. The thinking goes that "used" features have non-zero values, and "unused" features have zero values.
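As a minimal sketch of that check with the scikit-learn wrapper (the data shapes and model settings below are made up for illustration):

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical training data with 300 features (random values, for illustration only).
X = np.random.rand(500, 300)
y = np.random.randint(0, 2, size=500)

model = XGBClassifier(n_estimators=3, max_depth=3)
model.fit(X, y)

# Features with non-zero importance are the ones that appear in at least one split.
importances = model.feature_importances_
used_indices = np.flatnonzero(importances)
print("Features with non-zero importance:", used_indices)
```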
The correct approach would be to traverse the XGBoost tree data structure and collect node split indices (which correspond to column indices in your training dataset). If your model config is `n_estimators = 3` and `max_depth = 3`, then, by definition, there can be at most 3 * (2^3 - 1) = 21 unique "used" features, because a binary tree of depth 3 has at most 2^3 - 1 = 7 split nodes.
It's probably hard to implement such a tree traversal in Python code, because the internal state of the `Booster` object is not exposed via public (Python-facing) APIs. But you could dump the `Booster` object in JSON data format and parse/traverse that.
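A rough sketch of that JSON-based traversal, reusing the fitted `model` from the snippet above: `Booster.get_dump(dump_format="json")` returns one JSON string per tree, and internal nodes carry the split feature under the `"split"` key (the exact node layout is an assumption worth verifying against your XGBoost version):

```python
import json

def collect_used_features(booster):
    """Walk each tree's JSON dump and collect the features used in splits."""
    used = set()

    def walk(node):
        # Internal nodes carry a "split" key; leaf nodes do not.
        if "split" in node:
            used.add(node["split"])
            for child in node.get("children", []):
                walk(child)

    for tree_json in booster.get_dump(dump_format="json"):
        walk(json.loads(tree_json))
    return used

# Feature identifiers come back as "f0", "f1", ... unless feature names were set.
print(collect_used_features(model.get_booster()))
```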
For reference purposes, you could export your XGBoost model into a PMML document using the JPMML-XGBoost library (or its SkLearn2PMML package front-end), and open the resulting PMML XML file in a text editor. The conversion to the PMML representation only retains "used" features.
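A minimal sketch of that export route, assuming the `sklearn2pmml` package and a Java runtime are installed (the pipeline layout and file name below are made up):

```python
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

# Wrap the estimator in a PMMLPipeline so it can be converted to PMML.
pipeline = PMMLPipeline([("classifier", XGBClassifier(n_estimators=3, max_depth=3))])
pipeline.fit(X, y)  # X, y as in the earlier snippet

# Writes a PMML XML file; its mining schema should list only the "used" features.
sklearn2pmml(pipeline, "xgboost_model.pmml")
```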
Comments