Using smoothed labels from 0 to 1 to train an XGB classifier

Question

  1. Yes, the first code example does take the smoothed labels into account during training; it does not internally convert them to 0 or 1. The code uses the native XGBoost interface (xgb.train), and the train_label and test_label arrays hold float values between 0 and 1. With the binary:logistic objective, the native interface accepts labels anywhere in [0, 1] and uses them as given when computing the loss.

  2. The XGBClassifier from the scikit-learn wrapper does not work with smoothed labels because it expects class labels (0 or 1 for binary classification). The error you encountered indicates that it inferred invalid classes from the unique values in train_label. To use XGBClassifier you would have to convert the labels to binary first, typically with a threshold (e.g., 0.5): values above the threshold become class 1 and values below become class 0. That step discards the information carried by the fractional values, so the native XGBoost interface, as in your first code example, is the more direct way to train on smoothed labels.
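
To see why fractional labels are usable at all, it helps to write out the log-loss derivatives that a logistic objective works with. The sketch below is plain Python, not XGBoost's source, and the function names are made up for illustration. It assumes the standard formulation in which the raw score is passed through a sigmoid and the label y enters the loss unrounded.

```python
import math

def sigmoid(margin):
    """Map a raw score (margin) to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def logloss(margin, y):
    """Cross-entropy between a soft label y in [0, 1] and sigmoid(margin)."""
    p = sigmoid(margin)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def logloss_grad_hess(margin, y):
    """First and second derivative of logloss with respect to the margin."""
    p = sigmoid(margin)
    return p - y, p * (1.0 - p)

# The gradient vanishes where the predicted probability equals the label,
# so a label of 0.8 pulls the prediction towards 0.8, not towards 1.
margin_for_08 = math.log(0.8 / 0.2)  # the margin whose sigmoid is 0.8
grad, hess = logloss_grad_hess(margin_for_08, 0.8)
print(grad, hess)
```

Under this formulation a label of 0.8 and a label of 1.0 produce different gradients for the same prediction, which is what "taking the smoothed labels into account" amounts to.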


I want to train an XGB classifier using smoothed labels between 0 and 1 instead of binary labels.

The native XGB model seems to be able to accept smoothed labels for a binary classifier.

    from xgboost import XGBClassifier
    import numpy as np
    import xgboost as xgb

    train_data = np.random.rand(20, 10)
    train_label = np.random.random(20)
    dtrain = xgb.DMatrix(train_data, label=train_label)

    test_data = np.random.rand(20, 10)
    test_label = np.random.random(20)
    dtest = xgb.DMatrix(test_data, label=test_label)

    param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
    evallist = [(dtrain, 'train'), (dtest, 'eval')]
    bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=10, evals=evallist)

    [0] train-auc:0.68952 eval-auc:0.53327
    [1] train-auc:0.74847 eval-auc:0.49597
    [2] train-auc:0.79158 eval-auc:0.45795
    ...

However, when I tried to use the sklearn wrapper XGBClassifier, I got the following error.

    model = XGBClassifier(**param)
    model.fit(train_data, train_label)

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /tmp/ipykernel_12603/1675654556.py in <cell line: 1>()
    ----> 1 model.fit(train_data, train_label)

    ~/.pyenv/versions/btc-p2p/lib/python3.9/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
        618 for k, arg in zip(sig.parameters, args):
        619     kwargs[k] = arg
    --> 620 return func(**kwargs)
        621
        622 return inner_f

    ~/.pyenv/versions/btc-p2p/lib/python3.9/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
       1464     or not (self.classes_ == expected_classes).all()
       1465 ):
    -> 1466     raise ValueError(
       1467         f"Invalid classes inferred from unique values of `y`. "
       1468         f"Expected: {expected_classes}, got {self.classes_}"

    ValueError: Invalid classes inferred from unique values...

I have 2 questions here:

  1. Does the first code example actually take the smoothed labels into
    account during training, or does it just internally convert the real
    values to 0 or 1?
  2. Why doesn't XGBClassifier work with smoothed labels?
    Is it possible to get it to work?

Answer 1

Score: 1

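Answer 1: In the first code example, train_label and test_label are drawn uniformly at random from [0, 1), so they are not smoothed labels in the usual sense, and nothing in the code smooths them. XGBoost does not round them to 0 or 1, though. With objective binary:logistic the sigmoid is applied to the model's raw score, and each label enters the log loss as given, so fractional labels act as soft targets.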

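Answer 2: XGBClassifier doesn't work with smoothed labels because it expects class labels for classification tasks. It infers the classes from the unique values of y and raises the ValueError above when they are not the integers 0, 1, and so on.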
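
The check behind that error can be approximated as follows. This is a simplified reconstruction based on the traceback in the question, not the library's actual code, and `classes_look_valid` is a hypothetical helper name.

```python
import numpy as np

def classes_look_valid(y):
    """Return True if the unique values of y are exactly 0, 1, ..., k-1."""
    classes = np.unique(np.asarray(y))
    expected = np.arange(len(classes))
    return bool((classes == expected).all())

hard_labels = np.array([0, 1, 1, 0, 1])
soft_labels = np.array([0.1, 0.9, 0.8, 0.2, 0.7])

print(classes_look_valid(hard_labels))  # unique values are [0, 1]
print(classes_look_valid(soft_labels))  # five distinct floats, none an integer class
```

With 20 random floats as labels, every value is its own "class", so a check of this kind cannot pass.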

To convert smoothed labels into binary labels, you can pre-process them with a threshold value. Note that this discards the fractional information: 0.51 and 0.99 both become 1.

Smoothed to binary

    from xgboost import XGBClassifier
    import numpy as np
    import xgboost as xgb

    train_data = np.random.rand(20, 10)
    train_label = np.random.random(20)
    train_label_binary = np.where(train_label >= 0.5, 1, 0)  # Apply threshold to convert smoothed labels to binary labels
    dtrain = xgb.DMatrix(train_data, label=train_label_binary)

    test_data = np.random.rand(20, 10)
    test_label = np.random.random(20)
    test_label_binary = np.where(test_label >= 0.5, 1, 0)  # Apply threshold to convert smoothed labels to binary labels
    dtest = xgb.DMatrix(test_data, label=test_label_binary)

    param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
    evallist = [(dtrain, 'train'), (dtest, 'eval')]
    bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=10, evals=evallist)

Output:

    [0] train-auc:0.80500 eval-auc:0.51000
    [1] train-auc:0.93500 eval-auc:0.61500
    [2] train-auc:0.95000 eval-auc:0.67500
    [3] train-auc:1.00000 eval-auc:0.58000
    [4] train-auc:1.00000 eval-auc:0.57500
    [5] train-auc:1.00000 eval-auc:0.57500
    [6] train-auc:1.00000 eval-auc:0.57500
    [7] train-auc:1.00000 eval-auc:0.61500
    [8] train-auc:1.00000 eval-auc:0.60000
    [9] train-auc:1.00000 eval-auc:0.62000
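
If the fractional values matter, one possible alternative to a threshold is to duplicate every row: once with label 1 and weight y, and once with label 0 and weight 1 - y. For the log loss, the two weighted rows add up to the same loss as the single soft label. The sketch below is an assumption of mine rather than part of the answer above; `expand_soft_labels` is a hypothetical helper, and the commented-out lines rely on the `sample_weight` argument that appears in the `fit` signature in the traceback.

```python
import numpy as np

def expand_soft_labels(X, y_soft):
    """Turn n rows with soft labels into 2n rows with hard labels and weights."""
    X = np.asarray(X)
    y_soft = np.asarray(y_soft, dtype=float)
    n = len(y_soft)
    X_expanded = np.vstack([X, X])                # each row appears twice
    y_hard = np.concatenate([np.ones(n, dtype=int),
                             np.zeros(n, dtype=int)])
    weights = np.concatenate([y_soft, 1.0 - y_soft])  # weight y for class 1, 1 - y for class 0
    return X_expanded, y_hard, weights

rng = np.random.default_rng(0)
train_data = rng.random((20, 10))
train_label = rng.random(20)

X2, y2, w2 = expand_soft_labels(train_data, train_label)
print(X2.shape, y2.shape, w2.shape)

# Untested usage sketch with the sklearn wrapper:
# from xgboost import XGBClassifier
# model = XGBClassifier(max_depth=2, learning_rate=1, objective='binary:logistic')
# model.fit(X2, y2, sample_weight=w2)
```

The price is a training set twice as large, and any evaluation set would need the same treatment for its metrics to stay comparable.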

huangapple
  • Posted on July 10, 2023, 14:34:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76651189.html