Using smoothed labels from 0 to 1 to train an XGB classifier

Question

  1. The first code example does take the smoothed labels into account during training; it does not internally convert the real values to 0 or 1. That code uses the XGBoost native interface (xgb.train), and train_label and test_label hold float values between 0 and 1. With the binary:logistic objective, the native interface accepts any label in [0, 1] and uses it as given when computing the loss.

  2. XGBClassifier, the scikit-learn wrapper, does not work with smoothed labels because it treats y as discrete class labels (0 or 1 for binary classification). The error you encountered shows that it inferred its classes from the unique values in train_label and rejected them. To use XGBClassifier, you would first have to convert the labels to binary, typically with a threshold (e.g., 0.5): values at or above the threshold become class 1 and the rest become class 0. That step discards the information in the smoothed labels, so the native interface, as in your first code example, is the more direct way to train on them.
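To illustrate point 1, here is a minimal sketch of the standard logistic (log) loss derivatives with respect to the raw score. It is plain Python, not XGBoost's source code, and the function names are illustrative; the assumption is that binary:logistic follows this textbook form, where the gradient is p - y and the Hessian is p * (1 - p). A fractional label changes the gradient directly, so it cannot be equivalent to rounding the label first.

```python
import math

def sigmoid(z):
    """Map a raw score (margin) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logloss_grad_hess(margin, label):
    """Gradient and Hessian of the log loss w.r.t. the raw score.

    The label may be any value in [0, 1], not just 0 or 1.
    """
    p = sigmoid(margin)
    grad = p - label          # depends on the fractional label
    hess = p * (1.0 - p)      # independent of the label
    return grad, hess

# Same raw score, three different labels: only the gradient changes.
for y in (0.0, 0.3, 1.0):
    g, h = logloss_grad_hess(0.0, y)
    print(f"label={y:.1f} grad={g:+.2f} hess={h:.2f}")
```

A label of 0.3 pulls the prediction toward 0.3, whereas a label rounded to 0 would pull it toward 0 with a larger gradient.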

I want to train an XGB classifier using smoothed labels between 0 and 1 instead of binary labels.

The native XGBoost interface seems to accept smoothed labels for a binary classifier.

import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier  # used in the second example below

# Random features and random float labels in [0, 1)
train_data = np.random.rand(20, 10)
train_label = np.random.random(20)
dtrain = xgb.DMatrix(train_data, label=train_label)

test_data = np.random.rand(20, 10)
test_label = np.random.random(20)
dtest = xgb.DMatrix(test_data, label=test_label)

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
evallist = [(dtrain, 'train'), (dtest, 'eval')]

# Train with the native interface and report AUC on both sets each round
bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=10, evals=evallist)

Output:

[0]	train-auc:0.68952	eval-auc:0.53327
[1]	train-auc:0.74847	eval-auc:0.49597
[2]	train-auc:0.79158	eval-auc:0.45795
...

However, when I tried to use the sklearn wrapper XGBClassifier, I got the following error.


model = XGBClassifier(**param)
model.fit(train_data, train_label)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_12603/1675654556.py in <cell line: 1>()
----> 1 model.fit(train_data, train_label)

~/.pyenv/versions/btc-p2p/lib/python3.9/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
    618             for k, arg in zip(sig.parameters, args):
    619                 kwargs[k] = arg
--> 620             return func(**kwargs)
    621 
    622         return inner_f

~/.pyenv/versions/btc-p2p/lib/python3.9/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1464                 or not (self.classes_ == expected_classes).all()
   1465             ):
-> 1466                 raise ValueError(
   1467                     f"Invalid classes inferred from unique values of `y`.  "
   1468                     f"Expected: {expected_classes}, got {self.classes_}"

ValueError: Invalid classes inferred from unique values...

I have two questions:

  1. Does the first code example actually take the smoothed labels into
    account during training, or does it just internally convert the real
    values to 0 or 1?
  2. Why doesn't XGBClassifier work with smoothed labels?
    Is it possible to get it to work?

Answer 1

Score: 1

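Answer 1: In the first code example, train_label and test_label are randomly generated floats between 0 and 1, so they are not the result of any smoothing step in the code. With the binary:logistic objective, XGBoost does not round these labels to 0 or 1: the sigmoid is applied to the model's raw score to produce a predicted probability, and the log loss is computed against the fractional label as given.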

Answer 2: XGBClassifier doesn't work with smoothed labels because it expects discrete class labels (0 and 1 for binary classification).
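The traceback in the question hints at how the wrapper arrives at this error: it appears to derive the class set from the unique values of y and compare it with the integers 0 to n_classes - 1. The following is a rough NumPy reconstruction of that check, inferred from the traceback only; it is not XGBoost's actual source, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
train_label = rng.random(20)   # smoothed labels in [0, 1)

# Assumed logic, inferred from the traceback: classes come from unique y values
classes = np.unique(train_label)
expected_classes = np.arange(len(classes))   # 0, 1, ..., n_classes - 1

print(len(classes))                               # every float counts as a class
print(np.array_equal(classes, expected_classes))  # False, hence the ValueError

# Hard 0/1 labels pass the same comparison
hard_label = (train_label >= 0.5).astype(int)
hard_classes = np.unique(hard_label)
print(np.array_equal(hard_classes, np.arange(len(hard_classes))))
```

Each distinct float is treated as its own class, so a vector of smoothed labels can never match the expected integer classes.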

To convert smoothed labels into binary labels, you can pre-process them with a threshold value. Note that this discards the information carried by the fractional values.

Smoothed to binary

from xgboost import XGBClassifier
import numpy as np
import xgboost as xgb

train_data = np.random.rand(20, 10)
train_label = np.random.random(20)
train_label_binary = np.where(train_label >= 0.5, 1, 0)  # Apply threshold to convert smoothed labels to binary labels
dtrain = xgb.DMatrix(train_data, label=train_label_binary)

test_data = np.random.rand(20, 10)
test_label = np.random.random(20)
test_label_binary = np.where(test_label >= 0.5, 1, 0)  # Apply threshold to convert smoothed labels to binary labels
dtest = xgb.DMatrix(test_data, label=test_label_binary)

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
evallist = [(dtrain, 'train'), (dtest, 'eval')]

bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=10, evals=evallist)

Output:

[0]	train-auc:0.80500	eval-auc:0.51000
[1]	train-auc:0.93500	eval-auc:0.61500
[2]	train-auc:0.95000	eval-auc:0.67500
[3]	train-auc:1.00000	eval-auc:0.58000
[4]	train-auc:1.00000	eval-auc:0.57500
[5]	train-auc:1.00000	eval-auc:0.57500
[6]	train-auc:1.00000	eval-auc:0.57500
[7]	train-auc:1.00000	eval-auc:0.61500
[8]	train-auc:1.00000	eval-auc:0.60000
[9]	train-auc:1.00000	eval-auc:0.62000

huangapple
  • Posted on July 10, 2023 at 14:34:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76651189.html