Почему вызов fit сбрасывает пользовательскую целевую функцию в XGBClassifier? - PullRequest
4 голосов
/ 06 апреля 2020

Я попытался настроить XGBoost sklearn API XGBClassifier для использования пользовательской целевой функции (brier) в соответствии с документацией:

    .. note::  Custom objective function

        A custom objective function can be provided for the ``objective``
        parameter. In this case, it should have the signature
        ``objective(y_true, y_pred) -> grad, hess``:

        y_true: array_like of shape [n_samples]
            The target values
        y_pred: array_like of shape [n_samples]
            The predicted values

        grad: array_like of shape [n_samples]
            The value of the gradient for each sample point.
        hess: array_like of shape [n_samples]
            The value of the second derivative for each sample point

Вот моя попытка:

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
train_data = load_svmlight_file('~/agaricus.txt.train')
X = train_data[0].toarray()
y = train_data[1]

def brier(y_true, y_pred):
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    grad = 2 * y_pred * (y_true - y_pred) * (y_pred - 1)
    hess = 2 * y_pred ** (1 - y_pred) * (2 * y_pred * (y_true + 1) - y_true - 3 * y_pred ** 2)
    return grad, hess

m = XGBClassifier(objective=brier, seed=42)

Это, по-видимому, приводит к правильному объекту:

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective=<function brier at 0x7fe7ac418290>, random_state=None,
              reg_alpha=None, reg_lambda=None, scale_pos_weight=None, seed=42,
              subsample=None, tree_method=None, validate_parameters=False,
              verbosity=None)

Однако, вызов метода .fit, похоже, сбрасывает m объект к настройке по умолчанию:

m.fit(X, y)
m
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=42, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)

с objective='binary:logistic'. Я заметил, что во время расследования, почему у меня ухудшается оценка Бриера при оптимизации непосредственно для brier, чем при использовании по умолчанию binary:logistic, как описано здесь .

Итак, как мне правильно настроить XGBClassifier для использования моей функции brier в качестве пользовательской цели?

1 Ответ

2 голосов
/ 12 апреля 2020

Я полагаю, что вы ошибаетесь с target функцией target (в качестве параметра используется obj), документация по xgboost иногда довольно запутанная.

Короче говоря, вам просто нужно исправить это:

m = XGBClassifier(obj=brier, seed=42)

Немного глубже, цель в том, как xgboost оптимизирует данную целевую функцию. Обычно xgboost выводит оптимизацию из числа классов в вашем y-векторе.

Я взял фрагмент из исходного кода , как вы можете видеть, когда у вас есть только два класса, цель устанавливается в двоичный: logisti c:

class XGBClassifier(XGBModel, XGBClassifierBase):
    def __init__(self, objective="binary:logistic", **kwargs):
        super().__init__(objective=objective, **kwargs)

    def fit(self, X, y, sample_weight=None, base_margin=None,
            eval_set=None, eval_metric=None,
            early_stopping_rounds=None, verbose=True, xgb_model=None,
            sample_weight_eval_set=None, callbacks=None):

        evals_result = {}
        self.classes_ = np.unique(y)
        self.n_classes_ = len(self.classes_)

        xgb_options = self.get_xgb_params() # <-- obj function is set here

        if callable(self.objective):
            obj = _objective_decorator(self.objective) # <----- here is the mismatch of the names, if you pass objective as your brie func it will become "binary:logistic"
            xgb_options["objective"] = "binary:logistic"
        else:
            obj = None

        if self.n_classes_ > 2:
            xgb_options['objective'] = 'multi:softprob' # <----- objective is being set here if n_classes> 2
            xgb_options['num_class'] = self.n_classes_

+-- 35 lines: feval = eval_metric if callable(eval_metric) else None-----------------------------------------------------------------------------------------------------------------------------------------------------

        self._Booster = train(xgb_options, train_dmatrix, # <----- objective is being passed in xgb_options dictionary
                              self.get_num_boosting_rounds(),
                              evals=evals,
                              early_stopping_rounds=early_stopping_rounds,
                              evals_result=evals_result, obj=obj, feval=feval, # <----- obj function is being passed to lower level api here
                              verbose_eval=verbose, xgb_model=xgb_model,
                              callbacks=callbacks)

+-- 12 lines: self.objective = xgb_options["objective"]------------------------------------------------------------------------------------------------------------------------------------------------------------------

        return self

Существует фиксированный список целей списки целей, которые вы можете установить:

цель [по умолчанию = reg: squarederror]

reg:squarederror: regression with squared loss.

reg:squaredlogerror: regression with squared log loss 12[???(????+1)−???(?????+1)]2. All input labels are required to be greater than -1. Also, see metric rmsle for possible issue with this objective.

reg:logistic: logistic regression

binary:logistic: logistic regression for binary classification, output probability

binary:logitraw: logistic regression for binary classification, output score before logistic transformation

binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.

count:poisson –poisson regression for count data, output mean of poisson distribution

max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)

survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).

multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)

multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.

rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized

rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized

rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized

reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.

reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.

Просто подтвердить, что цель может это не ваша функция br ie, вручную устанавливая цель, чтобы она была вашей функцией br ie внутри исходного кода прямо перед вызовом нижнего уровня api

class XGBClassifier(XGBModel, XGBClassifierBase):
    def __init__(self, objective="binary:logistic", **kwargs):
        super().__init__(objective=objective, **kwargs)

    def fit(self, X, y, sample_weight=None, base_margin=None,
            eval_set=None, eval_metric=None,
            early_stopping_rounds=None, verbose=True, xgb_model=None,
            sample_weight_eval_set=None, callbacks=None):

+-- 54 lines: evals_result = {}--------------------------------------------------------------------
        xgb_options["objective"] = xgb_options["obj"]
        self._Booster = train(xgb_options, train_dmatrix,
                              self.get_num_boosting_rounds(),
                              evals=evals,
                              early_stopping_rounds=early_stopping_rounds,
                              evals_result=evals_result, obj=obj, feval=feval,
                              verbose_eval=verbose, xgb_model=xgb_model,
                              callbacks=callbacks)

+-- 14 lines: self.objective = xgb_options["objective"]--------------------------------------------

Выдает эту ошибку:

    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [10:09:53] /private/var/folders/z5/mchb9bz51cx3h97nkw9v0wkr0000gn/T/pip-install-kh801rm0/xgboost/xgboost/src/objective/objective.cc:26: Unknown objective function: `<function brier at 0x10b630d08>`
Objective candidate: binary:hinge
Objective candidate: multi:softmax
Objective candidate: multi:softprob
Objective candidate: rank:pairwise
Objective candidate: rank:ndcg
Objective candidate: rank:map
Objective candidate: reg:squarederror
Objective candidate: reg:squaredlogerror
Objective candidate: reg:logistic
Objective candidate: binary:logistic
Objective candidate: binary:logitraw
Objective candidate: reg:linear
Objective candidate: count:poisson
Objective candidate: survival:cox
Objective candidate: reg:gamma
Objective candidate: reg:tweedie

...