LightGBM vs sklearn's LightGBM - implementation bug - exactly the same parameters give different results
0 votes
17 October 2019

When I pass exactly the same parameters to LightGBM and to sklearn's implementation of LightGBM, I get different results. Originally I was getting identical results from both, but after making some changes to my code I can no longer work out why they don't match: the performance metrics and the feature importances now come out different. Please help me find the mistake I am making. It could be an error either in how I use LightGBM via the native library or in the sklearn implementation. Link explaining why the results should be identical: light gbm - Python API vs Scikit-learn API
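One way to sanity-check that both APIs really receive equivalent parameters is to print the two parameter sets side by side once both models from the code below are built; this is only a minimal sketch (the alias list reflects the LightGBM documentation and may not be exhaustive):

# Minimal sketch: compare the dict handed to lgb.train with what the sklearn wrapper reports.
# `best` and `sk_best_gbm` refer to the objects constructed in the code below.
import pprint

pprint.pprint(best)                      # parameters passed to lgb.train
pprint.pprint(sk_best_gbm.get_params())  # parameters held by LGBMClassifier

# Aliases that differ between the two APIs and have to be matched up by hand:
#   sklearn name      native LightGBM name
#   min_split_gain -> min_gain_to_split
#   n_estimators   -> num_iterations / num_boost_round
# Any parameter missing from either dict (e.g. learning_rate or objective)
# silently falls back to that API's default value.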

import csv
import lightgbm as lgb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, f1_score, accuracy_score,
                             precision_score, recall_score)

x_train, x_test, y_train, y_test = train_test_split(df_dummy[df_merge.columns], labels, test_size=0.25, random_state=42)

n_folds = 5

lgb_train = lgb.Dataset(x_train, y_train)

def objective(params, n_folds = n_folds):
    """Objective function for Gradient Boosting Machine Hyperparameter Tuning"""

    print(params)

    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])

    params['min_child_samples'] = int(params['min_child_samples'])
    params['subsample_freq'] = int(params['subsample_freq'])

    # Perform n_fold cross validation with hyperparameters

    # Use early stopping and evaluate based on ROC AUC
    cv_results = lgb.cv(params, lgb_train, nfold=n_folds, num_boost_round=10000, 
                        early_stopping_rounds=100, metrics='auc')

    # Extract the best score
    best_score = max(cv_results['auc-mean'])

    # Loss must be minimized
    loss = 1 - best_score
    num_iteration = int(np.argmax(cv_results['auc-mean']) + 1)

    # Append this trial's result to the results CSV
    with open(out_file, 'a') as of_connection:
        writer = csv.writer(of_connection)
        writer.writerow([loss, params, num_iteration])

    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK, 'estimators': num_iteration}

space = {
    'min_child_samples': hp.quniform('min_child_samples', 5, 100, 5), 
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'max_depth' : hp.quniform('max_depth', 3, 10, 1),
    'subsample' : hp.quniform('subsample', 0.6, 1, 0.05),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),  
    'subsample_freq': hp.quniform('subsample_freq',0,10,1),
    'min_gain_to_split': hp.quniform('min_gain_to_split', 0.01, 0.1, 0.01),
    'learning_rate' : 0.05,
    'objective' : 'binary',

}

out_file = 'results/gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'estimators'])
of_connection.close()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=10)
bayes_trials_results = sorted(trials.results, key = lambda x: x['loss'])

results = pd.read_csv('results/gbm_trials.csv')

# Sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending = True, inplace = True)
results.reset_index(inplace = True, drop = True)
results.head()
best_bayes_estimators = int(results.loc[0, 'estimators'])

best['max_depth'] = int(best['max_depth'])
best['num_leaves'] = int(best['num_leaves'])

best['min_child_samples'] = int(best['min_child_samples'])

num_boost_round=int(best_bayes_estimators * 1.1)
best['objective'] = 'binary'
best['boosting_type'] = 'gbdt'

best['subsample_freq'] = int(best['subsample_freq'])

# Native LightGBM API (lgb.train)

best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

print('Plotting feature importances...')
ax = lgb.plot_importance(best_gbm, max_num_features=15)
plt.show()

feature_imp = pd.DataFrame()
feature_imp["feature"] = list(x_train.columns)
feature_imp["importance_gain"] = best_gbm.feature_importance(importance_type='gain')
feature_imp["importance_split"] = best_gbm.feature_importance(importance_type='split')
feature_imp.to_clipboard()

y_pred_score = best_gbm.predict(x_test)

roc_auc_score_list = []
f1_score_list = []
accuracy_score_list = []
precision_score_list = []
recall_score_list = []

thresholds = [0.4,0.5,0.6,0.7]
for threshold in thresholds:
    print("threshold is {}".format(threshold))
    y_pred = np.where(y_pred_score>=threshold, 1, 0)
    print(roc_auc_score(y_test,y_pred_score))
    print(f1_score(y_test,y_pred))
    print(accuracy_score(y_test,y_pred))
    print(precision_score(y_test,y_pred))
    print(recall_score(y_test,y_pred))

    roc_auc_score_list.append(roc_auc_score(y_test,y_pred_score))
    f1_score_list.append(f1_score(y_test,y_pred))
    accuracy_score_list.append(accuracy_score(y_test,y_pred))
    precision_score_list.append(precision_score(y_test,y_pred))
    recall_score_list.append(recall_score(y_test,y_pred))

performance_metrics = pd.DataFrame(
        {'thresholds':thresholds,
         'roc_auc_score':roc_auc_score_list,
         'f1_score':f1_score_list,
         'accuracy_score':accuracy_score_list,
         'precision_score':precision_score_list,
         'recall_score':recall_score_list })

performance_metrics.transpose().to_clipboard()

# Sklearn's implementation of LightGBM (LGBMClassifier)

best_sk = dict(best)
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round,
                                 learning_rate=0.05,
                                 min_split_gain=best['min_gain_to_split'])
sk_best_gbm.fit(x_train, y_train)

sk_best_gbm.get_params()

print('Plotting feature importances...')
ax = lgb.plot_importance(sk_best_gbm, max_num_features=15)
plt.show()

feature_imp = pd.DataFrame()
feature_imp["feature"] = list(x_train.columns)
feature_imp["Importance"] = sk_best_gbm.feature_importances_
feature_imp.to_clipboard()

y_pred_score = sk_best_gbm.predict_proba(x_test)[:,1]

roc_auc_score_list = []
f1_score_list = []
accuracy_score_list = []
precision_score_list = []
recall_score_list = []

thresholds = [0.4,0.5,0.6,0.7]
for threshold in thresholds:
    print("threshold is {}".format(threshold))
    y_pred = np.where(y_pred_score>=threshold, 1, 0)
    print(roc_auc_score(y_test,y_pred_score))
    print(f1_score(y_test,y_pred))
    print(accuracy_score(y_test,y_pred))
    print(precision_score(y_test,y_pred))
    print(recall_score(y_test,y_pred))

    roc_auc_score_list.append(roc_auc_score(y_test,y_pred_score))
    f1_score_list.append(f1_score(y_test,y_pred))
    accuracy_score_list.append(accuracy_score(y_test,y_pred))
    precision_score_list.append(precision_score(y_test,y_pred))
    recall_score_list.append(recall_score(y_test,y_pred))

performance_metrics = pd.DataFrame(
        {'thresholds':thresholds,
         'roc_auc_score':roc_auc_score_list,
         'f1_score':f1_score_list,
         'accuracy_score':accuracy_score_list,
         'precision_score':precision_score_list,
         'recall_score':recall_score_list })

performance_metrics.transpose().to_clipboard()