XGBoost raises KeyError: 'best_msg'
19 May 2018

I am trying to get my code to work. It used to run without errors until I changed some of my data, and now it produces no usable output at all. The predictor seems to be predicting NaN values, which I find strange because none of the input values is NaN. The error occurs when I run xgb.train on a sample of 5,000 rows from the data set (which has more than 300,000 observations). When I run it on a smaller sample of the data set, the error does not occur.

The code I ran:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

Statadata = pd.read_stata('figtemp.dta')
Statadata = Statadata.drop(Statadata[(Statadata['periodf'] == 3) | (Statadata['periodf'] == 4)].index)
Statadata = Statadata.drop(Statadata[(Statadata['periods'] == 3) | (Statadata['periods'] == 4)].index)
Statadata.drop(Statadata[Statadata['zcstscoreela'].isnull()].index, inplace=True)
Statadata.drop(Statadata[Statadata['zcstscoremath'].isnull()].index, inplace=True)


eng = Statadata[Statadata['department']=='english']
eng = eng.drop(eng[eng['zcstscoreelaprior'].isnull()].index)

math = Statadata[Statadata['department']=='math']
math = math.drop(math[math['zcstscoremathprior'].isnull()].index)

y_en_gpa = eng['gpatotal']
y_en_cst = eng['zcstscoreela']
X_en = eng.copy()
del X_en['gpatotal']
del X_en['zcstscoremath']
del X_en['zcstscoreela']
del X_en['pareduccode']
del X_en['cstscoreela']
del X_en['cstscoremath']


y_math_gpa = math['gpatotal']
y_math_cst = math['zcstscoremath']
X_math = math.copy()
del X_math['gpatotal']
del X_math['zcstscoremath']
del X_math['zcstscoreela']
del X_math['pareduccode']
del X_math['cstscoreela']
del X_math['cstscoremath']

# english:
# deleting the columns and rows with missing values:

missing_en=X_en.isnull().sum()
missingbool_en=missing_en<25
selected_en=X_en.columns[missingbool_en]
selected_en=X_en[selected_en]
selected_en=selected_en.dropna(axis=0)
y_en_cst=y_en_cst[selected_en.index]
y_en_gpa=y_en_gpa[selected_en.index]

# math:
# deleting the columns and rows with missing values:
missing_math=X_math.isnull().sum()
missingbool_math=missing_math<25
selected_math=X_math.columns[missingbool_math]
selected_math=X_math[selected_math]
selected_math=selected_math.dropna(axis=0)
y_math_cst=y_math_cst[selected_math.index]
y_math_gpa=y_math_gpa[selected_math.index]

columns_to_overwrite = ['department', 'crsnamef', 'markf', 'crsnames', 'marks', 'cstlevelela', 'cstlevelmath', 'status', 'grade', 'gpaavg']
columns_to_overwrite2 = [ 'markf', 'crsnames', 'marks', 'cstlevelela', 'cstlevelmath', 'status', 'grade']

# english:
# Creating the dummy variables for the categorical string variables
new_en=pd.get_dummies(selected_en['crsnamef'])
for i in columns_to_overwrite2:
    nieuw_en=pd.get_dummies(selected_en[i])
    new_en=new_en.merge(nieuw_en, left_index=True, right_index=True, suffixes=['_1','_2'])

selected_en=selected_en.drop(labels=columns_to_overwrite, axis="columns")
selected_en=new_en.merge(selected_en,left_index=True, right_index=True)

# math:
# Creating the dummy variables for the categorical string variables
new_math=pd.get_dummies(selected_math['crsnamef'])
for i in columns_to_overwrite2:
    nieuw_math=pd.get_dummies(selected_math[i])
    new_math=new_math.merge(nieuw_math, left_index=True, right_index=True, suffixes=['_1','_2'])

selected_math=selected_math.drop(labels=columns_to_overwrite, axis="columns")
selected_math=new_math.merge(selected_math,left_index=True, right_index=True)

X_train_math_gpa, X_test_math_gpa, y_train_math_gpa, y_test_math_gpa = train_test_split(selected_math, y_math_gpa, random_state=4)
X_train_math_cst, X_test_math_cst, y_train_math_cst, y_test_math_cst = train_test_split(selected_math, y_math_cst, random_state=4)

paramstest2 = {
    'max_depth': 8,
    'min_child_weight': 3,
    'gamma': 0.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
}
data_train = xgb.DMatrix(X_train_math_gpa, label=y_train_math_gpa)
data_test = xgb.DMatrix(X_test_math_gpa, label=y_test_math_gpa)

model=xgb.train(paramstest2, data_train, 5000, evals=[(data_test, "test")], verbose_eval=100, early_stopping_rounds=50)
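Since the first evaluation line in the log reports test-rmse:nan, it may be worth confirming that neither the features nor the labels passed to xgb.DMatrix contain NaN or infinite values. The code above drops rows with missing zcstscoreela, zcstscoremath and the prior scores, and drops missing values from the feature frame, but it never checks gpatotal, which is the label used here. The sketch below is only a diagnostic aid: the helper name count_non_finite and the toy data are made up for illustration, and in the real script the arguments would be X_test_math_gpa and y_test_math_gpa (or the training pair).

```python
import numpy as np
import pandas as pd


def count_non_finite(features: pd.DataFrame, labels: pd.Series) -> dict:
    """Count NaN and +/-inf entries in the numeric feature columns and in the labels.

    Hypothetical helper for illustration; not part of pandas or xgboost.
    """
    numeric = features.select_dtypes(include=[np.number]).to_numpy(dtype=float)
    label_values = labels.to_numpy(dtype=float)
    return {
        "feature_nan": int(np.isnan(numeric).sum()),
        "feature_inf": int(np.isinf(numeric).sum()),
        "label_nan": int(np.isnan(label_values).sum()),
        "label_inf": int(np.isinf(label_values).sum()),
    }


# Toy data standing in for X_test_math_gpa / y_test_math_gpa.
toy_X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, np.inf, 1.5]})
toy_y = pd.Series([3.2, np.nan, 2.8])

report = count_non_finite(toy_X, toy_y)
print(report)
```

If any of the four counts is non-zero for the real data, that would be a plausible source of the nan metric, although this check alone does not prove it.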

The error I get:

[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 6 pruned nodes, max_depth=6
[0] test-rmse:nan
Will train until test-rmse hasn't improved in 50 rounds.
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 66 extra nodes, 2 pruned nodes, max_depth=8
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 36 extra nodes, 46 pruned nodes, max_depth=8
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 44 pruned nodes, max_depth=6
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 92 pruned nodes, max_depth=7
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 80 pruned nodes, max_depth=7
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 50 pruned nodes, max_depth=4
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 92 pruned nodes, max_depth=5
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 102 pruned nodes, max_depth=5
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 112 pruned nodes, max_depth=5
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 

...

[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 170 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 206 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 160 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 178 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 142 pruned nodes, max_depth=3
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 154 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 188 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 150 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 160 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 166 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 182 pruned nodes, max_depth=0
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/catlinbruys/PycharmProjects/Bachelor_Thesis/venv/lib/python3.6/site-packages/xgboost/training.py", line 204, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/Users/catlinbruys/PycharmProjects/Bachelor_Thesis/venv/lib/python3.6/site-packages/xgboost/training.py", line 99, in _train_internal
    evaluation_result_list=evaluation_result_list))
  File "/Users/catlinbruys/PycharmProjects/Bachelor_Thesis/venv/lib/python3.6/site-packages/xgboost/callback.py", line 247, in callback
    best_msg = state['best_msg']
KeyError: 'best_msg'

What can I do to solve this problem? I really need a solution, because this is for a very important project. Thanks.
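One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the failing line looks up state['best_msg'] inside the early-stopping callback, and that key would be missing if no evaluation round was ever recorded as an improvement. Any comparison with NaN is False, so a metric that is nan from round 0 onward would never count as an improvement. The code below is a simplified model of such a rule in plain Python. The function names and the structure are assumptions made for illustration; only the 'best_msg' key and the 50-round patience are taken from the post, and this is not a copy of xgboost/callback.py.

```python
import math


def early_stopping_step(state: dict, iteration: int, score: float,
                        stopping_rounds: int) -> bool:
    """One step of a simplified 'lower is better' early-stopping rule.

    Returns True when training should stop. Hypothetical model, not xgboost code.
    """
    if not state:
        state["best_score"] = math.inf
        state["best_iteration"] = 0
    if score < state["best_score"]:  # always False when score is NaN
        state["best_score"] = score
        state["best_iteration"] = iteration
        state["best_msg"] = f"[{iteration}] test-rmse:{score}"
        return False
    if iteration - state["best_iteration"] >= stopping_rounds:
        # Raises KeyError if no improvement was ever recorded.
        _best_msg = state["best_msg"]
        return True
    return False


def run(scores, stopping_rounds: int) -> dict:
    """Feed a sequence of metric values through the rule and return the final state."""
    state: dict = {}
    for iteration, score in enumerate(scores):
        if early_stopping_step(state, iteration, score, stopping_rounds):
            break
    return state


# A metric that is NaN in every round, as in the log above.
try:
    run([float("nan")] * 60, stopping_rounds=50)
    outcome = "no error"
except KeyError as exc:
    outcome = f"KeyError: {exc}"
print(outcome)

# A metric that improves once and then stalls stops cleanly.
healthy_state = run([1.0, 0.9] + [0.95] * 60, stopping_rounds=50)
```

If this model matches what happens in the installed xgboost version, the KeyError is a symptom and the nan metric is the thing to track down first.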

...