After PCA decomposition, all classifiers give me the same accuracy
0 votes
/ November 27, 2018

I am running some machine learning code, and part of it looks like this:

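# Imports needed by this snippet (features, loan_status and train_predict
# are defined elsewhere in the asker's code and are not shown).
from sklearn import decomposition
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Display names, reconstructed from the labels printed in the output below.
names = ["XGBoost", "Decision Tree", "Random Forest", "Neural Net",
         "AdaBoost", "Naive Bayes", "QDA"]

classifiers = [XGBClassifier(),
               DecisionTreeClassifier(max_depth=5),
               RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
               MLPClassifier(alpha=1),
               AdaBoostClassifier(),
               GaussianNB(),
               QuadraticDiscriminantAnalysis()]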

print("Original data")
print("=============")
print(features.shape)
for name, clf in zip(names, classifiers):
    print(name)
    X_train, X_test, y_train, y_test = train_test_split(features, loan_status, test_size = 0.2, random_state = 0)
    result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
    print(result)
    print('-----------------------------------')

print("PCA data")
print("=============")
for pca_comp in range(1,6):
    print("PCA component size: " + str(pca_comp))
    pca = decomposition.PCA(n_components=pca_comp)
    pca.fit(features)
    features_pca = pca.transform(features)
    for name, clf in zip(names, classifiers):
        X_train, X_test, y_train, y_test = train_test_split(features_pca, loan_status, test_size = 0.2, random_state = 0)
        result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
        print(result)
        print('-----------------------------------')

In essence, I loop over several classifiers and print their results. Then I loop over different n_components values for the PCA decomposition and run all the classifiers again.

I found that as soon as I start applying PCA, the accuracy (acc_test and acc_train) stays the same no matter which classifier I use or which n_components value I choose.

Here is the output of this part of the code. Note that after running PCA, 'acc_test' is always 0.8079021551332182.

Unfortunately, I cannot share the data. But I am looking for anything obviously wrong in my code.

Thanks

Original data
=============
(769790, 207)
XGBoost
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 273.7087504863739, 'pred_time': 4.388766288757324, 'acc_train': 0.848625923953286, 'acc_test': 0.8481793735953962, 'f_train': 0.877928251001055, 'f_test': 0.8775348027423189}
-----------------------------------
Decision Tree
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 11.388459920883179, 'pred_time': 0.38187479972839355, 'acc_train': 0.8347195338988556, 'acc_test': 0.8338183140856598, 'f_train': 0.8735138626721308, 'f_test': 0.8728762797972536}
-----------------------------------
Random Forest
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 1.3620502948760986, 'pred_time': 0.8454875946044922, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
Neural Net
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 130.09251832962036, 'pred_time': 8.788004636764526, 'acc_train': 0.810022863378324, 'acc_test': 0.8106106860312553, 'f_train': 0.8429408284567822, 'f_test': 0.84336348394109}
-----------------------------------
AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 114.49720454216003, 'pred_time': 6.846264839172363, 'acc_train': 0.8319898933475364, 'acc_test': 0.830836981514439, 'f_train': 0.8676524880554248, 'f_test': 0.866917350579005}
-----------------------------------
Naive Bayes
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 2.338545322418213, 'pred_time': 2.913602828979492, 'acc_train': 0.696707868379688, 'acc_test': 0.6979565855622962, 'f_train': 0.8374139063372146, 'f_test': 0.8381986507744102}
-----------------------------------
QDA
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 17.64940857887268, 'pred_time': 6.382497072219849, 'acc_train': 0.5545554631782694, 'acc_test': 0.5551124332610192, 'f_train': 0.7616845459479327, 'f_test': 0.7619965387905216}
-----------------------------------
PCA data
=============
PCA component size: 1
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 12.907331943511963, 'pred_time': 2.0308330059051514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.6030781269073486, 'pred_time': 0.03420734405517578, 'acc_train': 0.8074718429701607, 'acc_test': 0.8079021551332182, 'f_train': 0.8398076830188118, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.2026519775390625, 'pred_time': 0.5144689083099365, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.960830450057983, 'pred_time': 0.7337024211883545, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 9.310431957244873, 'pred_time': 2.949209451675415, 'acc_train': 0.807460476233778, 'acc_test': 0.8078956598552852, 'f_train': 0.8398003208188749, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.028026819229125977, 'pred_time': 0.019958019256591797, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.039576053619384766, 'pred_time': 0.021703481674194336, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 2
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 17.529640436172485, 'pred_time': 2.1811327934265137, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.9235944747924805, 'pred_time': 0.03514695167541504, 'acc_train': 0.8074588524142948, 'acc_test': 0.8079021551332182, 'f_train': 0.8397974448899658, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.8425581455230713, 'pred_time': 0.519752025604248, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 17.796229362487793, 'pred_time': 1.4105899333953857, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 14.433330059051514, 'pred_time': 2.9874980449676514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09282994270324707, 'pred_time': 0.06884241104125977, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.06534266471862793, 'pred_time': 0.06316208839416504, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 3
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 22.586288690567017, 'pred_time': 2.132150650024414, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.3756062984466553, 'pred_time': 0.0391697883605957, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.6991543769836426, 'pred_time': 0.5463252067565918, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.745409488677979, 'pred_time': 1.617872714996338, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 18.745909929275513, 'pred_time': 3.02945613861084, 'acc_train': 0.8074539809558451, 'acc_test': 0.8078956598552852, 'f_train': 0.8397946213935711, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09948086738586426, 'pred_time': 0.07936644554138184, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.07803058624267578, 'pred_time': 0.07502388954162598, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 4
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 28.096595287322998, 'pred_time': 2.079728364944458, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.9280765056610107, 'pred_time': 0.04021263122558594, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.067602872848511, 'pred_time': 0.5436885356903076, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 18.260048389434814, 'pred_time': 2.397339344024658, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 24.486289501190186, 'pred_time': 3.059351921081543, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.10924768447875977, 'pred_time': 0.08964681625366211, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.09738326072692871, 'pred_time': 0.08622312545776367, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------

1 Answer

0 votes
/ November 28, 2018

I don't see anything obviously wrong in your code.

A couple of thoughts:

I would expect the classifiers to become increasingly similar as you lower n_components toward 1, but not identical, as you are observing.

You are only trying range(1,6), i.e. 1 to 5 PCA components. Make sure the classifiers are being trained properly by looping over, say, (1,10,20,30,100) components. If the classifiers still perform identically, then something is going wrong.
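The wider sweep suggested above could be sketched roughly as follows. This is only an illustration on synthetic data from make_classification, since the real data was not shared. Two choices here are assumptions and not part of the original code: only two of the question's classifiers are used, to keep the run short, and PCA is wrapped in a pipeline so that it is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data, which the question could not share.
X, y = make_classification(n_samples=2000, n_features=120,
                           n_informative=15, random_state=0)

# Split once, before PCA, so the projection is learned from training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

component_counts = (1, 10, 20, 30, 100)
results = {}
for n_comp in component_counts:
    for clf in (DecisionTreeClassifier(max_depth=5, random_state=0), GaussianNB()):
        # PCA and the classifier are fitted together on the training split.
        model = make_pipeline(PCA(n_components=n_comp), clf)
        model.fit(X_train, y_train)
        results[(n_comp, type(clf).__name__)] = model.score(X_test, y_test)

for (n_comp, name), acc in results.items():
    print(f"n_components={n_comp:3d}  {name:24s} acc_test={acc:.4f}")
```

If the test accuracies printed by a sweep like this on the real data are still identical across classifiers and component counts, the problem is more likely in the inputs to PCA than in the classifiers.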

Also, take a look and manually verify that nothing strange happens to features during the PCA transform. Just run the same code and look at histograms of the new features; something odd may be going on.
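A minimal sketch of that inspection, assuming synthetic data in place of the real features: it prints the range, the spread and a text histogram of each transformed column using numpy.histogram, which could be swapped for a plotting library if one is available.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the asker's `features` array.
X, _ = make_classification(n_samples=1000, n_features=20, random_state=0)
X_pca = PCA(n_components=3).fit_transform(X)

summaries = []
for i in range(X_pca.shape[1]):
    column = X_pca[:, i]
    counts, edges = np.histogram(column, bins=10)
    summaries.append((column.min(), column.max(), column.std(), counts))
    print(f"PC{i + 1}: min={column.min():.3f} max={column.max():.3f} "
          f"std={column.std():.3f}")
    print("  histogram counts:", counts.tolist())
```

Things worth looking for are a component whose range is many orders of magnitude larger than the others, or a histogram in which nearly all rows fall into a single bin.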

Check the explained variance and make sure the additional components add information: print(pca.explained_variance_ratio_)
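To illustrate what this check can reveal, here is a small sketch on made-up data. It assumes one column has a much larger scale than the rest. The question does not say whether the features were scaled, so this is a hypothetical scenario and not a diagnosis. StandardScaler is used only for comparison and does not appear in the original code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[:, 0] *= 1e4  # hypothetical column on a much larger scale than the others

# PCA on the raw data: the large-scale column dominates the variance.
raw_ratio = PCA(n_components=3).fit(X).explained_variance_ratio_

# PCA after standardising every column to unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

When the first ratio is close to 1 on unscaled data, the leading components mostly reproduce the largest-scale column, and adding further components contributes very little information.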

Given how similar the classifiers are with all 207 features, it may be that they simply see the same thing once you run PCA.

With default parameters (i.e. very simple classifiers), it is possible, but unlikely, that the classifiers will behave identically on 1 to 5 components.

Also make sure you are looping correctly (it looks like you are), and stick with some sanity checks. Good luck!
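One possible sanity check, not spelled out in the answer, is to compare the scores against a majority-class baseline. The sketch below uses made-up labels with roughly 80% positives and scikit-learn's DummyClassifier. The choice of beta=0.5 for the F-score is an assumption, since the asker's train_predict helper is not shown.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = (rng.random(5000) < 0.8).astype(int)  # about 80% positives, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_acc = accuracy_score(y_test, y_pred)
baseline_f = fbeta_score(y_test, y_pred, beta=0.5)  # beta=0.5 is an assumption
positive_share = float(np.mean(y_test == 1))

print(f"share of positives in test set: {positive_share:.4f}")
print(f"baseline accuracy:              {baseline_acc:.4f}")
print(f"baseline F0.5:                  {baseline_f:.4f}")
```

For an always-positive predictor, precision equals the share of positives p and recall is 1, so F0.5 = 1.25p / (0.25p + 1). Plugging in the question's acc_test of 0.8079 gives about 0.8402, which matches the reported f_test. If train_predict does report F0.5, that is consistent with every model predicting the same class for all rows, so comparing 0.8079 with the share of the majority class in loan_status would be a quick thing to try.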

...