Cross-validation function for a multiclass classifier ensemble: too many values to unpack (expected 2)
0 votes
/ July 11, 2020

Link to SampleFile [1]

[1]: https://www.dropbox.com/s/vk0ht1bowdhz85n/StackoverFlow_Example.csv?dl=0

The code below has two parts: the function and the main code that calls it. There are many print statements along the way to help with troubleshooting. I believe the problem is related to the variable "mean_feature_importances". This routine works and compares binary classifiers without problems. I tried to modify it to evaluate multiclass classifiers so that I can compare their performance. It makes sense that it expects only 2 labels, because that is what it was written for, but this model has 5 different labels to choose from. I changed every value that I thought needed to change to accommodate 5 labels instead of 2. Please let me know if I missed something; the problem occurs at the return after print('19').

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron  # linear classifiers
from sklearn.model_selection import StratifiedKFold  # train/test splitting tool for cross-validation
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, \
                             GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC  # used in the classifier list in the calling code
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, auc  # scoring metrics


Here is the function used to process classifier ensemble cross validation
def train_MultiClass_classifier_ensemble_CV(classifiers, X_data, y_data, clf_params=None, cv_splits=10, 
                                 random_state=21, return_trained_classifiers=True, verbose=0, prtParam=0):
    """
    Trains a list of classifiers on the input training data and returns cross-validated accuracy and f1 scores
    as well as feature_importances (where available). The list of trained classifier objects is also returned
    upon request.

    : param classifiers : List of classifier objects; expects each has a scikit-learn wrapper.

    : param X_data : Pandas dataframe containing our training features.

    : param y_data : Pandas dataframe containing our training class labels.

    : param clf_params : (Optional) List of dictionaries containing parameters for each classifier object
                         in the list 'classifiers'. If not provided, the already-initialized parameters of
                         each classifier object will be used.

    : param cv_splits : Integer number of cross-validation splits.

    : param random_state : Seed for reproducibility between executions.

    : param return_trained_classifiers : Boolean; if True, function will also return a list containing the fit classifier objects.

    : param verbose : The amount of status text displayed during execution; 0 for less, 1 for more.

    : return clf_comparison : A pandas dataframe tabulating the cross-validated performance of each classifier.

    : return mean_feature_importances : An array containing the ranked feature importances for each classifier having the feature_importances_ attribute.

    : return trained_classifiers : (if return_trained_classifiers=True) A list of trained classifier objects.

    """
    # initialization
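    # random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
    kfold = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=random_state)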

    train_accuracy_mean = []
    train_accuracy_std = []
    test_accuracy_mean = []
    test_accuracy_std = []
    f1_score_mean = []
    f1_score_std = []
    mean_feature_importances = []
    trained_classifiers = []
    classifier_name = []
    
    if clf_params is None:  # construct using classifier's existing parameter assignment
        clf_params = []
        
        for clf in classifiers:
            #print(clf)
            params = clf.get_params() 
            if 'random_state' in params.keys():  # assign random state / seed
                params['random_state'] = random_state
            elif 'seed' in params.keys():
                params['seed'] = random_state
            clf_params.append(params)
    
    # step through the classifiers for training and scoring with cross-validation
    for clf, params in zip(classifiers, clf_params):
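        # print(clf)
        # print(params)
        # automatically obtain the name of the classifier; get_clf_name is not defined in
        # this post, so the class name is used here as a stand-in (assumption)
        name = type(clf).__name__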
        classifier_name.append(name)
        if prtParam == 1:
            print(clf)    
        if verbose == 1:  # print status
            print('\nPerforming Cross-Validation on Classifier %s of %s:' 
                  % (len(classifier_name), len(classifiers)))
            print(name)
        
        # perform k-fold cross validation for this classifier and calculate scores for each split
        kth_train_accuracy = []
        kth_test_accuracy = []
        kth_test_f1_score = []
        kth_feature_importances = []
        
        for (train, test) in kfold.split(X_data, y_data):
        
            clf.set_params(**params)
            print(clf)
            print(params)
            # NOTE: the OneVsOneClassifier wrapper built here is discarded; only clf itself is fitted
            OneVsOneClassifier(clf.fit(X_data.iloc[train], y_data.iloc[train]))
            
            kth_train_accuracy.append(clf.score(X_data.iloc[train], y_data.iloc[train]))
            print('1.1')
            kth_test_accuracy.append(clf.score(X_data.iloc[test], y_data.iloc[test]))
            print('2.2')
            kth_test_f1_score.append(f1_score(y_true=y_data.iloc[test], y_pred=clf.predict(X_data.iloc[test]), average='weighted'))
            print('3.3')
            
            if hasattr(clf, 'feature_importances_'):  # some classifiers (like linReg) lack this attribute
                print(clf.feature_importances_)
                kth_feature_importances.append(clf.feature_importances_)
        
        # populate scoring statistics for this classifier (over all cross-validation splits)
        train_accuracy_mean.append(np.mean(kth_train_accuracy))
        print('4')
        train_accuracy_std.append(np.std(kth_train_accuracy))
        print('5')
        test_accuracy_mean.append(np.mean(kth_test_accuracy))
        print('6')
        test_accuracy_std.append(np.std(kth_test_accuracy))
        print('7')
        f1_score_mean.append(np.mean(kth_test_f1_score))
        print('8')
        print('8-1')
        f1_score_std.append(np.std(kth_test_f1_score))
        print('9')
        print(kth_test_f1_score)
    
        # obtain array of mean feature importances, if this classifier had that attribute
        print('9-1')
        print(kth_feature_importances)
        
        if len(kth_feature_importances) == 0:
            print('10')
            print(mean_feature_importances)
            mean_feature_importances.append(False)
        else:
            print('10.1')
            mean_feature_importances.append(np.mean(kth_feature_importances, axis=0))
        
        # if requested, also export classifier after fitting on the complete training set 
        if return_trained_classifiers is not False:
            print('12')
            clf.fit(X_data, y_data)
            print('13')
            trained_classifiers.append(clf)
            print('14')
        # remove AdaBoost feature importances (we won't discuss their interpretation)
        if isinstance(clf, AdaBoostClassifier):
            print('15')
            mean_feature_importances[-1] = False
        
    print('16')
    # construct dataframe for comparison of classifiers
    clf_comparison = pd.DataFrame({'Classifier Name' : classifier_name, 
                                   'Mean Train Accuracy' : train_accuracy_mean, 
                                   'Train Accuracy Standard Deviation' : train_accuracy_std,
                                   'Mean Test Accuracy' : test_accuracy_mean, 
                                   'Test Accuracy Standard Deviation' : test_accuracy_std, 
                                   'Mean Test F1-Score' : f1_score_mean,
                                   'F1-Score Standard Deviation' : f1_score_std})
    print('17')
    # enforce the desired column order
    clf_comparison = clf_comparison[['Classifier Name', 'Mean Train Accuracy',
                                     'Train Accuracy Standard Deviation', 'Mean Test Accuracy',
                                     'Test Accuracy Standard Deviation', 'Mean Test F1-Score',
                                     'F1-Score Standard Deviation']]

    print('18')
    # add return_trained_classifiers to the function return, if requested, otherwise omit
    if return_trained_classifiers is not False:
        print('19')
        print(clf_comparison)
        print(mean_feature_importances)
        print(trained_classifiers)
        return clf_comparison, mean_feature_importances, trained_classifiers
    else:
        print('20')
        return clf_comparison, mean_feature_importances

This code and the attachment should help you reproduce the error. The dataframe can be downloaded from the link above and placed next to the script to run the code. I believe I have included all the packages needed to run the code; if not, please import them.

dfage_train = pd.read_csv('StackoverFlow_Example.csv')
y1 = dfage_train['AgeBin']
X1 = dfage_train
X1 = X1.drop(['AgeBin'], axis=1)

num_jobs=-1  # I'll use all available CPUs when possible

Ageclassifier_list = [LogisticRegression(n_jobs=num_jobs, solver='lbfgs'),
                      RandomForestClassifier(criterion = 'entropy',n_estimators=100, n_jobs=num_jobs),
                      LinearSVC(class_weight=None,random_state=27,multi_class='ovr')]
   

X1['Pclass'] = X1['Pclass'].astype(int)
X1['isMale'] = X1['isMale'].astype(bool)
X1['Embarked'] = X1['Embarked'].astype(int)

clf_comp_Full_FeatureSet, mean_feature_importances = train_MultiClass_classifier_ensemble_CV(classifiers=Ageclassifier_list, prtParam = 1,
                                                                                             verbose=1,
                                                                                             X_data=X1,
                                                                                             y_data=y1)

Error output

ValueError: too many values to unpack (expected 2)

1 Answer

0 votes
/ July 11, 2020

Depending on a condition, your function train_MultiClass_classifier_ensemble_CV returns either 2 or 3 values. Don't do that, because a mismatch can occur when you assign the returned values. With the default return_trained_classifiers=True it returns 3 values, but your call unpacks them into only two variables. Here is the problematic part:

    if return_trained_classifiers is not False:
        print('19')
        print(clf_comparison)
        print(mean_feature_importances)
        print(trained_classifiers)
        return clf_comparison, mean_feature_importances, trained_classifiers # three here
    else:
        print('20')
        return clf_comparison, mean_feature_importances # two here
...
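
To make the mismatch concrete, here is a minimal sketch that reproduces the error without scikit-learn and shows two ways around it. Everything in it is hypothetical and only for illustration: train_stub stands in for train_MultiClass_classifier_ensemble_CV and returns placeholder values, and CVResult / train_stub_fixed are names I made up to show a return value whose shape does not depend on the flag.

```python
from typing import NamedTuple, Optional


def train_stub(return_trained_classifiers=True):
    """Hypothetical stand-in that mimics the original 2-or-3 value return."""
    clf_comparison = {"Classifier Name": ["clf_a"]}
    mean_feature_importances = [False]
    trained_classifiers = ["fitted clf_a"]
    if return_trained_classifiers:
        return clf_comparison, mean_feature_importances, trained_classifiers
    return clf_comparison, mean_feature_importances


# Reproduce the error: three values come back, only two names receive them.
error_message = None
try:
    comparison, importances = train_stub()
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # message starts with "too many values to unpack (expected 2"

# Call-site fix A: unpack all three values.
comparison, importances, trained = train_stub()

# Call-site fix B: request only two values.
comparison, importances = train_stub(return_trained_classifiers=False)


class CVResult(NamedTuple):
    """Hypothetical fixed-shape result: always three fields."""
    clf_comparison: dict
    mean_feature_importances: list
    trained_classifiers: Optional[list]


def train_stub_fixed(return_trained_classifiers=True):
    """Same stand-in, but the return shape no longer depends on the flag."""
    clf_comparison, mean_feature_importances, trained = train_stub(True)
    if not return_trained_classifiers:
        trained = None  # the field is still present, just empty
    return CVResult(clf_comparison, mean_feature_importances, trained)


result = train_stub_fixed(return_trained_classifiers=False)
print(result.trained_classifiers)  # None
```

For the code in the question, the smallest change is probably fix A or fix B at the call site: either add a third name on the left-hand side, or pass return_trained_classifiers=False. The fixed-shape variant is a design suggestion, not something the original function already does.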