Балансировка набора данных с использованием imblearn недосэмплинга, передискретизации и объединения - PullRequest
0 голосов
/ 05 марта 2020

У меня есть несбалансированный набор данных

data['Class'].value_counts()
Out[22]: 
0    137757
1      4905
Name: Class, dtype: int64
X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20, random_state=seed)
print(sorted(Counter(y_train).items()))
[(0, 110215), (1, 3914)]

Я пробовал разные функции imblearn

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import  ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN, InstanceHardnessThreshold, NeighbourhoodCleaningRule, TomekLinks

smote_enn = SMOTEENN(random_state=27)
smote_tomek = SMOTETomek(random_state=27)

adasyn = ADASYN(random_state=27)
borderline = BorderlineSMOTE(random_state=27)
ran_oversample = RandomOverSampler(random_state=27)
smote = SMOTE(random_state=27)

cnn = CondensedNearestNeighbour(random_state=27) 
enn = EditedNearestNeighbours(random_state=27)
renn = RepeatedEditedNearestNeighbours(random_state=27)
allknn = AllKNN(random_state=27)
iht = InstanceHardnessThreshold(random_state=0)
ncr = NeighbourhoodCleaningRule(random_state=27)
tomek = TomekLinks(random_state=27)

создал разные поезда

def BalancingData(function):
    X_train_resampled, y_train_resampled = function.fit_sample(X_train, y_train)
    print(sorted(Counter(y_train_resampled).items()))
    return X_train_resampled, y_train_resampled

X_train_smote_enn, y_train_smote_enn = BalancingData(smote_enn)
X_train_smote_tomek, y_train_smote_tomek = BalancingData(smote_tomek)

X_train_adasyn, y_train_adasyn = BalancingData(adasyn)
X_train_borderline, y_train_borderline = BalancingData(borderline)
X_train_ran_oversample, y_train_ran_oversample = BalancingData(ran_oversample)
X_train_smote, y_train_smote = BalancingData(smote)

X_train_cnn, y_train_cnn = BalancingData(cnn)
X_train_enn, y_train_enn = BalancingData(enn)
X_train_renn, y_train_renn = BalancingData(renn)
X_train_allknn, y_train_allknn = BalancingData(allknn)
X_train_iht, y_train_iht = BalancingData(iht)
X_train_ncr, y_train_ncr = BalancingData(ncr)
X_train_tomek, y_train_tomek = BalancingData(tomek)

Затем построил Keras модель

def my_model(X,y):
    Keras_model = Sequential()
    Keras_model.add(Dense(33,activation='sigmoid', kernel_initializer='glorot_uniform', kernel_constraint=maxnorm(12), input_shape=(input_len,))) 
    Keras_model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform')) 
    Keras_model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    Keras_model.fit(X, y, validation_data=(X_valid,y_valid), batch_size = 1000, epochs = 10, verbose = 0)
    scores_valid = Keras_model.evaluate(X_valid, y_valid, verbose=1)
    scores_train = Keras_model.evaluate(X, y, verbose=1)
    scores_full = Keras_model.evaluate(X_train, y_train, verbose=1)

вызов с другим поездом imb и действительными наборами

my_model(X_train, y_train)
my_model(X_train_smote_enn, y_train_smote_enn)
my_model(X_train_smote_tomek, y_train_smote_tomek)
and so on

Я нахожу очень незначительное улучшение точности в allKNN, это начальный тест, я знал, что мы все еще можем настроить параметры в allKNN и попытаться улучшить, мой вопрос, почему я не могу найти много улучшений, без использования функций балансировки сама базовая модель, показывающая лучшую точность? enter image description here

...