Keras training crashes mid-epoch after several epochs complete correctly
0 votes
/ September 27, 2018

I am trying to build a CuDNNGRU-based model that predicts a sequence of 7 interrelated features. Here is my Keras model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
cu_dnngru_1 (CuDNNGRU)       (None, 49, 100)           32700
_________________________________________________________________
dropout_1 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_2 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_2 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_3 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_3 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_4 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_4 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_5 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_5 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_6 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_6 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_7 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_7 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
flatten_1 (Flatten)          (None, 4900)              0
_________________________________________________________________
dense_1 (Dense)              (None, 7)                 34307
=================================================================
Total params: 430,607
Trainable params: 430,607
Non-trainable params: 0

I am trying to run this model for a larger number of epochs. The first several epochs run fine, but then training fails with an error:

[Model] Model Compiled
Time taken: 0:00:02.314468
[Model] Training Started
[Model] 100 epochs, 1000 batch size, 20.0 batches per epoch
Epoch 1/100
20/20 [==============================] - 5s 240ms/step - loss: 0.1631 - acc: 0.2905
Epoch 2/100
20/20 [==============================] - 2s 81ms/step - loss: 0.1288 - acc: 0.2455
Epoch 3/100
20/20 [==============================] - 1s 73ms/step - loss: 0.0952 - acc: 0.5058
Epoch 4/100
20/20 [==============================] - 2s 76ms/step - loss: 0.1141 - acc: 0.3288
Epoch 5/100
20/20 [==============================] - 2s 75ms/step - loss: 0.1064 - acc: 0.3425
Epoch 6/100
20/20 [==============================] - 1s 75ms/step - loss: 0.0767 - acc: 0.4213
Epoch 7/100
20/20 [==============================] - 1s 75ms/step - loss: 0.0635 - acc: 0.4764
Epoch 8/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0555 - acc: 0.5274
Epoch 9/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0544 - acc: 0.5141
Epoch 10/100
...
Epoch 61/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0506 - acc: 0.3925
Epoch 62/100
20/20 [==============================] - 1s 72ms/step - loss: 0.0495 - acc: 0.4323
Epoch 63/100
20/20 [==============================] - 1s 73ms/step - loss: 0.0495 - acc: 0.4118
Epoch 64/100
 2/20 [==>...........................] - ETA: 1s - loss: 0.0495 - acc: 0.4885Traceback (most recent call last):
  File "./run.py", line 118, in <module>
    main()
  File "./run.py", line 92, in main
    steps_per_epoch=steps_per_epoch)
  File "/home/sridhar/PE_CSV/alarmProj/rnn/lstm/core/model.py", line 149, in train_generator
    workers=70)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 1415, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training_generator.py", line 213, in fit_generator
    class_weight=class_weight)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 1209, in train_on_batch
    class_weight=class_weight)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 749, in _standardize_user_data
    exception_prefix='input')
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training_utils.py", line 127, in standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected cu_dnngru_1_input to have 3 dimensions, but got array with shape (380, 1)

If I reduce the number of epochs to a value below the failure point (here, below epoch 64), I have no problems, but increasing the number of epochs triggers the error above at some point. The exact epoch at which the failure occurs can change with any change in configuration. The same problem is observed with vanilla GRU / LSTM layers.
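Because the failure point depends on how many batches have been drawn so far (epochs times steps_per_epoch), one way to narrow it down is to wrap the generator and report the first batch whose input does not have the expected rank. The following is only a minimal sketch: `checked_batches` is a hypothetical helper, not part of Keras, and it assumes the generator yields `(x, y)` pairs.

```python
import numpy as np


def checked_batches(gen, expected_ndim=3):
    """Pass batches through unchanged, failing loudly on the first bad input rank.

    Hypothetical debugging helper (not a Keras API). `gen` is assumed to
    yield (x, y) pairs, as the training generator in the question does.
    """
    for n, (x, y) in enumerate(gen):
        x = np.asarray(x)
        if x.ndim != expected_ndim:
            # Report the batch index and the offending shape.
            raise ValueError(
                'batch %d: expected %d dimensions, got shape %s'
                % (n, expected_ndim, x.shape))
        yield x, y
```

Passing `checked_batches(data_gen)` in place of `data_gen` would surface the index of the first malformed batch instead of an error from deep inside Keras. Note that the wrapper ends whenever the wrapped generator ends, so it does not by itself make a finite generator safe to train on.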

This is keras-2.2.2, and the model is run with 70 worker threads.

Is there anything I can do to avoid this problem?

Edit: Here is the relevant (approximate) code:

import multiprocessing

import tensorflow as tf
from keras import backend as K
from keras.layers import CuDNNGRU, Dense, Dropout, Flatten
from keras.optimizers import SGD

# Let TensorFlow use every CPU core for inter- and intra-op parallelism.
session_conf = tf.ConfigProto(
    inter_op_parallelism_threads=multiprocessing.cpu_count(),
    intra_op_parallelism_threads=multiprocessing.cpu_count())
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

# Seven stacked CuDNNGRU layers, each followed by dropout.
# The first layer takes 49 timesteps of 7 features.
self.model.add(CuDNNGRU(
    100,
    input_shape=(49, 7),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
    100,
    input_shape=(None, None),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
    100,
    input_shape=(None, None),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
    100,
    input_shape=(None, None),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
    100,
    input_shape=(None, None),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
    100,
    input_shape=(None, None),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
    100,
    input_shape=(None, None),
    kernel_initializer='orthogonal',
    return_sequences=True))
self.model.add(Dropout(0.4))

# Flatten the (49, 100) sequence output and map it to the 7 target features.
self.model.add(Flatten())
self.model.add(Dense(7, activation='relu'))

sgd = SGD(lr=0.1, decay=1e-2, clipnorm=5.0)

self.model.compile(
    loss='mse',
    metrics=["accuracy"],
    optimizer=sgd)
===================

    def train_generator(self, data_gen, epochs, batch_size, steps_per_epoch):
        timer = Timer()
        timer.start()
        print('[Model] Training Started')
        print('[Model] %s epochs, %s batch size, %s batches per epoch' %
              (epochs, batch_size, steps_per_epoch))

        save_fname = '%s/%s-e%s.h5' % (self.model_dir, dt.datetime.now()
                                       .strftime('%d%m%Y-%H%M%S'), str(epochs))
        callbacks = [
            ModelCheckpoint(
                filepath=save_fname, monitor='loss', save_best_only=True)
        ]
        try:
            self.model.fit_generator(
                data_gen,
                steps_per_epoch=steps_per_epoch,
                epochs=epochs,
                callbacks=callbacks)
        except Exception:
            # Drop into the debugger to inspect the failing batch.
            pdb.set_trace()

        print('[Model] Training Completed. Model saved as %s' % save_fname)
        timer.stop()
=============
    # invoked from the main function
    model.train_generator(
        data_gen=data.generate_train_batch(
            seq_len=50,
            batch_size=1000,
            normalise=False),
        epochs=100,
        batch_size=1000,
        steps_per_epoch=steps_per_epoch)
=============

    def generate_train_batch(self, seq_len, batch_size, normalise):
        '''Yield a generator of training data from filename on given list of cols split for train/test'''
        i = 0
        while i < (self.len_train - seq_len):
            x_batch = []
            y_batch = []
            for b in range(batch_size):
                if i >= (self.len_train - seq_len):
                    # stop-condition for a smaller final batch if data doesn't divide evenly

                    yield np.array(x_batch), np.array(y_batch)
                x, y = self._next_window(i, seq_len, normalise)
                x_batch.append(x)
                y_batch.append(y)
                i += 1

            yield np.array(x_batch), np.array(y_batch)
=======================

1 Answer

0 votes
/ September 29, 2018

The generator was wrong. It was written as a finite generator, whereas Keras expects one that keeps yielding batches indefinitely.
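A rough sketch of what an endless version could look like is given here. It is not the asker's actual fix: the question does not show `_next_window`, so the windowing (the first `seq_len - 1` rows as input, the last row as target) and the `data` array argument are assumptions made only to keep the example self-contained.

```python
from itertools import islice

import numpy as np


def generate_train_batch(data, seq_len, batch_size):
    """Yield (x, y) batches forever, wrapping to the first window at the end.

    Assumption: each window is `seq_len` consecutive rows of `data`; the
    first seq_len - 1 rows are the input and the last row is the target.
    """
    n_windows = len(data) - seq_len
    if n_windows <= 0:
        raise ValueError('data is too short for the requested seq_len')
    i = 0
    while True:  # never exhausts, so Keras can keep drawing batches
        x_batch, y_batch = [], []
        for _ in range(batch_size):
            if i >= n_windows:
                i = 0  # wrap around to the start of the data
                if x_batch:
                    break  # emit the smaller final batch first
            window = data[i:i + seq_len]
            x_batch.append(window[:-1])
            y_batch.append(window[-1])
            i += 1
        yield np.array(x_batch), np.array(y_batch)


# Usage: draw more batches than one pass over the data contains.
demo = np.arange(70, dtype=float).reshape(10, 7)
shapes = [x.shape for x, _ in
          islice(generate_train_batch(demo, seq_len=4, batch_size=4), 5)]
```

Every batch stays three-dimensional, including the smaller one at the end of each pass, and the generator restarts from the first window instead of ending. With a generator like this, the number of epochs no longer has to fit within a single pass over the data.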
