Loss not decreasing in Keras Seq2seq Bidirectional LSTM with attention
1 vote / 20 October 2019

Can anyone figure out why the loss of this model is not decreasing?

I tried to integrate a bidirectional LSTM with the attention model from the end of Andrew Ng's Deep Learning specialization (https://www.coursera.org/learn/nlp-sequence-models/notebook/npjGi/neural-machine-translation-with-attention), but for some reason the model does not seem to converge.

I am running it in Google Colab.

The network takes two tensors as input, with the following shapes:

encoder_input_data[m, 262, 28]
decoder_target_data[m, 28, 28]

The output is a list of 27 one-hot vectors.

The length of each one-hot vector is 28:

(26 characters of the alphabet + an end token + a start token)
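The returnData() function that builds these arrays is not shown in the question. A minimal sketch of how such one-hot data could be constructed (the token inventory, the one_hot helper, and the example strings are assumptions for illustration, not the original code):

import string
import numpy as np

#hypothetical token inventory: 26 letters + start + end = 28 tokens
tokens = list(string.ascii_lowercase) + ['<start>', '<end>']
token_index = {tok: i for i, tok in enumerate(tokens)}

def one_hot(seq, length):
    #encode a token sequence as a (length, 28) one-hot matrix, zero-padded
    out = np.zeros((length, len(tokens)), dtype='float32')
    for t, tok in enumerate(seq):
        out[t, token_index[tok]] = 1.0
    return out

x = one_hot(list('hello') + ['<end>'], 262)   #one encoder sample, padded to Tx = 262
y = one_hot(list('olleh') + ['<end>'], 28)    #one decoder target, padded to Ty = 28
print(x.shape, y.shape)                       #(262, 28) (28, 28)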

**Overall structure:**

0) Input [262, 28] ->

1) Encoder: bidirectional LSTM ->

2) the backward and forward outputs are concatenated into encoder_outputs ->

3) Decoder LSTM + attention (a small numeric sketch of this step follows the list) ->

   * concatenate the previous decoder state s with each encoder activation a(t)
   * pass that through two dense layers and compute the alphas
   * sum alpha(t) * a(t) over all timesteps t

4) Softmax layer produces the outputs
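To make the attention step concrete, here is a minimal NumPy sketch of a single attention step. The toy sizes, random weights, and the np_softmax helper are illustrative assumptions, not part of the original code; the real Keras implementation follows below.

import numpy as np

def np_softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

#toy sizes: Tx = 4 encoder timesteps, n_a = 6 units (stands in for 2*latent_dim)
Tx, n_a = 4, 6
rng = np.random.default_rng(0)
a = rng.standard_normal((Tx, n_a))   #encoder activations a(1)..a(Tx)
s_prev = rng.standard_normal(n_a)    #previous decoder state s(t-1)

#repeat s(t-1) across all Tx timesteps and concatenate with a -> (Tx, 2*n_a)
concat = np.concatenate([a, np.tile(s_prev, (Tx, 1))], axis=-1)

#two dense layers (tanh then relu) with random weights -> energies of shape (Tx, 1)
W1, W2 = rng.standard_normal((2 * n_a, 10)), rng.standard_normal((10, 1))
energies = np.maximum(0.0, np.tanh(concat @ W1) @ W2)

#softmax over the timestep axis -> attention weights alpha(t) that sum to 1
alphas = np_softmax(energies, axis=0)

#context vector = sum_t alpha(t) * a(t) -> shape (n_a,)
context = (alphas * a).sum(axis=0)
print(alphas.sum(), context.shape)   #~1.0 (6,)

The full Keras code: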

from keras.models import Model
from keras.layers import Input, LSTM, Dense, Bidirectional, Concatenate
from keras.layers import RepeatVector, Activation, Permute, Dot, Multiply
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.activations import softmax
from textwrap import wrap
import re
import random
import string
import numpy as np
import copy
from google.colab import files
from google.colab import drive

#drive.mount('/content/drive')
#files.upload()

#returnData() (defined elsewhere, not shown) creates 3 arrays:
#encoder_input_data[m, 262, 28]
#decoder_input_data[m, 28, 28] <- not used for now
#decoder_target_data[m, 28, 28]

#special softmax needed for the attention layer: normalize over the timestep
#axis (axis=1) so the weights over the 262 encoder positions sum to 1
def softMaxAxis1(x):
    return softmax(x, axis=1)

#layers needed for the attention
repeator = RepeatVector(262)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")
activator = Activation(softMaxAxis1, name='attention_weights')
dotor = Dot(axes=1)

#compute one timestep of attention
#repeat s(t-1) for all the a(t) so far and concatenate them so that
#the algorithm can select the old a(t) based on current s
#let a dense layer compute the energies and a softmax decide
def one_step_attention(a, s_prev):
    s_prev = repeator(s_prev)
    concat = concatenator([a, s_prev])
    e = densor1(concat)
    energies = densor2(e)
    alphas = activator(energies)
    context = dotor([alphas, a])    
    return context
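
#shape walk-through for one_step_attention (added for clarity; sizes assume
#latent_dim = 128 and Tx = 262, as set below):
#  a:        (batch, 262, 256)  bi-LSTM outputs, forward+backward concatenated
#  s_prev:   (batch, 256) -> repeator -> (batch, 262, 256)
#  concat:   (batch, 262, 512) -> densor1 -> (batch, 262, 10)
#                              -> densor2 -> (batch, 262, 1)  energies
#  alphas:   (batch, 262, 1)   softmax over the timestep axis (axis=1)
#  context:  (batch, 1, 256)   Dot(axes=1) of alphas and a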

#variables needed for the model
encoder_input_data, decoder_input_data, decoder_target_data = returnData()
batch_size = 64
epochs = 50
latent_dim = 128
num_samples = 1000
num_tokens = 28
Tx = 262

#encoder part with a bi-LSTM with dropout
encoder_inputs = Input(shape=(Tx, num_tokens))
encoder = Bidirectional(LSTM(latent_dim, return_sequences=True, dropout=0.7))
encoder_outputs = encoder(encoder_inputs)

#decoder part with a regular LSTM
decoder_lstm = LSTM(latent_dim*2, return_state=True)
decoder_dense = Dense(num_tokens, activation='softmax')

#initialize the parameters needed for computing attention alphas 
s0 = Input(shape=(latent_dim*2,))
c0 = Input(shape=(latent_dim*2,))
s = s0
c = c0
outputs=[]

#run one attention step for each target timestep Ty
for t in range(num_tokens-1):
    context = one_step_attention(encoder_outputs, s)
    s, _, c = decoder_lstm(context, initial_state=[s, c])
    out = decoder_dense(s)
    outputs.append(out)

#define the model and connect the graph
model = Model([encoder_inputs, s0, c0], outputs)

#select optimizer, loss, early_stopping
model.compile(optimizer='adam', loss='categorical_crossentropy')
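#note: with 27 separate output tensors, the `loss` Keras reports is the sum of
#the per-output categorical crossentropies, while dense_77_loss tracks a single
#output head - hence the large gap between the two in the training log below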
keras_callbacks = [EarlyStopping(monitor='val_loss', patience=30)]

#prepare empty arrays for s0 and c0 and put the target data in the same form as the outputs
s0 = np.zeros((encoder_input_data.shape[0], latent_dim*2))
c0 = np.zeros((encoder_input_data.shape[0], latent_dim*2))
outputs = list(decoder_target_data.swapaxes(0,1))

#fit the model with the expected dimensions of input/output
model.fit(
    [encoder_input_data, s0, c0], 
    outputs,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1
)

#save and download the model
model.save('s2s.h5')
files.download("s2s.h5")

During training I get the following output:

Train on 11671 samples, validate on 1297 samples
Epoch 1/50
11671/11671 [==============================] - 719s 62ms/step - loss: 86.7096 - dense_77_loss: 1.0157 - val_loss: 85.6579 - val_dense_77_loss: 0.5682
Epoch 2/50
11671/11671 [==============================] - 672s 58ms/step - loss: 87.6775 - dense_77_loss: 2.0322 - val_loss: 88.3077 - val_dense_77_loss: 2.4503
Epoch 3/50
11671/11671 [==============================] - 670s 57ms/step - loss: 86.1718 - dense_77_loss: 0.6686 - val_loss: 85.1008 - val_dense_77_loss: 0.1771
Epoch 4/50
11671/11671 [==============================] - 666s 57ms/step - loss: 85.1310 - dense_77_loss: 0.1196 - val_loss: 84.8357 - val_dense_77_loss: 0.0205
Epoch 5/50
11671/11671 [==============================] - 666s 57ms/step - loss: 84.7977 - dense_77_loss: 0.0173 - val_loss: 84.7414 - val_dense_77_loss: 0.0072
Epoch 6/50
11671/11671 [==============================] - 655s 56ms/step - loss: 87.8612 - dense_77_loss: 2.4636 - val_loss: 87.3005 - val_dense_77_loss: 1.3145
Epoch 7/50
11671/11671 [==============================] - 662s 57ms/step - loss: 88.1340 - dense_77_loss: 2.5091 - val_loss: 89.6831 - val_dense_77_loss: 4.6627
Epoch 8/50
11671/11671 [==============================] - 666s 57ms/step - loss: 88.2948 - dense_77_loss: 2.6113 - val_loss: 86.4465 - val_dense_77_loss: 0.1490
Epoch 9/50
11671/11671 [==============================] - 666s 57ms/step - loss: 87.3295 - dense_77_loss: 1.8405 - val_loss: 85.1743 - val_dense_77_loss: 0.1448
Epoch 10/50
11671/11671 [==============================] - 661s 57ms/step - loss: 85.0535 - dense_77_loss: 0.1180 - val_loss: 84.8204 - val_dense_77_loss: 0.0236
Epoch 11/50
11671/11671 [==============================] - 662s 57ms/step - loss: 84.7884 - dense_77_loss: 0.0179 - val_loss: 84.7479 - val_dense_77_loss: 0.0050
Epoch 12/50
11671/11671 [==============================] - 665s 57ms/step - loss: 87.0466 - dense_77_loss: 1.9977 - val_loss: 89.4181 - val_dense_77_loss: 4.4239
Epoch 13/50
 1216/11671 [==>...........................] - ETA: 9:34 - loss: 89.7864 - dense_77_loss: 4.8242

Any help would be greatly appreciated!
