My DQN isn't training well: the reward doesn't change and the loss keeps growing
Asked 11 July 2019

I'm trying to train an agent to play Gradius using gym-retro and the DQNAgent from keras-rl, but it isn't working: the reward doesn't increase and the loss keeps growing. I can't figure out what's wrong.

Part of the output is below.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 32, 30, 28)        8224      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 15, 64)        28736     
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 16, 15, 64)        36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 15360)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3932416   
_________________________________________________________________
dense_2 (Dense)              (None, 36)                9252      
=================================================================
Total params: 4,015,556
Trainable params: 4,015,556
Non-trainable params: 0
_________________________________________________________________
None
Training for 1500000 steps ...
    2339/1500000: episode: 1, duration: 47.685s, episode steps: 2339, steps per second: 49, episode reward: 2500.000, mean reward: 1.069 [0.000, 100.000], mean action: 19.122 [0.000, 35.000], mean observation: 0.029 [0.000, 0.980], loss: 36.018083, mean_absolute_error: 11.380395, mean_q: 18.252860
    3936/1500000: episode: 2, duration: 51.391s, episode steps: 1597, steps per second: 31, episode reward: 1800.000, mean reward: 1.127 [0.000, 100.000], mean action: 19.312 [0.000, 35.000], mean observation: 0.027 [0.000, 0.980], loss: 64.386497, mean_absolute_error: 54.420486, mean_q: 68.424599
    6253/1500000: episode: 3, duration: 75.020s, episode steps: 2317, steps per second: 31, episode reward: 3500.000, mean reward: 1.511 [0.000, 100.000], mean action: 16.931 [0.000, 35.000], mean observation: 0.029 [0.000, 0.980], loss: 177.966461, mean_absolute_error: 153.478119, mean_q: 177.061630




#(snip)





 1493035/1500000: episode: 525, duration: 95.634s, episode steps: 2823, steps per second: 30, episode reward: 5100.000, mean reward: 1.807 [0.000, 500.000], mean action: 19.664 [0.000, 35.000], mean observation: 0.034 [0.000, 0.980], loss: 26501204410368.000000, mean_absolute_error: 86211024.000000, mean_q: 90254256.000000
 1495350/1500000: episode: 526, duration: 78.401s, episode steps: 2315, steps per second: 30, episode reward: 2500.000, mean reward: 1.080 [0.000, 100.000], mean action: 18.652 [0.000, 34.000], mean observation: 0.029 [0.000, 0.980], loss: 23247718449152.000000, mean_absolute_error: 84441184.000000, mean_q: 88424568.000000
 1497839/1500000: episode: 527, duration: 84.667s, episode steps: 2489, steps per second: 29, episode reward: 3700.000, mean reward: 1.487 [0.000, 500.000], mean action: 21.676 [0.000, 35.000], mean observation: 0.034 [0.000, 0.980], loss: 23432217493504.000000, mean_absolute_error: 80286264.000000, mean_q: 83946064.000000
done, took 49517.509 seconds
end!

The program runs on my university's server, which I connect to over SSH.

The output of pip freeze is below.

absl-py==0.7.1
alembic==1.0.10
asn1crypto==0.24.0
astor==0.8.0
async-generator==1.10
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
certifi==2019.3.9
certipy==0.1.3
cffi==1.12.3
chardet==3.0.4
cloudpickle==1.2.1
cryptography==2.6.1
cycler==0.10.0
decorator==4.4.0
defusedxml==0.6.0
EasyProcess==0.2.7
entrypoints==0.3
future==0.17.1
gast==0.2.2
google-pasta==0.1.7
grpcio==1.21.1
gym==0.13.0
gym-retro==0.7.0
h5py==2.9.0
idna==2.8
ipykernel==5.1.0
ipython==7.5.0
ipython-genutils==0.2.0
jedi==0.13.3
Jinja2==2.10.1
jsonschema==3.0.1
jupyter-client==5.2.4
jupyter-core==4.4.0
jupyterhub==1.0.0
jupyterhub-ldapauthenticator==1.2.2
jupyterlab==0.35.6
jupyterlab-server==0.2.0
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
ldap3==2.6
Mako==1.0.10
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.0.3
mistune==0.8.4
nbconvert==5.5.0
nbformat==4.4.0
notebook==5.7.8
numpy==1.16.4
oauthlib==3.0.1
pamela==1.0.0
pandocfilters==1.4.2
parso==0.4.0
pexpect==4.7.0
pickleshare==0.7.5
pipenv==2018.11.26
prometheus-client==0.6.0
prompt-toolkit==2.0.9
protobuf==3.8.0
ptyprocess==0.6.0
pyasn1==0.4.5
pycparser==2.19
pycurl==7.43.0
pyglet==1.3.2
Pygments==2.4.0
pygobject==3.20.0
pyOpenSSL==19.0.0
pyparsing==2.4.0
pyrsistent==0.15.2
python-apt==1.1.0b1+ubuntu0.16.4.2
python-dateutil==2.8.0
python-editor==1.0.4
PyVirtualDisplay==0.2.4
PyYAML==5.1.1
pyzmq==18.0.1
requests==2.21.0
scipy==1.3.0
Send2Trash==1.5.0
six==1.12.0
SQLAlchemy==1.3.3
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
tornado==6.0.2
traitlets==4.3.2
unattended-upgrades==0.1
urllib3==1.24.3
virtualenv==16.5.0
virtualenv-clone==0.5.3
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.4
wrapt==1.11.2
xvfbwrapper==0.2.9

I suspect there is some problem with the first conv2d layer, possibly related to the window_length of SequentialMemory. I think the first conv2d layer isn't receiving or convolving the input correctly, so I transposed the batch in process_state_batch of the CustomProcessor class, but that didn't solve the problem.
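
For reference, the transpose I tried in process_state_batch looked roughly like this (a minimal sketch, assuming the model is built channels_last so that the window axis ends up as the channel axis; the class name is just for illustration):

import numpy as np
import rl.core

class TransposingProcessor(rl.core.Processor):

    def process_state_batch(self, batch):
        # keras-rl hands over a batch shaped (batch, window_length, height, width);
        # moving the window axis to the end turns it into the channel axis of a
        # channels_last Conv2D input: (batch, height, width, window_length)
        return np.asarray(batch).transpose(0, 2, 3, 1)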

Everything I wrote is below.

#import all i need

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core
import sys
import gym
from PIL import Image

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

#set window size
win_size = (112,120)

#set log file
fo = open('log.txt', 'w')
sys.stdout = fo

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

"""
keras add extra dimension for batch.
and add history dimention for SequentialMemory.
but conv2d isn't able to accept 5D input.
so i'd made my processor class.
CustomProcessor convert RGB input into Gray input.

and conv2d layer convolute data which's shape (win_len, win_hei, history).
so transpose batch.

for more information, ref url bellow.
https://github.com/keras-rl/keras-rl/issues/229
"""

class CustomProcessor(rl.core.Processor):

    def process_observation(self, observation):
        img = Image.fromarray(observation)
        img = img.resize(win_size).convert('L')
        # scale pixel values to [0, 1]
        return np.array(img) / 255


    #def process_state_batch(self, batch):
        #batch = batch.transpose(0,2,3,1)
        #print(batch.shape)
        #return batch


myprocessor = CustomProcessor()

"""
Gradius have action space which can take 9 action in same moment.
so i gotta discrete action space.
the way i'd taken is wrapping env class.
"""

class Discretizer(gym.ActionWrapper):

    def __init__(self, env):
        super(Discretizer, self).__init__(env)
        self._actions = [[0,0,0,0,0,0,0,0,0],
               [0,0,0,0,0,0,0,0,1],
               [1,0,0,0,0,0,0,0,0],
               [1,0,0,0,0,0,0,0,1],
               [0,0,0,0,1,0,0,0,0],
               [0,0,0,0,0,1,0,0,0],
               [0,0,0,0,0,0,1,0,0],
               [0,0,0,0,0,0,0,1,0],
               [0,0,0,0,1,0,1,0,0],
               [0,0,0,0,1,0,0,1,0],
               [0,0,0,0,0,1,0,1,0],
               [0,0,0,0,0,1,1,0,0],]
        for i in range(8):
            self._actions.append((np.array(self._actions[1]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[2]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[3]) + np.array(self._actions[i + 4])).tolist())
        self.actions = []
        # get_action_meaning is called for each combo, but its result is not
        # stored, so self.actions stays empty
        for action in self._actions:
            env.get_action_meaning(action)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

env = retro.make(game="Gradius-Nes", record="./Record")
env = Discretizer(env)

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

nb_actions = env.action_space.n

normal = k.initializers.glorot_normal()
model = k.Sequential()
win_len = 4
model.add(k.layers.Conv2D(
    32, kernel_size=8, strides=4, padding="same",
    input_shape=(4,120,112), kernel_initializer=normal,
    activation="relu", data_format='channels_first'))
print("chack")
model.add(k.layers.Conv2D(
    64, kernel_size=4, strides=2, padding="same",
    kernel_initializer=normal,
    activation="relu"))
model.add(k.layers.Conv2D(
    64, kernel_size=3, strides=1, padding="same",
    kernel_initializer=normal,
    activation="relu"))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(256, kernel_initializer=normal,
                         activation="relu"))
model.add(k.layers.Dense(nb_actions,
                         kernel_initializer=normal,
                         activation="linear"))

memory = rl.memory.SequentialMemory(limit=50000, window_length=win_len)
policy = rl.policy.EpsGreedyQPolicy()

"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=1000,
               target_model_update=1e-2, policy=policy)

dqn.compile(k.optimizers.Adam(lr=1e-3), metrics=['mae'])
print(model.summary())
hist = dqn.fit(env, nb_steps=1500000, visualize=False, verbose=2)
print("end!")
dqn.save_weights("test_model.h5f", overwrite=True)

env.close()

PS:

I tried the following fixes: 1) add max-pooling layers and an extra dense layer, 2) use gradient clipping, 3) lower Adam's learning rate. It still doesn't work. The code is below.
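
For clarity, changes 2) and 3) come down to the optimizer used when compiling the agent (values copied from the full listing below); change 1) is the extra MaxPooling2D and Dense layers in the model definition:

import keras as k

# gradient clipping (clipnorm) and a much smaller learning rate for Adam;
# this optimizer is passed to dqn.compile(...) in the listing below
optimizer = k.optimizers.Adam(lr=1e-7, clipnorm=1.)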

#import all i need

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core
import sys
import gym
from PIL import Image

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

#set window size
win_size = (224,240)

#set log file
#fo = open('log.txt', 'w')
#sys.stdout = fo

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

"""
keras add extra dimension for batch.
and add history dimention for SequentialMemory.
but conv2d isn't able to accept 5D input.
so i'd made my processor class.
CustomProcessor convert RGB input into Gray input.

and conv2d layer convolute data which's shape (win_len, win_hei, history).
so transpose batch.

for more information, ref url bellow.
https://github.com/keras-rl/keras-rl/issues/229
"""

class CustomProcessor(rl.core.Processor):

    def process_observation(self, observation):
        img = Image.fromarray(observation)
        img = img.resize(win_size).convert('L')
        # scale pixel values to [0, 1]
        return np.array(img) / 255


    #def process_state_batch(self, batch):
        #batch = batch.transpose(0,2,3,1)
        #print(batch.shape)
        #return batch


myprocessor = CustomProcessor()

"""
Gradius have action space which can take 9 action in same moment.
so i gotta discrete action space.
the way i'd taken is wrapping env class.
"""

class Discretizer(gym.ActionWrapper):

    def __init__(self, env):
        super(Discretizer, self).__init__(env)
        self._actions = [[0,0,0,0,0,0,0,0,0],
               [0,0,0,0,0,0,0,0,1],
               [1,0,0,0,0,0,0,0,0],
               [1,0,0,0,0,0,0,0,1],
               [0,0,0,0,1,0,0,0,0],
               [0,0,0,0,0,1,0,0,0],
               [0,0,0,0,0,0,1,0,0],
               [0,0,0,0,0,0,0,1,0],
               [0,0,0,0,1,0,1,0,0],
               [0,0,0,0,1,0,0,1,0],
               [0,0,0,0,0,1,0,1,0],
               [0,0,0,0,0,1,1,0,0],]
        for i in range(8):
            self._actions.append((np.array(self._actions[1]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[2]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[3]) + np.array(self._actions[i + 4])).tolist())
        self.actions = []
        # get_action_meaning is called for each combo, but its result is not
        # stored, so self.actions stays empty
        for action in self._actions:
            env.get_action_meaning(action)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

env = retro.make(game="Gradius-Nes", record="./Record")
env = Discretizer(env)

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

nb_actions = env.action_space.n

normal = k.initializers.glorot_normal()
model = k.Sequential()
win_len = 4
model.add(k.layers.Conv2D(
    32, kernel_size=8, strides=4, padding="same",activation="relu", 
    input_shape=(win_len,240,224), kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Conv2D(
    64, kernel_size=4, strides=2, padding="same",activation="relu", 
    kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Conv2D(
    64, kernel_size=3, strides=1, padding="same",activation="relu", 
    kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(1024, kernel_initializer=normal, activation="relu"))
model.add(k.layers.Dense(1024, kernel_initializer=normal, activation="relu"))
model.add(k.layers.Dense(nb_actions,
                         kernel_initializer=normal,
                         activation="linear"))

memory = rl.memory.SequentialMemory(limit=50000, window_length=win_len)
policy = rl.policy.EpsGreedyQPolicy()

"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=50000,
               target_model_update=1e-6, policy=policy)

dqn.compile(k.optimizers.Adam(lr=1e-7, clipnorm=1.), metrics=['mae'])
print(model.summary())
hist = dqn.fit(env, nb_steps=750000, visualize=False, verbose=2)
print("end!")
dqn.save_weights("test_model.h5f", overwrite=True)

env.close()
...