I have a GAN model class with three models inside it (a discriminator, a generator, and a combined model). The code runs successfully on a single GPU with TensorFlow 2.1. Now I plan to train on multiple GPUs with a larger batch size, in order to get a better model and a shorter training time.
Below is my code without parallelization:
from tensorflow import keras
from tensorflow.keras.optimizers import Adam

# Discriminator
self.discriminator_model = image_gan_discriminator(patch_shape=patch_shape,
                                                   num_patches=num_patches,
                                                   image_shape=image_shape,
                                                   spectral_normalization=True)
self.discriminator_model.build(tuple([None] + list(image_shape)))
self.discriminator_model.compile(loss='binary_crossentropy', optimizer=Adam(self.lr_d, 0.5))

# Generator
self.generator_model = image_gan_generator(noise_shape=noise_shape,
                                            image_shape=image_shape,
                                            spectral_normalization=True)
self.generator_model.compile(loss='binary_crossentropy', optimizer=Adam(self.lr_g, 0.5))
self.generator_model.build(tuple([None] + list(noise_shape)))

# Combined model: generator followed by the frozen discriminator
self.discriminator_model.trainable = False
self.combined_model = keras.Sequential(layers=[self.generator_model, self.discriminator_model], name="combined")
self.combined_model.build(tuple([None] + list(noise_shape)))
self.combined_model.compile(loss='binary_crossentropy', optimizer=Adam(self.lr_c, 0.5))
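For context, the training loop itself is not shown; in a GAN setup like this the Adam slot variables are only created lazily at the first training call. A training step for such a setup typically looks something like the sketch below (this is an illustration, not my exact code; batch_size, real_images and the label shapes are placeholders and depend on the patch-based discriminator output):

import numpy as np

# Illustrative training step: the first train_on_batch call is where the
# Adam slot variables (m, v) are created for each model.
noise = np.random.normal(size=(batch_size,) + tuple(noise_shape)).astype('float32')
fake_images = self.generator_model.predict(noise)

# Discriminator step on real and generated images (label shapes are placeholders).
self.discriminator_model.train_on_batch(real_images, np.ones((batch_size, 1)))
self.discriminator_model.train_on_batch(fake_images, np.zeros((batch_size, 1)))

# Generator step through the combined model, with the discriminator frozen.
self.combined_model.train_on_batch(noise, np.ones((batch_size, 1)))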
Here is the code after parallelization. I simply add a MirroredStrategy and wrap the same code in its scope:
import tensorflow as tf

tf.debugging.set_log_device_placement(True)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    self.discriminator_model = image_gan_discriminator(patch_shape=patch_shape,
                                                       num_patches=num_patches,
                                                       image_shape=image_shape,
                                                       spectral_normalization=True)
    self.discriminator_model.build(tuple([None] + list(image_shape)))
    self.discriminator_model.compile(loss='binary_crossentropy', optimizer=Adam(self.lr_d, 0.5))

    self.generator_model = image_gan_generator(noise_shape=noise_shape,
                                                image_shape=image_shape,
                                                spectral_normalization=True)
    self.generator_model.compile(loss='binary_crossentropy', optimizer=Adam(self.lr_g, 0.5))
    self.generator_model.build(tuple([None] + list(noise_shape)))

    self.discriminator_model.trainable = False
    self.combined_model = keras.Sequential(layers=[self.generator_model, self.discriminator_model], name="combined")
    self.combined_model.build(tuple([None] + list(noise_shape)))
    self.combined_model.compile(loss='binary_crossentropy', optimizer=Adam(self.lr_c, 0.5))
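As an aside, since the whole point is a larger batch: MirroredStrategy splits the global batch across the replicas, so a common pattern is to scale the batch size by the number of GPUs (values below are placeholders, not from my code):

# Sketch: derive the global batch size from the per-replica batch size.
batch_size_per_replica = 64  # placeholder value
global_batch_size = batch_size_per_replica * strategy.num_replicas_in_sync
print('Replicas in sync:', strategy.num_replicas_in_sync)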
Here is the error I get:
ValueError: Trying to create optimizer slot variable under the scope for tf.distribute.Strategy (<tensorflow.python.distribute.distribute_lib._DefaultDistributionStrategy object at 0x7f2599567190>), which is different from the scope used for the original variable (MirroredVariable:{
0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Variable 'spectral_normalization_11/bias:0' shape=(65536,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>,
1 /job:localhost/replica:0/task:0/device:GPU:1: <tf.Variable 'spectral_normalization_11/bias/replica_1:0' shape=(65536,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>
}). Make sure the slot variables are created under the same strategy scope. This may happen if you're restoring from a checkpoint outside the scope
2020-04-09 15:15:17.678734: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op DestroyResourceOp in device /job:localhost/replica:0/task:0/device:GPU:0
2020-04-09 15:15:17.678970: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op DestroyResourceOp in device /job:localhost/replica:0/task:0/device:GPU:1
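As far as I understand the message, the model weights were created as MirroredVariables inside strategy.scope(), but the Adam slot variables are later created under the default strategy, i.e. outside the scope, which would happen at the first training call or when restoring a checkpoint outside the scope. Is the following the pattern the strategy expects? This is a minimal self-contained sketch based on the TensorFlow distributed-training guide; the Dense model and dummy data are placeholders, not my GAN.

import numpy as np
import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Everything that creates variables lives inside the scope:
    # the layer weights at build/compile time and, later, the Adam slot variables.
    model = keras.Sequential([keras.layers.Dense(1, activation='sigmoid', input_shape=(10,))])
    model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(1e-4))

# Dummy data, just enough to trigger the first optimizer step.
x = np.random.rand(128, 10).astype('float32')
y = np.random.randint(0, 2, size=(128, 1)).astype('float32')

# Because the model was compiled inside the scope, fit() runs the training step
# under the same MirroredStrategy, so the slot variables are created there too.
model.fit(x, y, batch_size=64, epochs=1)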