Retraining a loaded model doesn't seem to use the GPU properly (very slow to train)
2 votes
/ April 22, 2020

I have no problems training a model on my GPU. However, when I load a model from an .h5 file and fit it on more data, training becomes incredibly slow: an epoch goes from 28 seconds to 201 seconds.
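To put a number on the slowdown, here is a minimal sketch that uses only the epoch times and the sample count reported in the training logs of this question. The variable names are my own and purely illustrative.

```python
# Figures copied from the two training logs in this question.
samples = 34849
fresh_epoch_s = 28     # epoch time for the model trained from scratch
loaded_epoch_s = 201   # epoch time for the model loaded from model.h5

# How many times slower the loaded model trains per epoch.
slowdown = loaded_epoch_s / fresh_epoch_s

# Average time spent per training sample, in milliseconds.
fresh_ms_per_sample = 1000 * fresh_epoch_s / samples
loaded_ms_per_sample = 1000 * loaded_epoch_s / samples

print(f"slowdown: {slowdown:.1f}x")
print(f"per sample: {fresh_ms_per_sample:.2f} ms vs {loaded_ms_per_sample:.2f} ms")
```

That is roughly a sevenfold slowdown, which is in line with the per-sample figures Keras prints (808us versus 6ms).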

Apart from building the architecture and loading and saving the model, my code is identical to the examples below.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization

EPOCHS = 10
BATCH_SIZE = 32

model = Sequential()
model.add(LSTM(128, input_shape=(X_train_lstm.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(BatchNormalization())

model.add(Dense(1))

opt = tf.keras.optimizers.Adam(lr=0.0005, decay=1e-6)

model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mean_absolute_error'])


history = model.fit(X_train_lstm, y_train_lstm, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X_test_lstm, y_test_lstm))


model.save("model.h5")


Training my model from scratch seems to work fine and fully uses my GPU, as shown below:

2020-04-23 00:29:27.838398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.710605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-23 00:29:29.741211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:29:29.747629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.760441: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:29.769311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:29:29.774739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:29:29.785242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:29:29.792404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:29:29.815372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:29:29.819504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:29:29.822104: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-23 00:29:29.827926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:29:29.835707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.839583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:29.843859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:29:29.847116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:29:29.851727: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:29:29.855977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:29:29.859217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:29:29.863804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:29:30.482013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-23 00:29:30.486022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-04-23 00:29:30.488195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-04-23 00:29:30.491115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 34849 samples, validate on 7421 samples
Epoch 1/10
2020-04-23 00:29:35.276148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:35.530107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll

34849/34849 [==============================] - 28s 808us/sample - loss: 0.3903 - mean_absolute_error: 0.4317 - val_loss: 0.0015 - val_mean_absolute_error: 0.0341

However, when I load a model instead of training one from scratch, training is incredibly slow, and the log does not show dynamic library cudnn64_7.dll being opened right before training starts (at the bottom), unlike in the run above:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.models import load_model

EPOCHS = 10
BATCH_SIZE = 32

model = load_model("model.h5")

history = model.fit(X_train_lstm, y_train_lstm, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X_test_lstm, y_test_lstm))
2020-04-23 00:37:30.650618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.426823: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-23 00:37:32.459919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:37:32.465038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.475213: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:37:32.481768: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:37:32.485735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:37:32.493894: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:37:32.499155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:37:32.520215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:37:32.523377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:37:32.525383: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-23 00:37:32.530055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:37:32.538667: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.542212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:37:32.545335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:37:32.549384: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:37:32.552674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:37:32.556134: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:37:32.559841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:37:32.563358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:37:33.191288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-23 00:37:33.194467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-04-23 00:37:33.196314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-04-23 00:37:33.198647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 34849 samples, validate on 7421 samples
Epoch 1/10
2020-04-23 00:37:37.688378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll

34849/34849 [==============================] - 201s 6ms/sample - loss: 0.0074 - mean_absolute_error: 0.0604 - val_loss: 2.7296e-04 - val_mean_absolute_error: 0.0151
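To make the difference between the two logs explicit, here is a minimal sketch that lists which dynamic libraries each run opens after the "Epoch 1/10" line. The helper name `libs_opened_after_epoch_start` is my own, and the log strings are excerpts copied from the two runs above.

```python
import re

# Matches TensorFlow's "Successfully opened dynamic library <name>" log lines.
LIB_RE = re.compile(r"Successfully opened dynamic library (\S+)")


def libs_opened_after_epoch_start(log_text):
    """Return the libraries opened after the first 'Epoch 1/' line, in order."""
    libs = []
    training_started = False
    for line in log_text.splitlines():
        if line.startswith("Epoch 1/"):
            training_started = True
            continue
        if training_started:
            match = LIB_RE.search(line)
            if match:
                libs.append(match.group(1))
    return libs


# Excerpt from the run that trains from scratch.
fresh_log = """Epoch 1/10
2020-04-23 00:29:35.276148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:35.530107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
"""

# Excerpt from the run that continues training the loaded model.
loaded_log = """Epoch 1/10
2020-04-23 00:37:37.688378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
"""

fresh_libs = libs_opened_after_epoch_start(fresh_log)
loaded_libs = libs_opened_after_epoch_start(loaded_log)

# Libraries the fresh run opens at training time that the loaded run does not.
missing = [lib for lib in fresh_libs if lib not in loaded_libs]
print(missing)
```

Both runs open cudnn64_7.dll during device setup, so the only difference this shows is that the loaded model never opens it again once training starts. My guess, which I have not confirmed, is that the LSTM layers of the loaded model are not running on the cuDNN kernel.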

How can I go about solving this problem?
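One check I can think of, assuming the slowdown comes from the LSTM layers falling back to the generic (non-cuDNN) kernel: the Keras LSTM documentation lists the settings a layer needs in order to use cuDNN. The sketch below compares a layer config dict against the settings that are visible in the config. The helper `cudnn_blockers` and the example config are hypothetical and not part of TensorFlow; masking and eager execution, which the documentation also mentions, are not checked here.

```python
# Settings the Keras LSTM documentation requires for the cuDNN kernel
# (only the ones that appear in a layer's config dict).
CUDNN_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(layer_config):
    """Return {setting: (actual, required)} for values that rule out cuDNN."""
    blockers = {}
    for key, required in CUDNN_REQUIREMENTS.items():
        # A setting absent from the config is reported rather than assumed fine.
        actual = layer_config.get(key, "<missing>")
        if actual != required:
            blockers[key] = (actual, required)
    return blockers


# Hypothetical config, shaped like the dict returned by layer.get_config().
example_config = {
    "units": 128,
    "activation": "tanh",
    "recurrent_activation": "hard_sigmoid",  # made-up value for illustration
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}

print(cudnn_blockers(example_config))
```

With the loaded model, this would be called as `cudnn_blockers(layer.get_config())` for every layer in `model.layers` that is an `LSTM`. My original layers all use the default settings, so I would expect empty dicts; if the loaded layers report the same, the cause must lie somewhere other than the layer settings.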

...