tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
June 22, 2019

I want to train an LSTM network on a GPU (Nvidia Quadro P5000); tensorflow-gpu 1.13 is installed.

The network:

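from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

# SEQ_LEN, FEATURES, NAME, BATCH_SIZE, EPOCHS and the train/validation arrays are defined elsewhere.
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(SEQ_LEN, FEATURES)))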
model.add(Dropout(0.3))
model.add(LSTM(256))
model.add(Dropout(0.3))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))
filepath = NAME + "_{epoch:02d}-{val_acc:.3f}"
# The keyword arguments belong to ModelCheckpoint, not to str.format.
checkpoint = ModelCheckpoint("models/{}.model".format(filepath), monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')  # saves only the best ones
history = model.fit(
    train_x, train_y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(validation_x, validation_y),
    callbacks=[tensorboard, checkpoint],
    verbose=2)
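
A side note on the checkpoint filename above: it goes through two rounds of string formatting. The first builds the path template. The second is applied by the checkpoint callback at the end of each epoch. The sketch below reproduces both rounds in plain Python; the run name and the metric values are made up for illustration, and the second round only approximates what Keras does internally.

```python
# NAME is a hypothetical run name used only for this illustration.
NAME = "lstm-demo"

# Round 1: build the template. str.format does not re-scan the text it
# substitutes, so the {epoch...} and {val_acc...} placeholders survive intact.
filepath = NAME + "_{epoch:02d}-{val_acc:.3f}"
template = "models/{}.model".format(filepath)
print(template)

# Round 2: roughly what a checkpoint callback does after an epoch,
# filling the placeholders from the epoch number and the logged metrics
# (the values here are invented).
rendered = template.format(epoch=3, val_acc=0.8)
print(rendered)
```

Because the placeholders are only filled in the second round, every saved checkpoint gets a distinct name that includes its epoch and validation accuracy.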

When I run the code for the first time, I get this output:

WARNING:tensorflow:From C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\layers\core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 200, 256)          269312    
_________________________________________________________________
dropout (Dropout)            (None, 200, 256)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               32896     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 827,649
Trainable params: 827,649
Non-trainable params: 0
_________________________________________________________________
Train on 135 samples, validate on 15 samples
WARNING:tensorflow:From C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-06-22 13:32:38.312006: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-06-22 13:32:38.661805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:18:00.0
totalMemory: 16.00GiB freeMemory: 13.40GiB
2019-06-22 13:32:38.662434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-22 13:32:39.314138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-22 13:32:39.314871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-22 13:32:39.315399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-22 13:32:39.316240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12964 MB memory) -> physical GPU (device: 0, name: Quadro P5000, pci bus id: 0000:18:00.0, compute capability: 6.1)
Epoch 1/1000
2019-06-22 13:32:54.748055: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-06-22 13:32:55.055557: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.056085: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.056398: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.056787: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.057127: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.057525: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.057834: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.058159: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.065639: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.065970: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.066302: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.066660: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.067030: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.067353: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.067684: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.067977: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.072992: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.073554: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.075976: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.076362: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.076852: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.078560: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.078807: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.079266: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.080071: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.080366: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.080835: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.081167: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.081488: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.081843: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.082937: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.083244: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.083599: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.083848: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.085391: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.085731: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.086090: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.086451: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.087272: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.087620: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.087862: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.088335: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.088958: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.089280: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.089578: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.089914: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.090225: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.090537: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.090909: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.091175: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.098724: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-06-22 13:32:55.099470: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "<input>", line 108, in <module>
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\engine\training.py", line 880, in fit
    validation_steps=validation_steps)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 329, in model_iteration
    batch_outs = f(ins_batch)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(1, 256), b.shape=(256, 256), m=1, n=256, k=256
     [[{{node lstm/while/MatMul_4}}]]
     [[{{node loss/mul}}]]

I found that this problem can be solved with:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
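
One thing worth checking with a snippet like this: the session is created but never handed to Keras, so Keras can still open its own session when the model is built or fitted. Below is a minimal sketch of registering the session explicitly, assuming the TF 1.x API that ships with tensorflow-gpu 1.13 (`tf.ConfigProto`, `tf.Session`, `tf.keras.backend.set_session`). It is not confirmed to resolve the errors in this question.

```python
import tensorflow as tf

# Assumes TF 1.x (e.g. tensorflow-gpu 1.13); in TF 2.x these names moved.
config = tf.ConfigProto()
# Allocate GPU memory on demand instead of reserving almost all of it up front.
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

# Register the session with Keras so that it is used for building and
# training the model instead of a session Keras would create on its own.
tf.keras.backend.set_session(session)

# Build, compile and fit the model only after this point.
```

This is meant to run in a fresh Python process, before any other TensorFlow code has touched the GPU.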

Running the network after this code results in:

Traceback (most recent call last):
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 20, in <module>
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\engine\training.py", line 880, in fit
    validation_steps=validation_steps)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 215, in model_iteration
    mode=mode)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\callbacks.py", line 106, in configure_callbacks
    callback_list.set_model(callback_model)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\callbacks.py", line 178, in set_model
    callback.set_model(model)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\callbacks.py", line 1010, in set_model
    self._init_writer()
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\callbacks.py", line 947, in _init_writer
    self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\backend.py", line 482, in get_session
    _initialize_variables(session)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\keras\backend.py", line 758, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\eiLink\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

The code runs without any errors when I run it on the CPU using tf.device('/cpu:0').

...