CUDNN_STATUS_INTERNAL_ERROR in tensorflow-gpu 2.0

1 vote
/ 16 October 2019

When I run a CNN in TensorFlow 2.0, I get CUDNN_STATUS_INTERNAL_ERROR. Both libcublas.so.10.0 and libcudnn.so.7 appear to load without problems.

The versions should be fine:

  • TensorFlow 2.0
  • Ubuntu 18.04
  • GeForce GTX 1650
  • NVIDIA driver 430
  • cuDNN: 7.4.2.24 (also tried 7.3.0.29 and 7.6.4.38) (ref)
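One way to double-check which cuDNN version the headers actually describe is to read the version macros from cudnn.h. The sketch below assumes a cuDNN 7 layout, where cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, and it assumes the header lives at /usr/include/cudnn.h (the path mentioned in this question). The helper name parse_cudnn_version is made up for illustration. It reads only those three macros, so it will not show a fourth build component.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn.h text, or None if a macro is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as: #define CUDNN_MAJOR 7
        match = re.search(
            r"^\s*#\s*define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    header = Path("/usr/include/cudnn.h")  # assumed location, taken from the question
    if header.exists():
        print(parse_cudnn_version(header.read_text(errors="replace")))
    else:
        print("cudnn.h not found at", header)
```

This only tells you what the installed header says. It does not prove that libcudnn.so.7 loaded at run time comes from the same package.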

I tried the following steps, but they did not fix the problem:

  1. Deleted ~/.nv (ref)
  2. In /usr/include/cudnn.h, changed #include "driver_types.h" to #include <driver_types.h>, after which the mnistCUDNN test passed (ref)
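For reference, the two steps above can be sketched in Python as follows. This is only an illustration of what was done by hand: the function names patch_include and clear_nv_cache are hypothetical, and nothing here is run automatically.

```python
import shutil
from pathlib import Path

OLD_INCLUDE = '#include "driver_types.h"'
NEW_INCLUDE = "#include <driver_types.h>"


def patch_include(header_text):
    """Return header text with the quoted include replaced by the angle-bracket form."""
    return header_text.replace(OLD_INCLUDE, NEW_INCLUDE)


def clear_nv_cache(home):
    """Delete <home>/.nv if it exists; return True when a directory was removed."""
    cache = Path(home) / ".nv"
    if cache.is_dir():
        shutil.rmtree(cache)
        return True
    return False
```

Applying patch_include to the contents of /usr/include/cudnn.h and writing the result back would normally require root privileges, and clear_nv_cache(Path.home()) corresponds to step 1.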

Questions:

  1. Does passing the mnistCUDNN test mean that the required packages are installed correctly?
  2. How can I fix the problem shown below?

Finally, here is the error message:

Using TensorFlow backend.
2019-10-16 14:48:16.226892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44]  Successfully opened dynamic library libcuda.so.1
2019-10-16 14:48:16.255123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006]    successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
2019-10-16 14:48:16.370703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
Train on 48000 samples, validate on 12000 samples
Epoch 1/12
2019-10-16 14:48:17.357747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-16 14:48:17.525865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
--error here--
2019-10-16 14:48:17.873127: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-16 14:48:17.879412: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
--error here--
2019-10-16 14:48:17.879516: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
 [[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
  File "lenet.py", line 96, in <module>
    x_train, y_train, batch_size=128, epochs=12, validation_split=0.2
  File "lenet.py", line 83, in train
    verbose=self.verbose
  File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3740, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
    return self._call_impl(args, kwargs)
  File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
 [[node conv2d_1/convolution (defined at /home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_1220]

Function call stack:
keras_scratch_graph
...
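
One thing that may be worth testing, although nothing in the log above confirms it as the cause, is whether the cuDNN handle fails to initialize because TensorFlow has already reserved most of the GPU memory. The sketch below asks TensorFlow to allocate GPU memory on demand using tf.config.experimental.list_physical_devices and tf.config.experimental.set_memory_growth. The wrapper name enable_gpu_memory_growth is hypothetical, and the call is assumed to run at the very top of the script, before any model is built, because memory growth cannot be changed once the GPU has been initialized.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand.

    Returns the number of GPUs configured, or None if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU is first used, otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs configured for memory growth:", enable_gpu_memory_growth())
```

If the error persists with memory growth enabled, the cause lies elsewhere, for example in a mismatch between the driver, CUDA and cuDNN versions.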