Cannot use the GPU with tensorflow-gpu: "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR"
June 29, 2019

Summary of my problem

When I run code with tensorflow-gpu, I get the error message shown in the title. The error occurs in every script that contains a convolution layer.

Environment

  • Ubuntu 18.04
  • Python 3.7.1
  • tensorflow-gpu 1.13.1
  • CUDA 10.1
  • CuDNN 7.4.2

GPU details

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    21W / 215W |    568MiB /  7949MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1733      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1771      G   /usr/bin/gnome-shell                          57MiB |
|    0      2698      G   /usr/lib/xorg/Xorg                           175MiB |
|    0      2813      G   /usr/bin/gnome-shell                         168MiB |
|    0      3339      G   ...uest-channel-token=11703333986562712743    76MiB |
|    0      8579      G   /proc/self/exe                                67MiB |
+-----------------------------------------------------------------------------+

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/bin
CUDA_PATH=/usr/local/cuda-10.0
LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/lib64
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH    "
export PATH="/usr/local/cuda/bin:$PATH"
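One thing that may be worth checking: the environment list above says CUDA 10.1, while the paths point at /usr/local/cuda-10.0. The sketch below pulls the cuda-X.Y version out of PATH-style strings so the two can be compared. The helper name cuda_versions_in_path is made up for illustration, the path values are shortened to the relevant entries from the question, and whether the mismatch is related to the error is only an assumption.

```python
import re


def cuda_versions_in_path(path_value):
    """Return the set of versions named by 'cuda-X.Y' directories in a PATH-style string."""
    versions = set()
    for entry in path_value.split(":"):
        match = re.search(r"cuda-(\d+\.\d+)", entry.strip())
        if match:
            versions.add(match.group(1))
    return versions


# Values shortened from the question; on a real machine, read os.environ instead.
path = "/usr/local/bin:/usr/bin:/usr/local/cuda-10.0/bin"
ld_library_path = "/usr/local/bin:/usr/bin:/usr/local/cuda-10.0/lib64"
reported_cuda = "10.1"  # from the environment list in the question

found = cuda_versions_in_path(path) | cuda_versions_in_path(ld_library_path)
print("CUDA versions on the paths:", sorted(found))
print("Differs from reported version:", sorted(found - {reported_cuda}))
```

If the two versions differ, it may be worth confirming which toolkit is actually installed, for example by listing the /usr/local/cuda* directories.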

Full error message

2019-06-29 23:13:22.132275: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-29 23:13:22.803064: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-29 23:13:22.805965: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train.py", line 90, in <module>
    main(args)
  File "train.py", line 81, in main
    callbacks=[callback]
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node block1_conv1/Conv2D}}]]
     [[{{node loss/arc_face_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]
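For a long log like this, it can help to pull out just the CUDA libraries TensorFlow actually opened and the lines logged at error level. The sketch below does that with regular expressions over the first three log lines from the question. The function name summarize_tf_log is hypothetical, and the patterns assume the log line layout shown above.

```python
import re

# First three log lines copied from the question.
LOG = """\
2019-06-29 23:13:22.132275: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-29 23:13:22.803064: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-29 23:13:22.805965: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
"""


def summarize_tf_log(log_text):
    """Return (CUDA libraries opened, messages logged at level E)."""
    loaded = re.findall(r"successfully opened CUDA library (\S+) locally", log_text)
    # Layout assumed: "<date> <time>: E <source file>:<line>] <message>"
    errors = re.findall(r"^\S+ \S+ E [^\]]+\] (.*)$", log_text, flags=re.MULTILINE)
    return loaded, errors


loaded, errors = summarize_tf_log(LOG)
print("Opened libraries:", loaded)
print("Distinct errors:", sorted(set(errors)))
```

Here the only library reported as opened is libcublas.so.10.0, which suggests this TensorFlow build is loading CUDA 10.0 libraries.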

It says "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR", so I believe the problem is caused by cuDNN. I have tried several suggested fixes, such as sudo rm -rf ~/.nv/ from this question and config.gpu_options.allow_growth = True from this GitHub issue, but neither solved the problem.

Please tell me how to solve this problem.

Thank you.
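For reference, this is roughly how the allow_growth setting is usually applied with the TensorFlow 1.x API and tf.keras. It is a minimal sketch that assumes tensorflow-gpu 1.x is installed; the function name make_growth_session is made up for illustration and is not taken from train.py.

```python
def make_growth_session():
    """Create a TF 1.x session that allocates GPU memory on demand and register it with Keras."""
    # Imported lazily so the module can be loaded without TensorFlow;
    # assumes the TensorFlow 1.x API (tf.ConfigProto, tf.Session).
    import tensorflow as tf

    config = tf.ConfigProto()
    # Grow GPU memory use as needed instead of reserving nearly all of it up front.
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    # Make tf.keras use this session for all subsequent graph execution.
    tf.keras.backend.set_session(session)
    return session
```

The call would go once at the start of the script, before the model is built or trained, so that Keras does not create its own default session first.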
