Не удалось решить эту ошибку: не удалось создать дескриптор cudnn: CUDNN_STATUS_INTERNAL_ERROR - PullRequest
0 голосов
/ 28 марта 2019

Я пытаюсь тренировать CNN с керасом, тензорным бэкэндом.Это работало, и внезапно я получил эту ошибку.

Я попытался rm -rf ~/.nv/, но это не сработало.

РЕДАКТИРОВАТЬ: я создал другую виртуальную среду, и она работает.

Вот полный журнал:

Using TensorFlow backend.
2019-03-28 09:23:59.602808: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-28 09:23:59.686104: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-28 09:23:59.686506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.3415
pciBusID: 0000:01:00.0
totalMemory: 5.94GiB freeMemory: 5.36GiB
2019-03-28 09:23:59.686525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-28 09:23:59.910495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-28 09:23:59.910530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-03-28 09:23:59.910536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-03-28 09:23:59.910694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5126 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Epoch 1/10
2019-03-28 09:24:24.618996: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-28 09:24:24.921106: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train_models.py", line 63, in <module>
    workers=6)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/fadwa/.virtualenvs/dl_tf_Py3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/conv2d_1/convolution_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/conv2d_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
         [[{{node loss/mul/_113}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_853_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
...