When I tried to use distributed training, I ran into a cuDNN error:
Failed to initialize cuDNN
The GPU has plenty of memory; nvidia-smi shows that only 102 MB is in use.
CUDA 10.1, TensorFlow 2.1, cuDNN 7.6.5; all versions are up to date.
I even set config.gpu_options.allow_growth = True (see the snippet below).
Can anyone help me with this problem?
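For context, this is roughly how I enable memory growth (a minimal sketch, not the full worker.py; in TF 2.x the TF 1.x session option config.gpu_options.allow_growth corresponds to tf.config.experimental.set_memory_growth):

    import tensorflow as tf

    # TF 2.x equivalent of the TF 1.x option
    #   config.gpu_options.allow_growth = True
    # Ask TensorFlow to allocate GPU memory on demand instead of
    # reserving almost all of it at startup.
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)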
Full error log:
2020-04-29 09:01:30.184547: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64
2020-04-29 09:01:30.184643: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64
2020-04-29 09:01:30.184660: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-04-29 09:01:30.728942: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-29 09:01:30.801630: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.802477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:30.802600: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.803364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:30.804118: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-29 09:01:30.819996: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-29 09:01:30.822466: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-29 09:01:30.823123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-29 09:01:30.827450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-29 09:01:30.830013: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-29 09:01:30.835374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-29 09:01:30.835527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.836349: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.837131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.837895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.838568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-29 09:01:30.838902: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-29 09:01:30.846028: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000134999 Hz
2020-04-29 09:01:30.846366: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4522c60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-29 09:01:30.846396: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-29 09:01:31.066134: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.069463: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.070509: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4598920 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-29 09:01:31.070536: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-04-29 09:01:31.070542: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2020-04-29 09:01:31.071050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.071823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:31.071949: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.072675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:31.072750: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-29 09:01:31.072792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-29 09:01:31.072814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-29 09:01:31.072837: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-29 09:01:31.072855: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-29 09:01:31.072875: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-29 09:01:31.072897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-29 09:01:31.072978: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.073755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.074494: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.075230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.075934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-29 09:01:31.076050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-29 09:01:31.078176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-29 09:01:31.078206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-04-29 09:01:31.078213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N N
2020-04-29 09:01:31.078229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: N N
2020-04-29 09:01:31.078466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.079271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.080016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.080730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14249 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2020-04-29 09:01:31.081303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.082033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14249 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5)
WARNING:tensorflow:From /home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/ops/image_ops_impl.py:1556: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train for 390 steps
Epoch 1/60
2020-04-29 09:01:55.079654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-29 09:02:05.724980: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 33418 of 50000
2020-04-29 09:02:10.436522: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
2020-04-29 09:02:10.438717: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-29 09:02:11.453483: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-04-29 09:02:11.456625: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-04-29 09:02:11.482157: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/resnet56/conv1/Conv2D}}]]
[[metrics/sparse_categorical_accuracy/div_no_nan/AddN_1/_28]]
2020-04-29 09:02:11.482214: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/resnet56/conv1/Conv2D}}]]
[[replica_1/loss/ArithmeticOptimizer/HoistCommonFactor_Mul_add_1/_8]]
2020-04-29 09:02:11.482314: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/resnet56/conv1/Conv2D}}]]
2020-04-29 09:02:12.282331: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-04-29 09:02:12.284817: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
1/390 [..............................] - ETA: 3:47:17
Traceback (most recent call last):
File "worker.py", line 88, in <module>
epochs=NUM_EPOCHS)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
total_epochs=epochs)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
return self._stateless_fn(*args, **kwds)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
self.captured_inputs)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node replica_1/resnet56/conv1/Conv2D (defined at usr/lib/python3.6/threading.py:916) ]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node replica_1/resnet56/conv1/Conv2D (defined at usr/lib/python3.6/threading.py:916) ]]
[[metrics/sparse_categorical_accuracy/div_no_nan/AddN_1/_28]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_35914]
Function call stack:
distributed_function -> distributed_function
2020-04-29 09:02:12.502074: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-04-29 09:02:12.502969: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled