Invalid device ordinal error on a Linux GPU server?
June 5, 2019

I am trying to run TensorFlow code on a GPU that I access remotely over SSH. I connect from Windows CMD via SSH, which gives me a Linux terminal on the server. I want the code to run on the server's GPU rather than the CPU, so I installed TensorFlow-GPU. I use a conda environment to run Python. When I start Python and import tensorflow, I get the following error. Could you please help me resolve it?

>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-06-05 13:26:45.280912: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-05 13:26:45.309892: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3399505000 Hz
2019-06-05 13:26:45.311731: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ffc55f7cc0 executing computations on platform Host. Devices:
2019-06-05 13:26:45.311780: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-06-05 13:26:45.315413: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-05 13:26:46.043595: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11719409664
2019-06-05 13:26:46.044237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ffc84a7710 executing computations on platform CUDA. Devices:
2019-06-05 13:26:46.044297: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-06-05 13:26:46.044308: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-06-05 13:26:46.044322: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1570, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 693, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (3). Valid range is [0, 2].
        while setting up XLA_GPU_JIT device number 3
>>> sess=tf.Session()
2019-06-05 13:27:09.128360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2019-06-05 13:27:09.130001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0b:00.0
2019-06-05 13:27:09.131300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:41:00.0
2019-06-05 13:27:09.132226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:42:00.0
2019-06-05 13:27:09.133262: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-05 13:27:09.135097: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-05 13:27:09.385582: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-05 13:27:09.486764: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-05 13:27:09.489289: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-05 13:27:10.140611: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-05 13:27:10.145133: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-05 13:27:10.159615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1570, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 693, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (3). Valid range is [0, 2].
        while setting up XLA_GPU_JIT device number 3
>>> exit()

The GPU details are below:

(pradyumnaenv) cse563@falcon:~$ nvidia-smi
Wed Jun  5 15:53:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   44C    P8    19W / 250W |  10791MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0B:00.0 Off |                  N/A |
| 22%   53C    P8    21W / 250W |  10791MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:41:00.0 Off |                  N/A |
| 31%   57C    P8    22W / 250W |    677MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:42:00.0 Off |                  N/A |
|100%   91C    P2   132W / 250W |  11105MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     78117      C   python                                     10777MiB |
|    1     47128      C   python                                     10777MiB |
|    2     79606      C   ...t_nagpal/miniconda3/envs/dnn/bin/python   667MiB |
|    3     83393      C   python                                     11095MiB |
+-----------------------------------------------------------------------------+
...
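The log reports CUDA_ERROR_OUT_OF_MEMORY for device ordinal 3, and the nvidia-smi table shows GPUs 0, 1 and 3 almost fully occupied by other processes while GPU 2 is mostly free. One commonly used workaround, offered here as a sketch and not as a fix confirmed in this thread, is to expose only the free GPU to the process through the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. The memory figures are copied from the table above; the helper name pick_free_gpus and the 4096 MiB threshold are hypothetical choices made for illustration.

```python
import os

# (GPU index, used MiB, total MiB), copied by hand from the nvidia-smi table above.
GPU_MEMORY_MIB = [
    (0, 10791, 11178),
    (1, 10791, 11178),
    (2, 677, 11178),
    (3, 11105, 11176),
]


def pick_free_gpus(gpus, min_free_mib=4096):
    """Return indices of GPUs with at least min_free_mib MiB unused.

    Hypothetical helper; the threshold is an arbitrary assumption.
    """
    return [idx for idx, used, total in gpus if total - used >= min_free_mib]


free = pick_free_gpus(GPU_MEMORY_MIB)

# Make CUDA enumerate devices in the same PCI bus order that nvidia-smi uses.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Hide every GPU except the free ones. This must happen before TensorFlow
# (or any other CUDA library) is imported in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in free)

print(os.environ["CUDA_VISIBLE_DEVICES"])

# import tensorflow as tf   # import only after the variables are set
# sess = tf.Session()       # the remaining GPU is then renumbered from 0
```

The same effect can be had from the shell by prefixing the command, for example `CUDA_VISIBLE_DEVICES=2 python`, which avoids having to order the imports carefully. Whether this removes the "Invalid device ordinal" error on this particular server is an assumption that would need to be checked there.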