Tensorflow error: "Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)"
2 votes
/ May 29, 2020

I'm trying to run the training code for Tensorflow's Transformer locally on my laptop with a different dataset. Unfortunately, I get a Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0) error. (I believe this is the main error, but I could be wrong.)

What's unusual is that I only get this error after the model has been training for several epochs.

Here is the full log (I've trimmed the top half, which shows the TensorFlow initialization; it contained no errors or warnings):

Train for 7290 steps
Epoch 1/15
2020-05-28 22:57:18.046206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
7290/7290 [==============================] - 1986s 272ms/step - loss: 2.0052 - accuracy: 0.0939
Epoch 2/15
7290/7290 [==============================] - 1971s 270ms/step - loss: 1.6234 - accuracy: 0.1223
Epoch 3/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.5535 - accuracy: 0.1291
Epoch 4/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.5192 - accuracy: 0.1325
Epoch 5/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.4978 - accuracy: 0.1348
Epoch 6/15
7290/7290 [==============================] - 1967s 270ms/step - loss: 1.4825 - accuracy: 0.1364
Epoch 7/15
7290/7290 [==============================] - 1967s 270ms/step - loss: 1.4711 - accuracy: 0.1376
Epoch 8/15
7290/7290 [==============================] - 1966s 270ms/step - loss: 1.4621 - accuracy: 0.1386
Epoch 9/15
 174/7290 [..............................] - ETA: 32:11 - loss: 1.4382 - accuracy: 0.1331
2020-05-29 03:20:43.528885: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at resource_variable_ops.cc:540 : Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
2020-05-29 03:20:43.528953: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[{{node Adam/Adam/update/AssignSubVariableOp}}]]
     [[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
2020-05-29 03:20:43.529025: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[{{node Adam/Adam/update/AssignSubVariableOp}}]]
 175/7290 [..............................] - ETA: 32:14 - loss: 1.4382 - accuracy: 0.1331
Traceback (most recent call last):
  File "model.py", line 114, in <module>
    model.fit(dataset, epochs=EPOCHS)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found:  Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
  (1) Not found:  Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
     [[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_15977]

Errors may have originated from an input operation.
Input Source operations connected to node Adam/Adam/update/AssignSubVariableOp:
 transformer/encoder/embedding/embedding_lookup/11773 (defined at /home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/contextlib.py:112)

Input Source operations connected to node Adam/Adam/update/AssignSubVariableOp:
 transformer/encoder/embedding/embedding_lookup/11773 (defined at /home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/contextlib.py:112)

Function call stack:
distributed_function -> distributed_function

The code I'm working with:
model.py - https://pastebin.com/FVaj1V5W. This is the file that runs the training.

The model definitions are in another script in the same directory: model_definition.py - https://pastebin.com/HyV2RMY2

ENVIRONMENT:
Tensorflow version: 2.1.0 (Tensorflow GPU)
Python version: 3.7.7
GPU: Nvidia GTX 1660 Ti, 6 GB
OS: Ubuntu 20.04 LTS
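
In case anyone wants to compare setups, here is a minimal sketch of how these details could be collected from inside the same Python environment. The helper name collect_environment is hypothetical and not part of my training scripts. It assumes tf.config.list_physical_devices is available, which I believe is the case from TF 2.1 onward.

```python
import platform


def collect_environment():
    """Return a dict describing the Python, OS and TensorFlow setup.

    Hypothetical helper for illustration only; it is not part of model.py.
    """
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    try:
        import tensorflow as tf  # only reported if it can be imported
    except ImportError:
        info["tensorflow"] = None
        info["gpus"] = []
    else:
        info["tensorflow"] = tf.__version__
        # Names of the GPUs that TensorFlow can see (empty list if none).
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

If the "gpus" entry comes back empty on a machine with a GPU, TensorFlow is probably running on the CPU build or cannot see the CUDA libraries.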

...