I'm trying to run TensorFlow's Transformer training code locally on my laptop with a different dataset. Unfortunately, I get the error Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
(I believe this is the main error, but I could be wrong.)
What's unusual is that I only get this error after the model has already been training for several epochs.
Here is the log (I trimmed the top half, which shows the TensorFlow initialization; it contained no errors or warnings):
Train for 7290 steps
Epoch 1/15
2020-05-28 22:57:18.046206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
7290/7290 [==============================] - 1986s 272ms/step - loss: 2.0052 - accuracy: 0.0939
Epoch 2/15
7290/7290 [==============================] - 1971s 270ms/step - loss: 1.6234 - accuracy: 0.1223
Epoch 3/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.5535 - accuracy: 0.1291
Epoch 4/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.5192 - accuracy: 0.1325
Epoch 5/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.4978 - accuracy: 0.1348
Epoch 6/15
7290/7290 [==============================] - 1967s 270ms/step - loss: 1.4825 - accuracy: 0.1364
Epoch 7/15
7290/7290 [==============================] - 1967s 270ms/step - loss: 1.4711 - accuracy: 0.1376
Epoch 8/15
7290/7290 [==============================] - 1966s 270ms/step - loss: 1.4621 - accuracy: 0.1386
Epoch 9/15
174/7290 [..............................] - ETA: 32:11 - loss: 1.4382 - accuracy: 0.1331
2020-05-29 03:20:43.528885: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at resource_variable_ops.cc:540 : Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
2020-05-29 03:20:43.528953: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
[[{{node Adam/Adam/update/AssignSubVariableOp}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
2020-05-29 03:20:43.529025: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
[[{{node Adam/Adam/update/AssignSubVariableOp}}]]
175/7290 [..............................] - ETA: 32:14 - loss: 1.4382 - accuracy: 0.1331
Traceback (most recent call last):
File "model.py", line 114, in <module>
model.fit(dataset, epochs=EPOCHS)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
total_epochs=epochs)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
self.captured_inputs)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
[[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
(1) Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
[[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_15977]
Errors may have originated from an input operation.
Input Source operations connected to node Adam/Adam/update/AssignSubVariableOp:
transformer/encoder/embedding/embedding_lookup/11773 (defined at /home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/contextlib.py:112)
Input Source operations connected to node Adam/Adam/update/AssignSubVariableOp:
transformer/encoder/embedding/embedding_lookup/11773 (defined at /home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/contextlib.py:112)
Function call stack:
distributed_function -> distributed_function
The code I'm working with:
model.py - https://pastebin.com/FVaj1V5W. This is the file that runs the training.
The model definitions are in another script in the same directory: model_definition.py - https://pastebin.com/HyV2RMY2
ENVIRONMENT:
TensorFlow version: 2.1.0 (TensorFlow GPU)
Python version: 3.7.7
GPU: Nvidia GTX 1660 Ti, 6 GB
OS: Ubuntu 20.04 LTS