I am trying to follow the Tensorflow Object Detection tutorial for distributed training of my own model, but otherwise I am using the code exactly as it is in the repository.
I made a couple of changes to the tutorial's steps, most notably using runtime version 1.5 instead of the 1.2 that the tutorial specifies. I don't get any outright errors (that I can see) when running on Google Cloud ML, but the job exits quickly without doing any training.
Here is the command I use to start the training job:
gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://test-bucket/training/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ./config.yaml \
    -- \
    --train_dir=gs://test-bucket/data/ \
    --pipeline_config_path=gs://test-bucket/configs/ssd_inception_v2_coco.config
And this is my config.yaml:
trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_l
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: large_model
And finally, the logs for my job end with:
I worker-replica-6 Clean up finished. worker-replica-6
I worker-replica-7 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-7
I worker-replica-7 Module completed; cleaning up. worker-replica-7
I worker-replica-7 Clean up finished. worker-replica-7
I worker-replica-8 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-8
I worker-replica-8 Module completed; cleaning up. worker-replica-8
I worker-replica-8 Clean up finished. worker-replica-8
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-1 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-1
I worker-replica-1 Module completed; cleaning up. worker-replica-1
I worker-replica-1 Clean up finished. worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-3
I worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-0
I worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-2
I worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-5
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-3
I worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-0
I worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-2
I worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-5
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I Finished tearing down TensorFlow.
I Job failed.
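To make the pattern in that excerpt easier to see, here is a rough Python sketch that tallies the "CreateSession still waiting" messages per replica and per awaited task. It is not part of the tutorial or the repository: the function name `summarize_waiting` is made up for illustration, and the line layout (severity, replica name, message, replica name again) is only assumed from the lines pasted above.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "<severity> <replica> <message> <replica>".
WAITING = re.compile(
    r"^\S+\s+(?P<replica>\S+)\s+"
    r"CreateSession still waiting for response from worker: (?P<target>\S+)"
)


def summarize_waiting(lines):
    """Map each awaited task to a Counter of the replicas waiting on it."""
    waiting = {}
    for line in lines:
        match = WAITING.match(line.strip())
        if match:
            target = match.group("target")
            waiting.setdefault(target, Counter())[match.group("replica")] += 1
    return waiting


# Three lines copied from the log excerpt; only the first two should match.
sample = [
    "I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1",
    "I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7",
    "I worker-replica-1 Clean up finished. worker-replica-1",
]

summary = summarize_waiting(sample)
for target, replicas in summary.items():
    print(target, dict(replicas))
```

In the excerpt above, every such message names /job:master/replica:0/task:0 as the task being waited on.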
As I said, I haven't been able to get anything useful out of the logs. Digging a little further, I do get this error: Master init: Unavailable: Stream removed, but I'm not sure what to do about it. Thanks for any push in the right direction!