Задание cloud-ml отменяется после выполнения нескольких тысяч шагов - PullRequest
0 голосов
/ 01 мая 2019

Я успешно начинаю учебные задания в Google Cloud. Однако после выполнения от 30 минут до 1 часа и нескольких тысяч шагов они заканчиваются неинформативным сообщением об ошибке: «CancelledError: Cancelled».

Я тренируюсь на ~ 30K изображениях, распределенных по 16 файлам записи. У меня нет этой проблемы при обучении меньшему количеству изображений в одном файле (~ 5K или около того)

Вот подробности: Я запускаю работу с помощью этой команды:

gcloud ai-platform jobs submit training my_job_name \
 --runtime-version 1.13 \
--job-dir=gs://image-training/my_job_dir \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-east1 --config object_detection/CLOUDgpu.yaml \
--python-version 3.5 \
-- \
--model_dir gs://image-training/my_job_dir \
--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config 

Вот мой файл YAML:

trainingInput:
  runtimeVersion: "1.13"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

Мой конфигурационный файл ссылается на файлы данных следующим образом:

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://image-training/t0423data/train_*_re.tfrecord"
  }
  num_readers:3
  label_map_path: "gs://image-training/PigCount/label_map.pbtxt"
}

Наконец, полная ошибка:

The replica worker 6 exited with a non-zero status of 1. Termination reason:
Error. Traceback (most recent call last): [...] saving_listeners) File
"/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py",
line 1407, in _train_with_estimator_spec _, loss =
mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 676, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1171, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1270, in run raise six.reraise(*original_exc_info) File
"/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise
value File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1255, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1327, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1091, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 929, in run run_metadata_ptr) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1152, in _run feed_dict_tensor, options, run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1328, in _do_run run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1348, in _do_call raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled To find out
more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=226138759195&resource=ml_job%2Fjob_id%2Ft_05_01_big_data1&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22t_05_01_big_data1%22

В журналах Replica 6 показаны следующие ошибки:

command '['python3', '-m', 'object_detection.model_main', '--model_dir', 'gs://image-training/my_job_dir', '--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config', '--job-dir', 'gs://image-training/my_job_dir']' returned non-zero exit status 1

А перед этим:

worker-replica-6
Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call return fn(*args) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.CancelledError: Cancelled During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "__main__", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 109, in <module> tf.app.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 105, in main tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0]) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate return executor.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 638, in run getattr(self, task_to_run)() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 648, in run_worker return self._start_distributed_training() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 789, in _start_distributed_training saving_listeners=saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run raise six.reraise(*original_exc_info) File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise value File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.CancelledError: Cancelled

Есть идеи, как я могу сохранить эти задания от сбоев?

1 Ответ

0 голосов
/ 02 мая 2019

Кажется, я решил эту проблему, увеличив количество и мощность компьютеров, которые я использую. Я изменил файл YAML на это, и он без проблем работает на 50000 шагов. Намного дороже, но, по крайней мере, это работает!:

trainingInput:
  scaleTier: CUSTOM
  # Configure a master worker with 4 K80 GPUs
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 9 workers, each with 4 K80 GPUs
  workerCount: 9
  workerType: n1-highcpu-16
  workerConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 3 parameter servers with no GPUs
  parameterServerCount: 3
  parameterServerType: n1-highmem-8

См. Эту страницу для полного объяснения: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus

Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...