Ошибка при запуске API обнаружения объектов в Google ML - PullRequest
0 голосов
/ 24 мая 2018

У меня проблема с получением задания для запуска в Google ML для переподготовки API обнаружения объектов SSD Mobilenet с использованием моих собственных данных обучения.Обратите внимание, что я могу успешно тренироваться на своей локальной машине.Вот подробности.Я пробовал разные версии tenorflow для файлов gcloud (и соответствующих cloud.yaml), и все они потерпели неудачу.Я использую локально версию 1.8 тензорного потока с API обнаружения объектов (+ slim).

ПРИМЕЧАНИЕ. Попытка переобучить модель сети SSD_Mobile, которую я скопировал в свое хранилище Google CLoud и изначально находился в object_detection \ ssd_mobilenet_v1_coco_2017_11_17 \ model.ckpt

TensorFlow версия (используйте команду ниже): попыталсяМногочисленная версия, включая 1.8 (не поддерживает Google ML 1.8, и эта версия используется локально для создания обучающих файлов TFRecord)

ПРИМЕЧАНИЕ: попытка запустить учебный пример (обучающий локально) в Google ML.Выполните запрос задания с помощью инструмента gcloud.Следовали инструкциям на https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md. КОМАНДА, выполненная по тензорному потоку / модели / исследования

gcloud ml-engine jobs submit training grewe_object_detection_6 --runtime-version 1.8 --job-dir=gs://BLAHBLAH-storage/Train --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz --module-name object_detection.train --region us-central1 --config object_detection/samples/cloud/cloud.yml -- --

Опишите проблему См. Ошибку ниже.Попытался изменить версию используемого tenorflow (обратите внимание, что локально при успешном запуске с использованием 1.8, так что полагайте, что именно это используется для упаковки TFRecord, он должен работать в Google ML) - поэтому пытался обновить предоставленный cloud.yaml (пробовал для версии 1.2, 1.4, 1.6 и 1.8, а также пытался обновить setup.py в моделях / исследованиях, и ничего не работает.

В последний раз я попробовал следующее для моего файла cloud.yaml

trainingInput: runtimeVersion: "1.8" scaleTier: CUSTOM masterType: standard_gpu workerCount: 5 workerType: standard_gpu parameterServerCount: 3 parameterServerType: standard

Я пыталсяпоследнее для моего setup.py

** _ `" "Сценарий установки для object_detection." ""

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']

setup(
name='object_detection',
version='0.1',
install_requires=REQUIRED_PACKAGES,
include_package_data=True,
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
)`_**

Это ошибка из журнала в консоли Google Cloud ML ОШИБКАсообщение:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session self.stop(close_summary_writer=close_summary_writer) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop ignore_live_threads=ignore_live_threads) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(*self._exc_info_to_raise) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session start_standard_services=start_standard_services) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session init_feed_dict=self._init_feed_dict, init_fn=self._init_fn) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session config=config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 207, in _restore_checkpoint saver.restore(sess, ckpt.model_checkpoint_path) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1802, in restore {self.saver_def.filename_tensor_name: save_path}) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call raise type(e)(node_def, op, message) UnavailableError: OS Error The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop ignore_live_threads=ignore_live_threads) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(*self._exc_info_to_raise) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session start_standard_services=start_standard_services) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session max_wait_secs=max_wait_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session sess) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op sess.run(self._local_init_op) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call raise type(e)(node_def, op, message) UnavailableError: OS Error [[Node: init_ops/init_all_tables_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=6383848822399600260, tensor_name="edge_29_init_ops/init_all_tables", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:GPU:0"]()]] The replica worker 1 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train master, start_standard_services=False, config=session_config) as sess: File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session self.stop(close_summary_writer=close_summary_writer) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop ignore_live_threads=ignore_live_threads) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(*self._exc_info_to_raise) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session start_standard_services=start_standard_services) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session max_wait_secs=max_wait_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session sess) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op sess.run(self._local_init_op) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call raise type(e)(node_def, op, message) UnavailableError: OS Error The replica worker 2 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train master, start_standard_services=False, config=session_config) as sess: File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session self.stop(close_summary_writer=close_summary_writer) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop ignore_live_threads=ignore_live_threads) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(*self._exc_info_to_raise) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session start_standard_services=start_standard_services) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session max_wait_secs=max_wait_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session sess) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op sess.run(self._local_init_op) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call raise type(e)(node_def, op, message) UnavailableError: OS Error The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train master, start_standard_services=False, config=session_config) as sess: File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session self.stop(close_summary_writer=close_summary_writer) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop ignore_live_threads=ignore_live_threads) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(*self._exc_info_to_raise) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session start_standard_services=start_standard_services) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session max_wait_secs=max_wait_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session sess) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op sess.run(self._local_init_op) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run run_metadata_ptr) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call raise type(e)(node_def, op, message) UnavailableError: OS Error To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=36123659232&resource=ml_job%2Fjob_id%2Fgrewe_object_detection_8&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22grewe_object_detection_8%22

Ответы [ 2 ]

0 голосов
/ 11 июля 2018

Проблема может быть решена с помощью флага --runtime-version 1.2, как упомянуто @ iwz1992, и с помощью Tensorflow и Jupyter в файле setup.py

0 голосов
/ 24 мая 2018

Это известная проблема.Команда обнаружения объектов работает над новым двоичным файлом, чтобы исправить это.В то же время вы можете использовать среду исполнения 1.2 на Cloud ML Engine, которая должна работать.

...