I'm having trouble getting a job to run on Google ML to retrain the Object Detection API's SSD Mobilenet with my own training data. Note that I can train successfully on my local machine. Here are the details. I have tried different TensorFlow versions for the gcloud files (and the corresponding cloud.yaml), and all of them failed. Locally I use TensorFlow 1.8 with the Object Detection API (+ slim).
NOTE: I am trying to retrain the SSD_Mobile network model, which I copied to my Google Cloud Storage and which was originally located at object_detection\ssd_mobilenet_v1_coco_2017_11_17\model.ckpt
TensorFlow version (use command below): tried multiple versions, including 1.8 (does Google ML not support 1.8? That is the version I use locally to create the training TFRecord files)
NOTE: I am trying to run the tutorial example (which trains locally) on Google ML. I submit the job request with the gcloud tool, following the instructions at https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md. COMMAND, executed from tensorflow/models/research:
gcloud ml-engine jobs submit training grewe_object_detection_6 --runtime-version 1.8 --job-dir=gs://BLAHBLAH-storage/Train --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz --module-name object_detection.train --region us-central1 --config object_detection/samples/cloud/cloud.yml -- --
Describe the problem: see the error below. I tried changing the TensorFlow version used (note that training succeeds locally with 1.8, and since that is the version used to package the TFRecords, I assumed it should work on Google ML). So I tried updating the provided cloud.yaml (for versions 1.2, 1.4, 1.6 and 1.8) and also tried updating setup.py in models/research, and nothing works.
Most recently I tried the following for my cloud.yaml file:
trainingInput:
  runtimeVersion: "1.8"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
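Since the runtime version is set in two places here (the `--runtime-version` flag and `runtimeVersion` in cloud.yaml), one cheap sanity check before submitting is to confirm that the two agree. The sketch below is only an illustration under that assumption: the helper names are hypothetical, the config text and command are copied from this post (with the truncated trailing arguments left out), and it uses a regular expression rather than a real YAML parser.

```python
import re
import shlex


def yaml_runtime_version(yaml_text):
    """Return the runtimeVersion value found in the config text, or None.

    Hypothetical helper: a regex lookup, not a full YAML parser.
    """
    match = re.search(r'runtimeVersion:\s*["\']?([\d.]+)', yaml_text)
    return match.group(1) if match else None


def cli_runtime_version(command):
    """Return the value passed to --runtime-version in a command line, or None."""
    args = shlex.split(command)
    for i, arg in enumerate(args):
        if arg == "--runtime-version" and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--runtime-version="):
            return arg.split("=", 1)[1]
    return None


def versions_agree(yaml_text, command):
    """True only if both versions are present and identical."""
    from_yaml = yaml_runtime_version(yaml_text)
    from_cli = cli_runtime_version(command)
    return from_yaml is not None and from_yaml == from_cli


# Config and command as quoted in this post.
CLOUD_YAML = """\
trainingInput:
  runtimeVersion: "1.8"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
"""

COMMAND = (
    "gcloud ml-engine jobs submit training grewe_object_detection_6 "
    "--runtime-version 1.8 --job-dir=gs://BLAHBLAH-storage/Train "
    "--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz "
    "--module-name object_detection.train --region us-central1 "
    "--config object_detection/samples/cloud/cloud.yml"
)

if __name__ == "__main__":
    print("versions agree:", versions_agree(CLOUD_YAML, COMMAND))
```

This only rules out a mismatch between the two settings; it says nothing about whether a given runtime version is actually available on the service.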
And this is the latest version of my setup.py that I tried:
"""Setup script for object_detection."""

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
)
This is the error from the log in the Google Cloud ML console. ERROR message:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 207, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
  [[Node: init_ops/init_all_tables_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=6383848822399600260, tensor_name="edge_29_init_ops/init_all_tables", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:GPU:0"]()]]

The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=36123659232&resource=ml_job%2Fjob_id%2Fgrewe_object_detection_8&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22grewe_object_detection_8%22
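
Because the console message concatenates every replica's traceback, it can help to reduce it to one line per replica before comparing runs. The following is a rough sketch, not part of any Google or TensorFlow tooling: the function names are hypothetical, the sample text is an abbreviated excerpt of the message above, and the parsing assumes the "The replica <type> <index> exited with a non-zero status of <n>." wording shown in this log.

```python
import re

# Matches the per-replica header line as it appears in the message above.
REPLICA_RE = re.compile(
    r"The\s+replica\s+(\w+)\s+(\d+)\s+exited with a non-zero status of (\d+)\."
)
# Matches exception names such as "UnavailableError:"; the bare word
# "Error." in "Termination reason: Error." does not match.
ERROR_NAME_RE = re.compile(r"\b([A-Z]\w*Error)\b:")


def summarize_failures(log_text):
    """Return (replica_type, index, exit_status, last_error_name) per replica."""
    headers = list(REPLICA_RE.finditer(log_text))
    summary = []
    for pos, header in enumerate(headers):
        # A replica's section runs until the next replica header (or the end).
        end = headers[pos + 1].start() if pos + 1 < len(headers) else len(log_text)
        section = log_text[header.end():end]
        names = ERROR_NAME_RE.findall(section)
        summary.append((
            header.group(1),
            int(header.group(2)),
            int(header.group(3)),
            names[-1] if names else None,
        ))
    return summary


def missing_workers(summary, worker_count):
    """Worker indices below worker_count that never appear in the summary."""
    seen = {index for kind, index, _, _ in summary if kind == "worker"}
    return [i for i in range(worker_count) if i not in seen]


# Abbreviated excerpt of the message above (tracebacks elided).
SAMPLE_LOG = (
    "The replica master 0 exited with a non-zero status of 1. "
    "Termination reason: Error. Traceback (most recent call last): [...] "
    "UnavailableError: OS Error "
    "The replica worker 0 exited with a non-zero status of 1. "
    "Termination reason: Error. Traceback (most recent call last): [...] "
    "UnavailableError: OS Error "
)

if __name__ == "__main__":
    for row in summarize_failures(SAMPLE_LOG):
        print(row)
```

Run against the full message with `worker_count=5` (the `workerCount` from cloud.yaml), `missing_workers` would flag worker 3, which does not appear in the output above; its absence from the message does not by itself show that worker 3 ran successfully.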