How do I set up MirroredStrategy in TensorFlow on AWS EC2 to retrain an object detection model?
October 30, 2018

I built TensorFlow 1.11 from source. I can retrain Mobilenet SSD v2 on a single GPU. Now I am trying to use MirroredStrategy across the 4 GPUs of an AWS g2.8xlarge EC2 instance.

I added MirroredStrategy to the RunConfig in model_main.py using the following code:

def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')

  # Create the mirrored (synchronous multi-GPU) distribution strategy.
  distribution = tf.contrib.distribute.MirroredStrategy()
  # Hand the strategy to the Estimator through RunConfig.
  config = tf.estimator.RunConfig(
      model_dir=FLAGS.model_dir, train_distribute=distribution)

Training fails with an AssertionError. The full log is below. Can anyone help? Thanks.

ubuntu@ip-172-31-30-151:~/tensorflow/models/research$ python object_detection/model_main.py \
>     --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
>     --model_dir=${MODEL_DIR} \
>     --num_train_steps=${NUM_TRAIN_STEPS} \
>     --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
>     --alsologtostderr
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/ubuntu/tensorflow/models/research/object_detection/utils/visualization_utils.py:27: UserWarning: 
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'Qt5Agg' by the following code:
  File "object_detection/model_main.py", line 26, in <module>
    from object_detection import model_lib
  File "/home/ubuntu/tensorflow/models/research/object_detection/model_lib.py", line 27, in <module>
    from object_detection import eval_util
  File "/home/ubuntu/tensorflow/models/research/object_detection/eval_util.py", line 27, in <module>
    from object_detection.metrics import coco_evaluation
  File "/home/ubuntu/tensorflow/models/research/object_detection/metrics/coco_evaluation.py", line 20, in <module>
    from object_detection.metrics import coco_tools
  File "/home/ubuntu/tensorflow/models/research/object_detection/metrics/coco_tools.py", line 47, in <module>
    from pycocotools import coco
  File "/home/ubuntu/tensorflow/models/research/pycocotools/coco.py", line 49, in <module>
    import matplotlib.pyplot as plt
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 71, in <module>
    from matplotlib.backends import pylab_setup
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/backends/__init__.py", line 16, in <module>
    line for line in traceback.format_stack()


  import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
2018-10-24 22:06:33.449647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-24 22:06:33.450249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-10-24 22:06:33.479913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-24 22:06:33.480481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 1 with properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:04.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-10-24 22:06:33.512979: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-24 22:06:33.513529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 2 with properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:05.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-10-24 22:06:33.546408: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-24 22:06:33.546944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 3 with properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:06.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-10-24 22:06:33.547089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0, 1, 2, 3
2018-10-24 22:06:34.907497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-24 22:06:34.907580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 1 2 3 
2018-10-24 22:06:34.907607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N N N N 
2018-10-24 22:06:34.907623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1:   N N N N 
2018-10-24 22:06:34.907636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 2:   N N N N 
2018-10-24 22:06:34.907650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 3:   N N N N 
2018-10-24 22:06:34.908235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3644 MB memory) -> physical GPU (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0)
2018-10-24 22:06:34.946961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 3644 MB memory) -> physical GPU (device: 1, name: GRID K520, pci bus id: 0000:00:04.0, compute capability: 3.0)
2018-10-24 22:06:34.981965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 3644 MB memory) -> physical GPU (device: 2, name: GRID K520, pci bus id: 0000:00:05.0, compute capability: 3.0)
2018-10-24 22:06:35.020808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 3644 MB memory) -> physical GPU (device: 3, name: GRID K520, pci bus id: 0000:00:06.0, compute capability: 3.0)
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
W1024 22:06:35.072655 139904432719616 tf_logging.py:125] Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
W1024 22:06:35.073021 139904432719616 tf_logging.py:125] Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7f3d960158c8>) includes params argument, but params are not passed to Estimator.
W1024 22:06:35.073723 139904432719616 tf_logging.py:125] Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7f3d960158c8>) includes params argument, but params are not passed to Estimator.
2018-10-24 22:06:35.077395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0, 1, 2, 3
2018-10-24 22:06:35.077580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-24 22:06:35.077606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 1 2 3 
2018-10-24 22:06:35.077626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N N N N 
2018-10-24 22:06:35.077649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1:   N N N N 
2018-10-24 22:06:35.077661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 2:   N N N N 
2018-10-24 22:06:35.077676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 3:   N N N N 
2018-10-24 22:06:35.077982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:0 with 3644 MB memory) -> physical GPU (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0)
2018-10-24 22:06:35.078267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:1 with 3644 MB memory) -> physical GPU (device: 1, name: GRID K520, pci bus id: 0000:00:04.0, compute capability: 3.0)
2018-10-24 22:06:35.078644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:2 with 3644 MB memory) -> physical GPU (device: 2, name: GRID K520, pci bus id: 0000:00:05.0, compute capability: 3.0)
2018-10-24 22:06:35.079453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:3 with 3644 MB memory) -> physical GPU (device: 3, name: GRID K520, pci bus id: 0000:00:06.0, compute capability: 3.0)
WARNING:tensorflow:num_readers has been reduced to 2 to match input file shards.
W1024 22:06:35.100583 139904432719616 tf_logging.py:125] num_readers has been reduced to 2 to match input file shards.
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/core/preprocessor.py:1207: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
W1024 22:06:35.548344 139904432719616 tf_logging.py:125] From /home/ubuntu/tensorflow/models/research/object_detection/core/preprocessor.py:1207: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/builders/dataset_builder.py:148: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
W1024 22:06:37.296217 139904432719616 tf_logging.py:125] From /home/ubuntu/tensorflow/models/research/object_detection/builders/dataset_builder.py:148: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
Traceback (most recent call last):
  File "object_detection/model_main.py", line 111, in <module>
    tf.app.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/model_main.py", line 107, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1179, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1290, in _train_model_distributed
    self.config)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 718, in call_for_each_tower
    return self._call_for_each_tower(fn, *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 552, in _call_for_each_tower
    return _call_for_each_tower(self, fn, *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
    coord.join(threads)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 783, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/ubuntu/tensorflow/models/research/object_detection/model_lib.py", line 268, in model_fn
    features[fields.InputDataFields.true_image_shape])
  File "/home/ubuntu/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 501, in predict
    preprocessed_inputs)
  File "/home/ubuntu/tensorflow/models/research/object_detection/models/ssd_mobilenet_v2_feature_extractor.py", line 134, in extract_features
    image_features=image_features)
  File "/home/ubuntu/tensorflow/models/research/object_detection/models/feature_map_generators.py", line 379, in multi_resolution_feature_maps
    scope=layer_name + '_depthwise')
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 2809, in separable_convolution2d
    collections=weights_collections)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 350, in model_variable
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 277, in variable
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1484, in get_variable
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1234, in get_variable
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 521, in get_variable
    return custom_getter(**custom_getter_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1922, in wrapped_custom_getter
    *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1749, in layer_variable_getter
    return _model_variable_getter(getter, *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1740, in _model_variable_getter
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 350, in model_variable
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 277, in variable
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 461, in disable_partitioned_variables
    return getter(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 492, in _true_getter
    aggregation=aggregation)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 938, in _get_single_variable
    with ops.colocate_with(v):
  File "/home/ubuntu/anaconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4092, in _colocate_with_for_gradient
    with self.colocate_with(op, ignore_existing):
  File "/home/ubuntu/anaconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4144, in colocate_with
    op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1305, in internal_convert_to_tensor_or_indexed_slices
    value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1144, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 447, in _tensor_conversion_mirrored
    assert not as_ref
AssertionError
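
Since the log is long, here is a minimal sketch for pulling out only the innermost frames of the traceback, which is where the assertion fires. It uses only the Python standard library. The names `FRAME_RE`, `last_frames` and `SAMPLE` are hypothetical helpers made up for illustration, and `SAMPLE` is a shortened excerpt of the traceback above, not the full log.

```python
import re

# Matches the 'File "...", line N, in func' lines of a Python traceback.
FRAME_RE = re.compile(
    r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def last_frames(log_text, count=3):
    """Return the last `count` frames of the final traceback in `log_text`.

    Each frame is a (path, line_number, function_name) tuple.
    """
    lines = log_text.splitlines()
    # Skip everything before the last 'Traceback' header, so that
    # 'File ...' lines printed by earlier warnings are ignored.
    start = 0
    for i, line in enumerate(lines):
        if line.startswith("Traceback (most recent call last):"):
            start = i
    frames = []
    for line in lines[start:]:
        match = FRAME_RE.match(line)
        if match:
            frames.append((match.group("path"),
                           int(match.group("line")),
                           match.group("func")))
    return frames[-count:] if count > 0 else []


# Shortened excerpt of the traceback above (assumption: the middle frames
# are dropped here only to keep the example small).
SAMPLE = '''Traceback (most recent call last):
  File "object_detection/model_main.py", line 111, in <module>
    tf.app.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1144, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 447, in _tensor_conversion_mirrored
    assert not as_ref
AssertionError
'''

if __name__ == "__main__":
    for path, line, func in last_frames(SAMPLE, count=2):
        print("{}:{} in {}".format(path, line, func))
```

Run against the full log, the last frame points at `_tensor_conversion_mirrored` in `tensorflow/contrib/distribute/python/values.py`, reached from `colocate_with` while `get_variable` creates the `separable_convolution2d` weights.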