Training Mask R-CNN on Google TPU with my own data: "Socket closed" error
May 23, 2019

I followed the Google tutorial to train Mask R-CNN on the COCO data without any problems (https://cloud.google.com/tpu/docs/tutorials/mask-rcnn).

Then I followed the same instructions again, this time with my own data.

My dataset has about 3000 samples.

This is how I start the training script:

python ~/tpu/models/official/mask_rcnn/mask_rcnn_main.py \
  --use_tpu=True \
  --tpu="tputputpu" \
  --model_dir="gs://my/path/mask-rcnn-model" \
  --config="damconfig.yaml" \
  --mode="train"

This is my config:

num_classes: 9
backbone: 'resnet50'
use_bfloat16: True
train_batch_size: 16
eval_batch_size: 8
training_file_pattern: gs://my/path/TFRecords/train-*
validation_file_pattern: gs://my/path/TFRecords/val-*
val_json_file: gs://my/path/val_annotations.json
total_steps: 3000
num_steps_per_eval: 150
eval_samples: 311
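
For orientation, the schedule implied by this config can be worked out with a little arithmetic. This is only a sketch: it assumes the "about 3000 samples" are all training samples, and the function name `schedule_summary` is made up for illustration.

```python
import math

# Values taken from the config and the question text.
num_samples = 3000          # "about 3000 samples"; assumed to be the training set size
train_batch_size = 16
total_steps = 3000
num_steps_per_eval = 150


def schedule_summary(num_samples, batch_size, total_steps, steps_per_eval):
    """Return (steps per epoch, epochs covered, eval points) for a step-based schedule."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    epochs = total_steps * batch_size / num_samples
    # Only meaningful if evaluation is actually run every steps_per_eval steps.
    num_evals = total_steps // steps_per_eval
    return steps_per_epoch, epochs, num_evals


summary = schedule_summary(num_samples, train_batch_size, total_steps, num_steps_per_eval)
print(summary)
```

Under these assumptions, 3000 steps at batch size 16 pass over the data roughly 16 times.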

I get the following error when I start training:

INFO:tensorflow:Enqueue next (2500) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (2500) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from infeed: assertion failed: [103]
[[{{node parser/Assert_2/Assert}}]]
[[node input_pipeline_task0/while/IteratorGetNext_4 (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
...
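
One thing I could check, as a guess rather than something the log confirms: the assertion fires in the input parser and reports the value 103, which might be a per-example quantity (for instance the number of annotated objects in one image) exceeding some limit. The sketch below assumes I still have the COCO-format annotation JSON that the training TFRecords were built from, and the limit of 100 is a hypothetical placeholder, not a value read from the model code. The helper names are made up for illustration.

```python
import json
from collections import Counter


def load_annotations(path):
    """Load a COCO-format annotation file from a local path (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


def images_over_limit(coco, limit):
    """Return {image_id: annotation_count} for images with more than `limit` annotations."""
    counts = Counter(ann["image_id"] for ann in coco["annotations"])
    return {image_id: n for image_id, n in counts.items() if n > limit}


# Tiny in-memory stand-in for a real annotation file:
# image 1 has 103 annotations, image 2 has 5.
sample = {"annotations": [{"image_id": 1}] * 103 + [{"image_id": 2}] * 5}

# limit=100 is an assumed threshold for illustration only.
over = images_over_limit(sample, limit=100)
print(over)
```

If a real annotation file reports an image with exactly 103 objects, that would at least make this reading of the assertion plausible.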