Failed to deploy a saved model in AWS SageMaker
0 votes
/ April 18, 2020

I am following this article https://machinelearningmastery.com/how-to-perform-object-detection-with-yolov3-in-keras/ to deploy YOLOv3 inside AWS SageMaker. I have model.weights containing the weights, model.json containing the model structure, and model.h5 containing the model structure plus the weights. When I convert these files to protobuf format so that I can package and deploy them to SageMaker, this error appears:

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-12-10-57-05-567: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
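
For reference, DescribeEndpoint returns the same failure reason; a small boto3 sketch (endpoint name copied from the error above) to pull the status programmatically:

import boto3

sm_client = boto3.client('sagemaker')
desc = sm_client.describe_endpoint(EndpointName='sagemaker-tensorflow-2020-04-12-10-57-05-567')
print(desc['EndpointStatus'], desc.get('FailureReason'))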

Here is my code:


import tensorflow
tensorflow.__version__

Output:
'1.7.0'

import boto3, re
from sagemaker import get_execution_role

role = get_execution_role()

from tensorflow.keras.models import model_from_json

!ls keras_model/

import tensorflow as tf

json_file = open('/home/ec2-user/SageMaker/keras_model/' + 'model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
# model.json references the initializer by the name "GlorotUniform", so map it explicitly
loaded_model = model_from_json(loaded_model_json, custom_objects={"GlorotUniform": tf.keras.initializers.glorot_uniform})

loaded_model.load_weights('/home/ec2-user/SageMaker/keras_model/model_weights.h5')
print("Loaded model from disk ")

from tensorflow.python.saved_model import builder
from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
from tensorflow.python.saved_model import tag_constants

# This directory structure must be followed exactly. Do not change it.
model_version = '1'
export_dir = 'export/Servo/' + model_version

# Build the protocol-buffer SavedModel at export_dir
build = builder.SavedModelBuilder(export_dir)

print(loaded_model.inputs)  
print([t for t in loaded_model.outputs])

Output:
[<tf.Tensor 'input_1:0' shape=(?, ?, ?, 3) dtype=float32>]
[<tf.Tensor 'conv_81/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>, <tf.Tensor 'conv_93/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>, <tf.Tensor 'conv_105/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>]

tf.convert_to_tensor(loaded_model.output)

Output:
<tf.Tensor 'packed:0' shape=(3, ?, ?, ?, 255) dtype=float32>

signature = predict_signature_def(inputs={"input_image": loaded_model.input}, 
                                  outputs={t.name: t for t in loaded_model.outputs})

from tensorflow.keras import backend as K
# Use the Keras session that already holds the model's variables
with K.get_session() as sess:
    build.add_meta_graph_and_variables(sess=sess, tags=[tag_constants.SERVING], 
                                       signature_def_map={"serving_default":signature} )
    build.save()

!ls export/Servo/1/variables/

Output:
variables.data-00000-of-00001  variables.index
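
To confirm that the export actually contains the serving_default signature, saved_model_cli (which ships with TensorFlow) can inspect it:

!saved_model_cli show --dir export/Servo/1 --tag_set serve --signature_def serving_default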

import tarfile 
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export', recursive=True)
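
To double-check that the archive layout is what the serving container expects (it loads the SavedModel from /opt/ml/model/export/Servo, per the log below), list the tarball members:

with tarfile.open('model.tar.gz') as archive:
    print(archive.getnames())

# Should include entries like:
# export/Servo/1/saved_model.pb
# export/Servo/1/variables/variables.index
# export/Servo/1/variables/variables.data-00000-of-00001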

import sagemaker

sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
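
upload_data returns the S3 URI it uploaded to, so printing it is a cheap sanity check against the model_data used below:

print(inputs)  # should equal 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz'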

!touch train.py

from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  entry_point='train.py')
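
Side question: should framework_version be pinned here? As far as I know, TensorFlowModel accepts it and it selects the TensorFlow version inside the serving container; the version string below is only a placeholder:

pinned_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                               role=role,
                               entry_point='train.py',
                               framework_version='1.15.2')  # placeholder: use the TF version that exported the graph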

%%time
predictor = sagemaker_model.deploy(initial_instance_count=1, 
                                   instance_type='ml.t2.large')

Error:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait, data_capture_config)
    478                 kms_key=kms_key,
    479                 wait=wait,
--> 480                 data_capture_config_dict=data_capture_config_dict,
    481             )
    482 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
   2849 
   2850             self.sagemaker_client.create_endpoint_config(**config_options)
-> 2851         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   2852 
   2853     def expand_role(self, role):

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   2381         )
   2382         if wait:
-> 2383             self.wait_for_endpoint(endpoint_name)
   2384         return endpoint_name
   2385 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   2638                 ),
   2639                 allowed_statuses=["InService"],
-> 2640                 actual_status=status,
   2641             )
   2642         return desc

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-12-10-57-05-567: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
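
A failed endpoint keeps its name reserved until it is deleted, so I clean it up with boto3 before retrying (endpoint name taken from the error):

import boto3

boto3.client('sagemaker').delete_endpoint(EndpointName='sagemaker-tensorflow-2020-04-12-10-57-05-567')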

I think the error is related to the difference between the tensor shapes of loaded_model.inputs and loaded_model.outputs (printed above), but I am still not sure what the 3 and the 255 in those shapes represent. Any help would be appreciated.


The CloudWatch log:

2020-04-12 11:01:33,439 INFO - root - running container entrypoint
2020-04-12 11:01:33,440 INFO - root - starting serve task
2020-04-12 11:01:33,440 INFO - container_support.serving - reading config
Downloading s3://sagemaker-us-east-1-611475884433/sagemaker-tensorflow-2020-04-12-10-57-05-375/sourcedir.tar.gz to /tmp/script.tar.gz
2020-04-12 11:01:33,828 INFO - container_support.serving - importing user module
2020-04-12 11:01:33,828 INFO - container_support.serving - loading framework-specific dependencies
2020-04-12 11:01:35,795 INFO - container_support.serving - starting nginx
2020-04-12 11:01:35,797 INFO - container_support.serving - nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}

2020-04-12 11:01:35,815 INFO - container_support.serving - starting gunicorn
2020-04-12 11:01:35,820 INFO - container_support.serving - inference server started. waiting on processes: set([24, 23])
2020-04-12 11:01:35.904746: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config:  model_name: generic_model model_base_path: /opt/ml/model/export/Servo
2020-04-12 11:01:35.905995: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-04-12 11:01:35.906148: I tensorflow_serving/model_servers/server_core.cc:517]  (Re-)adding model: generic_model
2020-04-12 11:01:35.907173: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: generic_model version: 1}
2020-04-12 11:01:35.907349: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: generic_model version: 1}
2020-04-12 11:01:35.907422: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: generic_model version: 1}
2020-04-12 11:01:35.907578: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /opt/ml/model/export/Servo/1
2020-04-12 11:01:35.907687: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/export/Servo/1
2020-04-12 11:01:35.939232: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-04-12 11:01:35.980215: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[2020-04-12 11:01:36 +0000] [24] [INFO] Starting gunicorn 19.9.0
2020-04-12 11:01:36.048327: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: fail. Took 140502 microseconds.
2020-04-12 11:01:36.048617: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: generic_model version: 1} failed: Not found: Op type not registered 'FusedBatchNormV3' in binary running on model.aws.local. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
...
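
The last line above looks like the actual root cause: the serving binary does not have the FusedBatchNormV3 op registered, which as far as I can tell only exists in TensorFlow releases newer than the container's. A sketch to enumerate the op types baked into the exported graph, to compare against what the serving TF version supports:

from tensorflow.core.protobuf import saved_model_pb2

saved_model = saved_model_pb2.SavedModel()
with open('export/Servo/1/saved_model.pb', 'rb') as f:
    saved_model.ParseFromString(f.read())

# Collect every op type used by any node in any meta graph
ops = {node.op for mg in saved_model.meta_graphs for node in mg.graph_def.node}
print(sorted(ops))  # 'FusedBatchNormV3' here means the exporting TF is newer than the serving TF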