`capture_tpu_profile` не может получить доступ к TPU - PullRequest
0 голосов
/ 31 января 2020

У меня есть следующий TPU:

$ gcloud compute tpus list 
NAME         ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINTS  NETWORK  RANGE          STATUS
daniels-tpu  us-central1-a  v3-8              10.240.1.10:8470   default  10.240.1.8/29  READY

Но он не доступен через capture_tpu_profile:

$  capture_tpu_profile --tpu=daniels-tpu 
TensorFlow version 1.15.2 detected
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0131 10:15:16.553251 4571966912 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 938, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 752, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/capture_tpu_profile", line 8, in <module>
    sys.exit(run_main())
  File "/usr/local/lib/python3.7/site-packages/cloud_tpu_profiler/main.py", line 85, in run_main
    tf.compat.v1.app.run(main)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.7/site-packages/cloud_tpu_profiler/main.py", line 105, in main
    [FLAGS.tpu], zone=FLAGS.tpu_zone, project=FLAGS.gcp_project))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 330, in __init__
    self._request_compute_metadata('project/project-id'))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 124, in _request_compute_metadata
    resp = urlopen(req)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1347, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1321, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Ошибка не исчезла go даже после использования --tpu_zone=us-central1-a флаг.

1 Ответ

0 голосов
/ 31 января 2020

Когда вы используете capture_tpu_profile для захвата профиля, файл .tracetable сохраняется в вашем хранилище Google Cloud Storage, если это местоположение не указано, то go нигде не будет.

Можете ли вы попробуйте добавить путь к вашей корзине с помощью следующей команды:

capture_tpu_profile --tpu=tpu-name --logdir=${MODEL_DIR}

- logdir = $ {MODEL_DIR} - это место в облачном хранилище, где хранятся ваша модель и контрольные точки.

Все детали, упомянутые в этой документации

...