When I run this code on a Kaggle TPU, I get the error below. Why? Thanks.
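import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Test whether spawn works: each process creates a tensor on its XLA device.
def train(rank, flags):
    torch.manual_seed(1234)
    device = xm.xla_device()
    t = torch.randn((2, 2), device=device)
    print("Process", rank, "is using", xm.xla_real_devices([str(device)])[0])

flags = {}
xmp.spawn(train, args=(flags,), nprocs=8, start_method='fork')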
The code above is taken from a Colab tutorial, where it works fine. I don't know why this error occurs on Kaggle.
This is the error log:
Exception in device=TPU:6: tensorflow/compiler/xla/xla_client/mesh_service.cc:331 : Failed to retrieve mesh configuration: Connection reset by peer (14)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    _setup_replication()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 315, in _setup_replication
    device = xm.xla_device()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
    devkind=[devkind] if devkind is not None else None)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:331 : Failed to retrieve mesh configuration: Connection reset by peer (14)
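
In case it helps with diagnosis, here is a minimal sketch for printing the XLA-related environment variables so the Colab and Kaggle sessions can be compared. The variable names are my assumption about what the XRT runtime reads, and I have not confirmed that any of them actually differs between the two platforms. `xla_env_snapshot` is a hypothetical helper name, not part of torch_xla.

```python
import os

# Assumed to be relevant to the XRT runtime; adjust the list as needed.
XLA_ENV_VARS = ("XRT_TPU_CONFIG", "XRT_MESH_SERVICE_ADDRESS", "TPU_NAME")


def xla_env_snapshot(environ=None):
    """Return {name: value or None} for the variables listed above."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in XLA_ENV_VARS}


if __name__ == "__main__":
    # Run this in both environments before calling xmp.spawn and compare.
    for name, value in xla_env_snapshot().items():
        print(f"{name} = {value!r}")
```

A variable that is set in Colab but missing or different on Kaggle would be a place to start looking, though the mesh configuration error may have another cause.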