Horovod ValueError: Process number should not be larger than total available slots
July 10, 2020

When I run horovodrun -np 4 -H 192.168.1.191:2 192.168.1.119:2 python Convolution\ Network\ MNIST.py, I keep getting the same error. I installed Horovod with the GPU variables set.

(base) vinhdiesal@vinhdiesal:~/Documents$ horovodrun -np 4 -H localhost:2 192.168.1.119:2 python Convolution\ Network\ MNIST.py 
2020-07-09 18:47:43.186035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-07-09 18:47:45.112437: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
Traceback (most recent call last):
  File "/home/vinhdiesal/anaconda3/bin/horovodrun", line 21, in <module>
    run_commandline()
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 723, in run_commandline
    _run(args)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 656, in _run
    _launch_job(args, remote_host_names, settings, nics, command)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 717, in _launch_job
    args.verbose)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 694, in run_controller
    gloo_run()
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 706, in gloo_run_fn
    gloo_run(settings, remote_host_names, nics, env, driver_ip, command)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 312, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 250, in launch_gloo
    host_alloc_plan = _allocate(settings.hosts, settings.num_proc)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 103, in _allocate
    raise ValueError("Process number should not be larger than "
ValueError: Process number should not be larger than total available slots.
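
The traceback ends in _allocate(settings.hosts, settings.num_proc), which rejects the job because -np 4 is larger than the number of slots it counted. The following is a rough sketch of that kind of check, not Horovod's actual code: parse_hosts and check_slots are hypothetical helpers, and treating a missing slot count as 1 is an assumption. It assumes the hosts string is a comma-separated list of host:slots entries, which is how Horovod's documentation writes the -H argument (for example server1:4,server2:4).

```python
def parse_hosts(hosts_arg):
    """Parse 'host1:2,host2:2' into a list of (hostname, slots) pairs.

    Hypothetical helper for illustration; a missing ':N' suffix is
    assumed to mean one slot.
    """
    hosts = []
    for entry in hosts_arg.split(","):
        name, _, slots = entry.strip().partition(":")
        hosts.append((name, int(slots) if slots else 1))
    return hosts


def check_slots(hosts_arg, num_proc):
    """Return the total slot count, or raise if num_proc does not fit."""
    total = sum(slots for _, slots in parse_hosts(hosts_arg))
    if num_proc > total:
        raise ValueError(
            "Process number should not be larger than total available slots."
        )
    return total


# Two hosts joined by a comma provide 2 + 2 = 4 slots, enough for -np 4.
print(check_slots("localhost:2,192.168.1.119:2", 4))

# If only the first host reaches the check, 4 processes do not fit in 2 slots.
try:
    check_slots("localhost:2", 4)
except ValueError as err:
    print(err)
```

Under these assumptions, a hosts value of just localhost:2 reproduces the error for -np 4, while localhost:2,192.168.1.119:2 passes.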

I then ran a test with horovodrun from the other machine and got the following result:

mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       vinhdiesal
Executable: razer:2
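
The Executable: razer:2 line suggests that the second host entry was read as the program to launch rather than as part of the host list. A minimal sketch of how that can happen, using the command line from the first run above: split_command is a hypothetical, heavily simplified parser (it assumes -np and -H each take exactly one value and everything after them is the program), not the real horovodrun or mpirun argument handling.

```python
import shlex


def split_command(command_line):
    """Split a launcher command into its options and the program to run.

    Hypothetical simplification: only -np and -H are recognised, and each
    consumes exactly one following token.
    """
    tokens = shlex.split(command_line)[1:]  # drop the launcher name itself
    options = {}
    i = 0
    while i + 1 < len(tokens) and tokens[i] in ("-np", "-H"):
        options[tokens[i]] = tokens[i + 1]
        i += 2
    return options, tokens[i:]


# Hosts separated by a space: -H only receives the first entry.
spaced = r"horovodrun -np 4 -H localhost:2 192.168.1.119:2 python Convolution\ Network\ MNIST.py"
options, program = split_command(spaced)
print(options["-H"], program[0])

# Hosts separated by a comma: -H receives both entries as one value.
joined = r"horovodrun -np 4 -H localhost:2,192.168.1.119:2 python Convolution\ Network\ MNIST.py"
options, program = split_command(joined)
print(options["-H"], program[0])
```

In the space-separated form the second host:slots token is left over as the first word of the program, which matches mpirun's note that it treats the first unrecognized token as the executable.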

Both machines work when I run horovodrun -np 2 -H localhost:2 python file.py

I ran another test and received the following: 
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:

  razer:2

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.

Is there a file I need to configure so that one machine can see the other?

Any suggestions?
