When I run horovodrun -np 4 -H 192.168.1.191:2 192.168.1.119:2 python Convolution\ Network\ MNIST.py
I get the same error. I installed Horovod with the GPU variables set.
(base) vinhdiesal@vinhdiesal:~/Documents$ horovodrun -np 4 -H localhost:2 192.168.1.119:2 python Convolution\ Network\ MNIST.py
2020-07-09 18:47:43.186035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-07-09 18:47:45.112437: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
Traceback (most recent call last):
  File "/home/vinhdiesal/anaconda3/bin/horovodrun", line 21, in <module>
    run_commandline()
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 723, in run_commandline
    _run(args)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 656, in _run
    _launch_job(args, remote_host_names, settings, nics, command)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 717, in _launch_job
    args.verbose)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 694, in run_controller
    gloo_run()
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/runner.py", line 706, in gloo_run_fn
    gloo_run(settings, remote_host_names, nics, env, driver_ip, command)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 312, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 250, in launch_gloo
    host_alloc_plan = _allocate(settings.hosts, settings.num_proc)
  File "/home/vinhdiesal/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 103, in _allocate
    raise ValueError("Process number should not be larger than "
ValueError: Process number should not be larger than total available slots.
I also ran a test by executing horovodrun on the other machine and got the following result:
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: vinhdiesal
Executable: razer:2
Both machines work when I run horovodrun -np 2 -H localhost:2 python file.py
I ran another test and received the following:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:
razer:2
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
Is there a file I need to configure so that one machine can see the other?
Any suggestions?
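
For reference, here is a minimal sketch of the slot arithmetic behind both error messages. It is not Horovod's or Open MPI's actual code: total_slots is a hypothetical helper, and it assumes the host list is a single comma-separated host:slots string (the form Horovod's documentation uses, e.g. server1:4,server2:4), with a missing ":N" counted as 1 slot. The mpirun output above lists razer:2 as the executable, which suggests the second space-separated token after -H was not read as a host at all.

```python
def total_slots(hosts: str) -> int:
    """Sum the slot counts in a comma-separated 'host:slots' string.

    Hypothetical helper for illustration only; an entry without
    ':N' is assumed to contribute 1 slot.
    """
    total = 0
    for entry in hosts.split(","):
        _name, _, slots = entry.strip().partition(":")
        total += int(slots) if slots else 1
    return total


requested = 4  # the value passed to -np

# With a space between the hosts, only the first token belongs to -H.
print(total_slots("localhost:2") >= requested)

# With a comma, both hosts are part of the same -H argument.
print(total_slots("localhost:2,192.168.1.119:2") >= requested)
```

Under these assumptions the first check fails (2 slots for 4 processes) and the second passes (4 slots for 4 processes).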