I have two machines with different usernames: say user1@master and user2@slave. I would like to run an MPI job across the two machines, but so far I have not managed to. I have successfully set up passwordless ssh between the two computers. Both machines have the same version of OpenMPI, and PATH and LD_LIBRARY_PATH are set accordingly on each of them.
The OpenMPI install path on each machine is /home/$USER/.openmpi, and the program I want to run lives inside ~/folder.
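For reference, the environment setup is the same on both machines; a sketch of the relevant ~/.bashrc lines, assuming the install prefix above:

# prepend the per-user OpenMPI install to the search paths
export PATH=/home/$USER/.openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/$USER/.openmpi/lib:$LD_LIBRARY_PATH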
My /etc/hosts file on both machines contains:
x.x.x.110 master
x.x.x.111 slave
My ~/.ssh/config file on user1@master contains:
Host slave
User user2
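To confirm that ssh actually picks up this mapping, a check along these lines (a sketch; the debug line is what I understand OpenSSH prints) shows which account it authenticates as:

$ ssh -v slave true 2>&1 | grep Authenticating
# expected: debug1: Authenticating to slave:22 as 'user2'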
Then, from inside ~/folder on user1@master, I run the following command (one process locally and one on slave, using mpiexec's colon-separated syntax for multiple application contexts):
$ mpiexec -n 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program
and I get the following error:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
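Since the message points at PATH on the remote side, my understanding is that the relevant check is what a non-interactive shell on slave sees (a sketch):

$ ssh slave 'echo $PATH'    # PATH as seen by the non-interactive shell mpiexec uses
$ ssh slave 'which orted'   # orted has to resolve here, not just in a login shell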
EDIT
If I use a hostfile with the contents:
localhost
user2@slave
together with the --mca argument shown below, I get the following error:
$ mpirun --mca plm_base_verbose 10 -n 5 --hostfile hosts.txt ./program
[user:29277] mca: base: components_register: registering framework plm components
[user:29277] mca: base: components_register: found loaded component slurm
[user:29277] mca: base: components_register: component slurm register function successful
[user:29277] mca: base: components_register: found loaded component isolated
[user:29277] mca: base: components_register: component isolated has no register or open function
[user:29277] mca: base: components_register: found loaded component rsh
[user:29277] mca: base: components_register: component rsh register function successful
[user:29277] mca: base: components_open: opening plm components
[user:29277] mca: base: components_open: found loaded component slurm
[user:29277] mca: base: components_open: component slurm open function successful
[user:29277] mca: base: components_open: found loaded component isolated
[user:29277] mca: base: components_open: component isolated open function successful
[user:29277] mca: base: components_open: found loaded component rsh
[user:29277] mca: base: components_open: component rsh open function successful
[user:29277] mca:base:select: Auto-selecting plm components
[user:29277] mca:base:select:( plm) Querying component [slurm]
[user:29277] mca:base:select:( plm) Querying component [isolated]
[user:29277] mca:base:select:( plm) Query of component [isolated] set priority to 0
[user:29277] mca:base:select:( plm) Querying component [rsh]
[user:29277] mca:base:select:( plm) Query of component [rsh] set priority to 10
[user:29277] mca:base:select:( plm) Selected component [rsh]
[user:29277] mca: base: close: component slurm closed
[user:29277] mca: base: close: unloading component slurm
[user:29277] mca: base: close: component isolated closed
[user:29277] mca: base: close: unloading component isolated
[user:29277] *** Process received signal ***
[user:29277] Signal: Segmentation fault (11)
[user:29277] Signal code: (128)
[user:29277] Failing at address: (nil)
[user:29277] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f4226242f20]
[user:29277] [ 1] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197)[0x7f422629b207]
[user:29277] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup_function+0x10a)[0x7f422634d06a]
[user:29277] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup+0x3d)[0x7f422634d19d]
[user:29277] [ 4] /lib/x86_64-linux-gnu/libc.so.6(getpwuid_r+0x2f3)[0x7f42262e7ee3]
[user:29277] [ 5] /lib/x86_64-linux-gnu/libc.so.6(getpwuid+0x98)[0x7f42262e7498]
[user:29277] [ 6] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x477d)[0x7f422356977d]
[user:29277] [ 7] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x67a7)[0x7f422356b7a7]
[user:29277] [ 8] /home/.openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdc9)[0x7f4226675749]
[user:29277] [ 9] mpirun(+0x1262)[0x563fde915262]
[user:29277] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f4226225b97]
[user:29277] [11] mpirun(+0xe7a)[0x563fde914e7a]
[user:29277] *** End of error message ***
Segmentation fault (core dumped)
I am not getting any of the orte ssh debug information I was asked to provide, but perhaps that is because I am mistyping the --mca option?
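From the first error text, I understand mpirun's --prefix option to be the run-time equivalent of configuring with --enable-orterun-prefix-by-default, so this is the form I intend to try next (a sketch, assuming the remote install lives under /home/user2/.openmpi):

$ mpiexec --prefix /home/user2/.openmpi -n 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program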