This concerns MPI on two computers with 64-bit processors and a third computer with a 32-bit processor. All computers have exactly the same locations for lib and bin, and they all have the same .bashrc, along with the same folder where the executables are stored. SSH works the same way on both the 64-bit and the 32-bit machines. The server is a 64-bit machine. I compiled the executable locally on the 32-bit machine (shown as [K7ASA:1555] in the output below), and it ran there, but when I tried to run it in parallel, I got this message:
mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 ./mpi_quad-1
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[K7ASA:1555] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[40577,1],2]
Exit code: 1
Here is the output of these commands:
mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output uname -a
[1,0]<stdout>:Linux verthex-Lenovo-V570 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 20:16:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[1,1]<stdout>:Linux verthex-HP-Pavilion-zv5000-DP299AV 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[1,2]<stdout>:Linux verthex-K7ASA 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:58:23 UTC 2018 i686 athlon i686 GNU/Linux
mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output file mpi_quad-1
[1,0]<stdout>:mpi_quad-1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=a7aa397b9a339ae464201270a065fa7037721016, not stripped
[1,1]<stdout>:mpi_quad-1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=a7aa397b9a339ae464201270a065fa7037721016, not stripped
[1,2]<stdout>:mpi_quad-1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=a7aa397b9a339ae464201270a065fa7037721016, not stripped
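The file output reports the same ELF 64-bit x86-64 binary on all three ranks, while uname -a reports rank 2 (verthex-K7ASA) as i686. A minimal Python sketch for comparing these two facts directly: it reads the ELF class byte (offset 4) and the e_machine field (offset 18) from the executable's header and compares them with platform.machine(). The helper names are hypothetical, and the rule in runnable_on is an assumption (a 64-bit kernel is assumed to run 32-bit binaries only if IA-32 emulation and 32-bit libraries are present).

```python
import platform
import struct
import sys

# e_machine values from the ELF specification (only the two relevant to this cluster)
EM_386 = 3       # 32-bit x86 (i386/i686)
EM_X86_64 = 62   # 64-bit x86-64


def read_elf_header(data: bytes) -> tuple[int, int]:
    """Return (bits, e_machine) parsed from the first bytes of an ELF file."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(data[4])  # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64
    if bits is None:
        raise ValueError("unknown ELF class")
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = LSB, 2 = MSB
    # e_machine follows the 16-byte e_ident and the 2-byte e_type, at offset 18
    (e_machine,) = struct.unpack_from(endian + "H", data, 18)
    return bits, e_machine


def runnable_on(e_machine: int, host_machine: str) -> bool:
    """Rough check whether a host (as named by uname -m) can execute the binary."""
    if host_machine == "x86_64":
        # Assumption: a 64-bit kernel also runs i386 binaries when IA-32
        # emulation and 32-bit libraries are installed.
        return e_machine in (EM_386, EM_X86_64)
    if host_machine in ("i386", "i486", "i586", "i686"):
        # A 32-bit kernel cannot execute x86-64 code.
        return e_machine == EM_386
    return False


def check_file(path: str) -> None:
    with open(path, "rb") as f:
        bits, machine = read_elf_header(f.read(64))
    host = platform.machine()
    print(f"{path}: ELF {bits}-bit, e_machine={machine}, host={host}, "
          f"runnable={runnable_on(machine, host)}")


if __name__ == "__main__":
    for p in sys.argv[1:]:
        check_file(p)
```

Saved under a hypothetical name such as check_elf.py, it could be launched on every host in the same way as the commands above (mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output python3 check_elf.py mpi_quad-1), assuming python3 is installed on each machine.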
mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output ldd mpi_quad-1
[1,0]<stdout>: linux-vdso.so.1 => (0x00007ffc091eb000)
[1,0]<stdout>: libmpi.so.40 => /usr/local/lib/libmpi.so.40 (0x00007fbda7934000)
[1,0]<stdout>: libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbda7717000)
[1,0]<stdout>: libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbda734d000)
[1,0]<stdout>: libopen-rte.so.40 => /usr/local/lib/libopen-rte.so.40 (0x00007fbda7096000)
[1,0]<stdout>: libopen-pal.so.40 => /usr/local/lib/libopen-pal.so.40 (0x00007fbda6d8b000)
[1,0]<stdout>: librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbda6b83000)
[1,0]<stdout>: libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbda687a000)
[1,0]<stdout>: /lib64/ld-linux-x86-64.so.2 (0x00007fbda7c2e000)
[1,0]<stdout>: libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbda6660000)
[1,0]<stdout>: libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbda645c000)
[1,0]<stdout>: libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fbda6251000)
[1,0]<stdout>: libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fbda604e000)
[1,1]<stdout>: [1,1]<stdout>:linux-vdso.so.1 (0x00007ffcfcdd0000)
[1,1]<stdout>: [1,1]<stdout>:libmpi.so.40 => /usr/local/lib/libmpi.so.40 (0x00007f59231b5000)
[1,1]<stdout>: [1,1]<stdout>:libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5922f96000)
[1,1]<stdout>: [1,1]<stdout>:libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5922ba5000)
[1,1]<stdout>: [1,1]<stdout>:libopen-rte.so.40 => /usr/local/lib/libopen-rte.so.40 (0x00007f59228f0000)
[1,1]<stdout>: libopen-pal.so.40 => /usr/local/lib/libopen-pal.so.40 (0x00007f59225e1000)
[1,1]<stdout>: librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f59223d9000)
[1,1]<stdout>: [1,1]<stdout>:libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f592203b000)
[1,1]<stdout>: /lib64/ld-linux-x86-64.so.2 (0x00007f59234ca000)
[1,1]<stdout>: libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f5921e1e000)
[1,1]<stdout>: libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f5921c1a000)
[1,1]<stdout>: libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f5921a0f000)
[1,1]<stdout>: [1,1]<stdout>:libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f592180c000)
[1,2]<stdout>: [1,2]<stdout>:not a dynamic executable
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[45618,1],2]
Exit code: 1
--------------------------------------------------------------------------
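With --tag-output, every line is prefixed with [jobid,rank]<stream>: (sometimes twice, as in the rank 1 and rank 2 ldd lines above), so comparing ranks by eye is tedious. A minimal Python sketch that groups tagged lines by rank and lists the ranks whose ldd output never mentions libmpi.so; group_by_rank and ranks_missing are hypothetical helper names, and the regular expression assumes the prefix format seen in this output.

```python
import re
import sys
from collections import defaultdict

# One "[jobid,rank]<stream>:" prefix, as printed by mpirun --tag-output above
TAG = re.compile(r"\[(\d+),(\d+)\]<(stdout|stderr)>:")


def group_by_rank(lines):
    """Group tagged lines by MPI rank, stripping the (possibly repeated) tags."""
    grouped = defaultdict(list)
    for line in lines:
        m = TAG.match(line)
        if not m:
            continue  # mpirun's own banners and summaries carry no tag
        rank = int(m.group(2))
        # Remove every occurrence of the tag, then surrounding whitespace
        grouped[rank].append(TAG.sub("", line).strip())
    return dict(grouped)


def ranks_missing(grouped, needle="libmpi.so"):
    """Ranks whose output never contains the given library name."""
    return sorted(r for r, texts in grouped.items()
                  if not any(needle in t for t in texts))


if __name__ == "__main__":
    print(ranks_missing(group_by_rank(sys.stdin)))
```

For the ldd output shown here, only rank 2 lacks a libmpi.so line, so the sketch would report [2].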