OpenMPI with 32-bit and 64-bit processors fails when run in parallel
November 1, 2018

This concerns MPI across two computers with 64-bit processors and a third computer with a 32-bit processor. All three machines use identical paths for lib and bin, share the same bashrc, and keep the executables in the same folder. The SSH connection works the same way for the 64-bit and the 32-bit machines. The server is a 64-bit machine. I compiled the executable locally on the 32-bit machine (it appears as [K7ASA:1555] in the output), and it ran there on its own, but when I tried to run it in parallel I got this message.

mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 ./mpi_quad-1
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[K7ASA:1555] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40577,1],2]
  Exit code:    1

Here is the output for

mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output uname -a

[1,0]<stdout>:Linux verthex-Lenovo-V570 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 20:16:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[1,1]<stdout>:Linux verthex-HP-Pavilion-zv5000-DP299AV 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[1,2]<stdout>:Linux verthex-K7ASA 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:58:23 UTC 2018 i686 athlon i686 GNU/Linux

mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output file mpi_quad-1

[1,0]<stdout>:mpi_quad-1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=a7aa397b9a339ae464201270a065fa7037721016, not stripped
[1,1]<stdout>:mpi_quad-1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=a7aa397b9a339ae464201270a065fa7037721016, not stripped
[1,2]<stdout>:mpi_quad-1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=a7aa397b9a339ae464201270a065fa7037721016, not stripped
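The `uname -a` and `file` outputs above can be cross-checked on each host: [1,2] reports an i686 machine, while the `mpi_quad-1` it sees is a 64-bit ELF file. Below is a minimal sketch of a script that does this comparison automatically. The script name `check_arch.sh` and the function names are hypothetical, and the mapping from `uname -m` values to word size covers only the machine types that appear in this post. It only reports whether the binary's ELF class matches the host; it does not say why MPI_Init fails.

```shell
#!/bin/sh
# Hypothetical helper (check_arch.sh): compare an ELF binary's word size
# with the word size of the machine the script runs on.

# Print "32" or "64" for an ELF file, based on the EI_CLASS byte at offset 4.
elf_bits() {
    # Bytes 0-3 of an ELF file are the magic number 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    # Byte 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown; return 1 ;;
    esac
}

# Print "32" or "64" for a machine name as reported by `uname -m`.
# Assumption: only the x86 names seen in this post are mapped.
machine_bits() {
    case "$1" in
        x86_64) echo 64 ;;
        i?86)   echo 32 ;;
        *)      echo unknown ;;
    esac
}

# Report whether the given binary matches the local machine.
check_binary() {
    host_bits=$(machine_bits "$(uname -m)")
    file_bits=$(elf_bits "$1")
    if [ "$host_bits" = "$file_bits" ]; then
        echo "$(uname -n): OK, $1 is ${file_bits}-bit on a ${host_bits}-bit host"
    else
        echo "$(uname -n): MISMATCH, $1 is ${file_bits}-bit on a ${host_bits}-bit host"
        return 1
    fi
}

# Run the check only when a path is passed on the command line.
if [ "$#" -gt 0 ]; then
    check_binary "$1"
fi
```

Assuming the script is present at the same path on every host, it could be launched the same way as the commands above, for example `mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output sh ./check_arch.sh ./mpi_quad-1`.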

mpirun -host 10.42.0.163,10.42.0.72,10.42.0.68 --tag-output ldd mpi_quad-1

[1,0]<stdout>:  linux-vdso.so.1 =>  (0x00007ffc091eb000)
[1,0]<stdout>:  libmpi.so.40 => /usr/local/lib/libmpi.so.40 (0x00007fbda7934000)
[1,0]<stdout>:  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbda7717000)
[1,0]<stdout>:  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbda734d000)
[1,0]<stdout>:  libopen-rte.so.40 => /usr/local/lib/libopen-rte.so.40 (0x00007fbda7096000)
[1,0]<stdout>:  libopen-pal.so.40 => /usr/local/lib/libopen-pal.so.40 (0x00007fbda6d8b000)
[1,0]<stdout>:  librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbda6b83000)
[1,0]<stdout>:  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbda687a000)
[1,0]<stdout>:  /lib64/ld-linux-x86-64.so.2 (0x00007fbda7c2e000)
[1,0]<stdout>:  libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbda6660000)
[1,0]<stdout>:  libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbda645c000)
[1,0]<stdout>:  libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fbda6251000)
[1,0]<stdout>:  libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fbda604e000)
[1,1]<stdout>:  [1,1]<stdout>:linux-vdso.so.1 (0x00007ffcfcdd0000)
[1,1]<stdout>:  [1,1]<stdout>:libmpi.so.40 => /usr/local/lib/libmpi.so.40 (0x00007f59231b5000)
[1,1]<stdout>:  [1,1]<stdout>:libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5922f96000)
[1,1]<stdout>:  [1,1]<stdout>:libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5922ba5000)
[1,1]<stdout>:  [1,1]<stdout>:libopen-rte.so.40 => /usr/local/lib/libopen-rte.so.40 (0x00007f59228f0000)
[1,1]<stdout>:  libopen-pal.so.40 => /usr/local/lib/libopen-pal.so.40 (0x00007f59225e1000)
[1,1]<stdout>:  librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f59223d9000)
[1,1]<stdout>:  [1,1]<stdout>:libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f592203b000)
[1,1]<stdout>:  /lib64/ld-linux-x86-64.so.2 (0x00007f59234ca000)
[1,1]<stdout>:  libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f5921e1e000)
[1,1]<stdout>:  libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f5921c1a000)
[1,1]<stdout>:  libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f5921a0f000)
[1,1]<stdout>:  [1,1]<stdout>:libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f592180c000)
[1,2]<stdout>:  [1,2]<stdout>:not a dynamic executable
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[45618,1],2]
  Exit code:    1
--------------------------------------------------------------------------
...