How can I use MPI_Win_allocate_shared without getting an error?
1 vote
/ 20 March 2019

I need to implement an algorithm that uses two buffers ((L+1) x (N+2) matrices) shared by all processes: each process must be able to write to them and to read what the other processes have written. I found that MPI_Win_allocate_shared could be the solution, but I don't think I have fully understood how to use it, because I am getting errors. Below is my code with the two attempts that I believe come closest to a solution (I have left out the rest of the algorithm to focus on the problem):

#include "Options.h"
#include <math.h>
#include <array>
#include <algorithm>
#include <memory>
#include <cmath>
#include <mpi.h>

std::pair<double, double> Options::BinomialPriceAmericanPut(void) {

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

// shared buffers to save data for seller and buyer
MPI_Win win_seller, win_buyer;
// size of the local window in bytes
MPI_Aint buff_size;

///////////////// TRIAL 1 /////////////////////////
// pointers that will (locally) point to the shared memory
typedef std::array<PWL, N+2> row_type;
row_type *seller_buff;
row_type *buyer_buff;

///////////////// TRIAL 2 /////////////////////////
// pointers that will (locally) point to the shared memory
typedef std::array<PWL, N+2> row_type;
row_type seller_buff[L+1];
row_type buyer_buff[L+1];
// with this TRIAL 2 I remove the "&" in front of seller_buff and buyer_buff
// in MPI_Win_allocate_shared and MPI_Win_shared_query

// allocate shared memory
if (rank == 0) {
    buff_size = (N+2) * (L+1) * sizeof(PWL);
    MPI_Win_allocate_shared(buff_size, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &seller_buff, &win_seller);
    MPI_Win_allocate_shared(buff_size, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &buyer_buff, &win_buyer);
}
else {
    int disp_unit;
    MPI_Win_allocate_shared(0, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &seller_buff, &win_seller);
    MPI_Win_allocate_shared(0, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &buyer_buff, &win_buyer);
    MPI_Win_shared_query(win_seller, 0, &buff_size, &disp_unit, &seller_buff);
    MPI_Win_shared_query(win_buyer, 0, &buff_size, &disp_unit, &buyer_buff);
}

// up- and down- move factors
double u = exp( sigma * sqrt(expiry/N) );

// cash accumulation factor
double r = exp( R*expiry / N );

// initialize algorithm
int p(size);
int n = N + 2; // number of nodes in the current base level
int s = rank * ( n/p );
int e = (rank==p-1)? n: (rank+1) * ( n/p );
// each core works on e-s nodes in the current level

// compute u and z for both seller and buyer: payoff (0,0) at time N+1
for (int l=s; l<e; l++) {
    const double St = S0*pow (u, N+1-2*l);
    const double Sa = St * (1+k);
    const double Sb = St * (1-k);

    // compute functions
    PWL u_s( {Line(-Sa, 0), Line(-Sb,0)} );
    PWL u_b( {Line(-Sa, 0), Line(-Sb, 0)} );

    // fill buffers
    seller_buff[0][l] = u_s;
    buyer_buff[0][l] = u_b;
}

MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0 ) {
    std::cout << "Row: " << 11 << std::endl
    << "\tAsk = " << seller_buff[0][7].valueInPoint(0) << std::endl
    << "\tBid = " << -buyer_buff[0][7].valueInPoint(0) << std::endl;
}

int U = 0; // variable for the mapping from tree to buffers
int B=N+1; // current base level
while ( B>0 ) {
  // do stuff with the buffers
}

// compute ask and bid prices
double ask(0), bid(0);

// clear shared windows
MPI_Win_free(&win_seller);
MPI_Win_free(&win_buyer);

return std::make_pair(bid, ask);
}

I added the "if" after MPI_Barrier to check whether the buffers work: with N = 10, column 7 should have been computed by rank 1. TRIAL 1 actually worked when I used another, simpler class, but with the PWL class it does not. The errors in the two trials are:

1) In TRIAL 1 I get a segmentation fault caused by the valueInPoint() call inside the "if": rank 0 apparently cannot see what rank 1 wrote into its columns, but I don't understand why.

mpiexec -np 3 main
[localhost:09623] *** Process received signal ***
[localhost:09623] Signal: Segmentation fault (11)
[localhost:09623] Signal code: Address not mapped (1)
[localhost:09623] Failing at address: 0x26fd440
[localhost:09623] [ 0] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libpthread.so.0(+0x10e20)[0x7fa78a99de20]
[localhost:09623] [ 1] main[0x4048ac]
[localhost:09623] [ 2] main[0x4048f8]
[localhost:09623] [ 3] main[0x401e55]
[localhost:09623] [ 4] main[0x40178e]
[localhost:09623] [ 5] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7fa78a60c6b0]
[localhost:09623] [ 6] main[0x401389]
[localhost:09623] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 9623 on node localhost exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [Makefile:26: run] Error 139

2) In this case rank 0 can access and print what rank 1 did, but I get a different error.

mpiexec -np 3 main
Row: 11
    Ask = 0
    Bid = -0
[localhost:09651] *** Process received signal ***
[localhost:09651] Signal: Segmentation fault (11)
[localhost:09651] Signal code: Address not mapped (1)
[localhost:09651] Failing at address: 0x7fe9777c90bc
[localhost:09651] [ 0] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libpthread.so.0(+0x10e20)[0x7fe976627e20]
[localhost:09651] [ 1] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(cfree+0x14)[0x7fe9762eef74]
[localhost:09651] [ 2] main[0x404724]
[localhost:09651] [ 3] main[0x404096]
[localhost:09651] [ 4] main[0x40362e]
[localhost:09651] [ 5] main[0x403127]
[localhost:09651] [ 6] main[0x40274f]
[localhost:09651] [ 7] main[0x402528]
[localhost:09651] [ 8] main[0x4025e6]
[localhost:09651] [ 9] main[0x402017]
[localhost:09651] [10] main[0x40178e]
[localhost:09651] [11] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7fe9762966b0]
[localhost:09651] [12] main[0x401389]
[localhost:09651] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 9651 on node localhost exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [Makefile:26: run] Error 139

Moreover, when I actually run the whole algorithm (without commenting out the while loop) with TRIAL 2, I get yet another error:

mpiexec -np 3 main
Row: 11
    Ask = 0
    Bid = -0
*** Error in `main': free(): invalid pointer: 0x00007f2ccac660c4 ***
======= Backtrace: =========
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x6f2e4)[0x7f2cc977f2e4]
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x74d16)[0x7f2cc9784d16]
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x754fe)[0x7f2cc97854fe]
main[0x405dd6]
main[0x40556c]
main[0x404930]
main[0x404429]
main[0x40396f]
main[0x404eb7]
main[0x404334]
main[0x403833]
main[0x4023ee]
main[0x40178e]
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7f2cc97306b0]
main[0x401389]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 00:25 659                                /vagrant/Google Drive/ACP/TENTATIVO4/main
0060d000-0060e000 r--p 0000d000 00:25 659                                /vagrant/Google Drive/ACP/TENTATIVO4/main
0060e000-0060f000 rw-p 0000e000 00:25 659                                /vagrant/Google Drive/ACP/TENTATIVO4/main
00fac000-01239000 rw-p 00000000 00:00 0                                  [heap]
7f2cb0000000-7f2cb0021000 rw-p 00000000 00:00 0 
7f2cb0021000-7f2cb4000000 ---p 00000000 00:00 0 
7f2cb7fff000-7f2cc0000000 rw-s 00000000 fd:00 202783851                  /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/shared_mem_pool.localhost (deleted)
7f2cc0000000-7f2cc0021000 rw-p 00000000 00:00 0 
7f2cc0021000-7f2cc4000000 ---p 00000000 00:00 0 
7f2cc48f1000-7f2cc4cf2000 rw-s 00000000 fd:00 135477240                  /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/2/vader_segment.localhost.2
7f2cc4cf2000-7f2cc50f3000 rw-s 00000000 fd:00 68300033                   /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/1/vader_segment.localhost.1
7f2cc50f3000-7f2cc54f4000 rw-s 00000000 fd:00 1474379                    /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/0/vader_segment.localhost.0
7f2cc54f4000-7f2cc54ff000 r-xp 00000000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc54ff000-7f2cc56fe000 ---p 0000b000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc56fe000-7f2cc56ff000 r--p 0000a000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc56ff000-7f2cc5700000 rw-p 0000b000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc5700000-7f2cc5706000 rw-p 00000000 00:00 0 
7f2cc5706000-7f2cc5707000 ---p 00000000 00:00 0 
7f2cc5707000-7f2cc5f07000 rw-p 00000000 00:00 0                          [stack:9979]
7f2cc5f07000-7f2cc5f2b000 r-xp 00000000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc5f2b000-7f2cc612b000 ---p 00024000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc612b000-7f2cc612c000 r--p 00024000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc612c000-7f2cc612d000 rw-p 00025000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc612d000-7f2cc6142000 r-xp 00000000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6142000-7f2cc6341000 ---p 00015000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6341000-7f2cc6342000 r--p 00014000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6342000-7f2cc6343000 rw-p 00015000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6343000-7f2cc7bbf000 r--p 00000000 fd:00 1549735                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicudata.so.57.1
7f2cc7bbf000-7f2cc7dbe000 ---p 0187c000 fd:00 1549735                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicudata.so.57.1
7f2cc7dbe000-7f2cc7dbf000 r--p 0187b000 fd:00 1549735                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicudata.so.57.1
7f2cc7dbf000-7f2cc7f4d000 r-xp 00000000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc7f4d000-7f2cc814d000 ---p 0018e000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc814d000-7f2cc815f000 r--p 0018e000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc815f000-7f2cc8160000 rw-p 001a0000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc8160000-7f2cc8162000 rw-p 00000000 00:00 0 
7f2cc8162000-7f2cc83c3000 r-xp 00000000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc83c3000-7f2cc85c3000 ---p 00261000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc85c3000-7f2cc85d0000 r--p 00261000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc85d0000-7f2cc85d2000 rw-p 0026e000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc85d2000-7f2cc85d3000 rw-p 00000000 00:00 0 
7f2cc85d3000-7f2cc85d5000 r-xp 00000000 fd:00 2590182                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libdl-2.23.so[localhost:09977] *** Process received signal ***
[localhost:09977] Signal: Aborted (6)
[localhost:09977] Signal code:  (-6)
[localhost:09977] [ 0] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libpthread.so.0(+0x10e20)[0x7f2cc9ac1e20]
[localhost:09977] [ 1] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(gsignal+0x38)[0x7f2cc9743228]
[localhost:09977] [ 2] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(abort+0x16a)[0x7f2cc97446aa]
[localhost:09977] [ 3] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x6f2e9)[0x7f2cc977f2e9]
[localhost:09977] [ 4] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x74d16)[0x7f2cc9784d16]
[localhost:09977] [ 5] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x754fe)[0x7f2cc97854fe]
[localhost:09977] [ 6] main[0x405dd6]
[localhost:09977] [ 7] main[0x40556c]
[localhost:09977] [ 8] main[0x404930]
[localhost:09977] [ 9] main[0x404429]
[localhost:09977] [10] main[0x40396f]
[localhost:09977] [11] main[0x404eb7]
[localhost:09977] [12] main[0x404334]
[localhost:09977] [13] main[0x403833]
[localhost:09977] [14] main[0x4023ee]
[localhost:09977] [15] main[0x40178e]
[localhost:09977] [16] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7f2cc97306b0]
[localhost:09977] [17] main[0x401389]
[localhost:09977] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 9977 on node localhost exited on signal 6 (Aborted).
--------------------------------------------------------------------------
make: *** [Makefile:26: run] Error 134

Could someone please help me understand what is going on and how to fix it? Thanks, everyone.
