Проблема с отображением внутренней переменной генерирования и ошибка покрытия проблемы со знаком Verilog-кода умножения - PullRequest
0 голосов
/ 09 февраля 2019

Почему следующий код умножения verilog не умножается?

Кроме того, для отладки кода мне нужен доступ к переменной "middle_layers", но gtkwave не дает мне доступа к ней.Почему?

Я также попробовал формальную проверку с использованием yosys-smtbmc , но на удивление код не прошел даже самую простую обложку (in_valid), которую я вообще не понимаю.

module multiply(clk, reset, in_valid, out_valid, in_A, in_B, out_C); // C=A*B

parameter A_WIDTH = 16;
parameter B_WIDTH = 16;

input clk, reset;
input in_valid; // to signify that in_A, in_B are valid, multiplication process can start
input signed [(A_WIDTH-1):0] in_A;
input signed [(B_WIDTH-1):0] in_B;
output signed [(A_WIDTH+B_WIDTH-1):0] out_C;
output reg out_valid; // to signify that out_C is valid, multiplication finished

/* 
   This signed multiplier code architecture is a combination of row adder tree and 
   modified baugh-wooley algorithm, thus requires an area of O(N*M*logN) and time O(logN)
   with M, N being the length(bitwidth) of the multiplicand and multiplier respectively

   see https://i.imgur.com/NaqjC6G.png or 
   Row Adder Tree Multipliers in http://www.andraka.com/multipli.php or
   https://pdfs.semanticscholar.org/415c/d98dafb5c9cb358c94189927e1f3216b7494.pdf#page=10
   regarding the mechanisms within all layers

   In terms of fmax consideration: In the case of an adder tree, the adders making up the levels
   closer to the input take up real estate (remember the structure of row adder tree).  As the 
   size of the input multiplicand bitwidth grows, it becomes more and more difficult to find a
   placement that does not use long routes involving multiple switch nodes for FPGA.  The result
   is the maximum clocking speed degrades quickly as the size of the bitwidth grows.

   For signed multiplication, see also modified baugh-wooley algorithm for trick in skipping 
   sign extension (implemented as verilog example in https://www.dsprelated.com/showarticle/555.php),
   thus smaller final routed silicon area.

   /10753315/ponimanie-modifitsirovannogo-algoritma-umnozheniya-bougli

   All layers are pipelined, so throughput = one result for each clock cycle 
   but each multiplication result still have latency = NUM_OF_INTERMEDIATE_LAYERS 
*/


// The multiplication of two numbers is equivalent to adding as many copies of one 
// of them, the multiplicand, as the value of the other one, the multiplier.
// Therefore, multiplicand always have the larger width compared to multipliers

localparam SMALLER_WIDTH = (A_WIDTH <= B_WIDTH) ? A_WIDTH : B_WIDTH;
localparam LARGER_WIDTH = (A_WIDTH > B_WIDTH) ? A_WIDTH : B_WIDTH;

wire [(LARGER_WIDTH-1):0] MULTIPLICAND = (A_WIDTH > B_WIDTH) ? in_A : in_B ;
wire [(SMALLER_WIDTH-1):0] MULTIPLIPLIER = (A_WIDTH <= B_WIDTH) ? in_A : in_B ;

localparam NUM_OF_INTERMEDIATE_LAYERS = $clog2(SMALLER_WIDTH);


/*Binary multiplications and additions for partial products rows*/

// first layer has "SMALLER_WIDTH" entries of data of width "LARGER_WIDTH"
// This resulted in a binary tree with faster vertical addition processes as we have 
// lesser (NUM_OF_INTERMEDIATE_LAYERS) rows to add

// intermediate partial product rows additions
// Imagine a rhombus of height of "SMALLER_WIDTH" and width of "LARGER_WIDTH"
// being re-arranged into binary row adder tree
// such that additions can be done in O(logN) time

//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)][(A_WIDTH+B_WIDTH-1):0] middle_layers;
reg [(A_WIDTH+B_WIDTH-1):0] middle_layers[(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)];
//reg [(NUM_OF_INTERMEDIATE_LAYERS-1):0] middle_layers [0:(SMALLER_WIDTH-1)] [(A_WIDTH+B_WIDTH-1):0];
//reg middle_layers [(NUM_OF_INTERMEDIATE_LAYERS-1):0][0:(SMALLER_WIDTH-1)][(A_WIDTH+B_WIDTH-1):0];

generate // duplicates the leafs of the binary tree

    genvar layer; // layer 0 means the youngest leaf, layer N means the tree trunk

    for(layer=0; layer<NUM_OF_INTERMEDIATE_LAYERS; layer=layer+1) begin: intermediate_layers

        integer pp_index; // leaf index within each layer of the tree
        integer bit_index; // index of binary string within each leaf

        always @(posedge clk)
        begin
            if(reset) 
            begin
                for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
                    middle_layers[layer][pp_index] <= 0;
            end

            else begin

                if(layer == 0)  // all partial products rows are in first layer
                begin

                    // generation of partial products rows
                    for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
                        middle_layers[layer][pp_index] <= 
                        (MULTIPLICAND & MULTIPLIPLIER[pp_index]);

                    // see modified baugh-wooley algorithm: https://i.imgur.com/VcgbY4g.png
                    for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
                        middle_layers[layer][pp_index][LARGER_WIDTH-1] <= 
                       !middle_layers[layer][pp_index][LARGER_WIDTH-1];

                    middle_layers[layer][SMALLER_WIDTH-1] <= !middle_layers[layer][SMALLER_WIDTH-1];
                    middle_layers[layer][0][LARGER_WIDTH] <= 1;
                    middle_layers[layer][SMALLER_WIDTH-1][LARGER_WIDTH] <= 1;
                end

                // adding the partial product rows according to row adder tree architecture
                else begin
                    for(pp_index=0; pp_index<(SMALLER_WIDTH >> layer) ; pp_index=pp_index+1)
                        middle_layers[layer][pp_index] <=
                        middle_layers[layer-1][pp_index<<1] +
                      (middle_layers[layer-1][(pp_index<<1) + 1]) << 1;

                    // bit-level additions using full adders
                    /*for(pp_index=0; pp_index<SMALLER_WIDTH ; pp_index=pp_index+1)
                        for(bit_index=0; bit_index<(LARGER_WIDTH+layer); bit_index=bit_index+1)
                            full_adder fa(.clk(clk), .reset(reset), .ain(), .bin(), .cin(), .sum(), .cout());*/
                end
            end
        end
    end

endgenerate

assign out_C = (reset)? 0 : middle_layers[NUM_OF_INTERMEDIATE_LAYERS-1][0];


/*Checking if the final multiplication result is ready or not*/

reg [($clog2($clog2(SMALLER_WIDTH))-1):0] out_valid_counter; // to track the multiply stages
reg multiply_had_started;

always @(posedge clk)
begin
    if(reset) 
    begin
        multiply_had_started <= 0;
        out_valid <= 0;
        out_valid_counter <= 0;
    end

    else if(out_valid_counter == $clog2(SMALLER_WIDTH)-1) begin
        multiply_had_started <= 0;
        out_valid <= 1;
        out_valid_counter <= 0;
    end

    else if(in_valid && !multiply_had_started) begin
        multiply_had_started <= 1;
        out_valid <= 0; // for consecutive multiplication
    end

    else begin
        out_valid <= 0;
        if(multiply_had_started) out_valid_counter <= out_valid_counter + 1;
    end
end


`ifdef FORMAL

initial assume(in_valid == 0);
initial assert(out_valid == 0);
initial assert(out_valid_counter == 0);

wire sign_bit = in_A[A_WIDTH-1] ^ in_B[B_WIDTH-1];

always @(posedge clk)
begin
    if(reset) assert(out_C == 0);

    else if(out_valid) begin
        assert(out_C == (in_A * in_B));
        assert(out_C[A_WIDTH+B_WIDTH-1] == sign_bit);
    end
end

`endif

`ifdef FORMAL

always @(posedge clk)
begin
    cover(in_valid && (in_A != 0) && (in_B != 0));
    cover(out_valid);
end

`endif

endmodule


module full_adder(clk, reset, ain, bin, cin, sum, cout);

input clk, reset;
input ain, bin, cin;
output reg sum, cout;

// Full Adder Equations
// Sum = A ⊕ B ⊕ Cin and Cout = (A ⋅ B) + (Cin ⋅ (A ⊕ B))
// where A ⊕ B is equivalent to A XOR B , A ⋅ B is equivalent to A AND B

always @(posedge clk)
begin
    if(reset)
    begin
        sum <= 0;
        cout <= 0;
    end

    else begin
        sum <= ain^bin^cin;
        cout <= (ain & bin) | (cin & (ain^bin));
        //cout <= (ain * bin) + (cin * (ain - bin)); 
    end
end

endmodule

wrong gtkwave multiplication result

...