# DESIGN OF DECIMAL MULTIPLICATION USING SIGN MAGNITUDE ENCODING FOR VLSI ARCHITECTURE

Saritha<sup>1</sup>, Manchalla.O.V.P.Kumar<sup>2</sup>

<sup>1</sup>PG Scholar, Dept of ECE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana.

2Assistant Professor, Dept of ECE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana.

Abstract- This paper introduces two novel architectures for parallel decimal multipliers. Our multipliers are based on a new algorithm for decimal carry—save multi operand addition that uses a novel TCSD recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. We also present three schemes for fast and efficient generation of partial products in parallel. The recoding of the TCSD multiplier operand into minimally redundant signed—digit radix—10, radix—4 and radix—5 representations using new recoders reduces the complexity of partial product generation. In addition, SD radix—4 and radix—5 recodings allow the reuse of a conventional parallel binary radix—4 multiplier to perform combined binary/decimal multiplications.

**Keywords-** Radix-10 multiplier; redundant representation; sign-magnitude signed digits (SMSDs); VLSI design.

### I. INTRODUCTION

Decimal computer arithmetic [1] is preferred in decimal data processing environments such as scientific, commercial, financial, internet-based applications in monetary, web-based, and human interactive applications. Ever growing needs for processing power, required by applications with intensive decimal arithmetic, cannot be met by conventional slow software simulated decimal arithmetic units. However, their hardware counterparts as an integral part of recently commercialized general purpose processors are gaining importance. Binarycoded decimal (BCD) encoding of decimal digits has conventionally dominated decimal arithmetic algorithms, whether realized by hardware or in software. The research for hardware realization of decimal arithmetic is not matured yet and there are rooms for improvements in hardware algorithms and designs. For example, the state-ofthe-art BCD multipliers[2], for computing X Y, use iterative multiplication algorithms, where the partial products (i.e. the product of one BCD digit of the multiplier Y times the multi-BCD-digit multiplicand X) are generated one at a time and added to the previously accumulated result. Each partial product may be directly generated as one BCD number in [0, 9] X, or may be composed of few easy multiples of the multiplicand (e.g. 7X \(^1\)4 4X \(^1\) 2X \(^1\) X). The latter approach tends to increase the depth (measured by the maximum number of equally weighted BCD digits) of partial product tree per each BCD digit of multiplier, which in general leads to slower partial product accumulation. But, by using possibly fast and low-cost BCD digit by BCD-digit multipliers, the former approach may lead to less costly BCD multipliers. Erle et al. have enumerated three reasons for using decimal digit-by-digit multipliers for partial product generation, which leads to less number of cycles, less wiring and no need for registers to store multiples of the multiplicand. With the rapid advances in VLSI technology, semi(fully)-parallel BCD multipliers will soon be attractive, where more than one (all) partial product(s) are generated at once and accumulated in parallel.

#### II. LITERATURE SURVEY

Dynamic negation of pre computed X multiples reduces their selection cost at the penalty of one XOR gate per each bit of the selected positive multiple. This negation cost is replicated n times for parallel n×n multiplication. Moreover, the n inserted 1s for 10's complementation in and  $n \times (n+1)$  1s for digit wise two's complementation in have a negative impact on area and power saving. The same is true for the correction constant, and more complex recoding due to zero handling, for [0, 15] partial products. One way to save these costs, as we do in Section III, is to generate the SD pre computed X multiples with sign magnitude format, so as to reduce the XOR gates to one per digit (roughly 75% savings in the number of negating XOR gates) and remove the aforementioned negative impacts. However, besides slowing down the PPG [3] to some extent (e.g., in comparison with radix-5 implementation of [4]), new problems are introduced in PPR, which are explained and solved in the next section, where we also reduce the depth of IPP matrix to n = 16, effectively prior to termination of PPG.

#### IJRECE VOL. 7 ISSUE 2 (APRIL- JUNE 2019)

#### III. EXISTING SYSTEM

Fast radix-10 multiplication, in particular, can be achieved via parallel partial product generation (PPG) and partial product reduction (PPR), which is, however, highly area consuming in VLSI implementations. Therefore, it is desired to lower the silicon cost, while keeping the high speed of parallel realization. Let  $P = X \times Y$  represent an  $n \times n$ decimal multiplication, where multiplicand X, multiplier Y. and product P are normal radix-10 numbers with digits in [0, 9]. Such digits are commonly represented via binary-coded decimal (BCD) encoding. However, intermediate partial products (IPPs) are represented via a diversity of often redundant decimal digit sets. The choice of alternative IPP representations is influential on the PPG, which is of particular importance in decimal multiplication from two points of view: one is fast and low cost generation of IPPs and the other is its impact on representation of IPPs, which is influential on PPR efficiency. Straightforward PPG via BCD digit-by-digit multiplication [5], [6] is slow, expensive, and leads to n double-BCD IPPs for n×n multiplication (i.e., 2n BCD numbers to be added).

#### ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE)

However, the work of recodes both the multiplier and multiplicand to sign magnitude signed digit (SMSD) representation and uses a more efficient 3-b by 3-b PPG. Nevertheless, following a long standing practice, most PPG schemes use pre computed multiples of multiplicand X (or X multiples). Pre computation of the complete set0,  $1, \ldots 9$  × X, as normal BCD numbers, and the subsequent selection are also slow and costly. A common remedial technique is to use a smaller less costly set that can be achieved via fast carry-free manipulation (e.g., 0, 1, 2, 4, 5 × X) at the cost of doubling the count of BCD numbers to be added in PPR; that is, n double-BCD IPPs are generated, such as 3X = (2X, X), 7X = (5X, 2X), or <math>9X = (5X, 4X).

#### IV. PROPOSED SYSTEM

We aim to take advantage of [-5, 5] SMSD recoding of multiplier and dynamic negation of X multiples, while reducing the number of XOR gates via generating [-6, 6] SMSD pre-computed X multiples (i.e., just one XOR gate per 4-b digit). Other contributions of this paper are highlighted below.



Fig. 1: Block diagram of the proposed multiplier.

- A. Starting the PPR with 16 Partial Products: An especial on the fly augmentation of two middle SMSD digits leads to reducing the depth of partial product matrix by 1, such that the PPR starts with 16 operands right at the end of PPG, with no delay penalty for the latter.
- B. Special 4-in-1 SMSD Adder with TCSD Sum: To avoid the challenging addition of SMSD IPPs, we design a novel carry-free adder that represents the sum of two [-6, 6] SMSD operands in [-7, 7] two's complement signed-digit (TCSD) format, where one

- unified adder is utilized for all the four possible sign combinations.
- C. Improved TCSD Addition: The rest of the reduction process[7] uses special TCSD adders that are actually an improved version of the fast TCSD adder. Such 2:1 reduction promotes the VLSI regularity of the PPR circuit, especially for n = 16.
- D. Augmenting the Final Redundant to Non redundant Conversion with the Last PPR Level: The last PPR level would normally lead to TCSD product, which should be converted to BCD. However, to gain more

#### IJRECE VOL. 7 ISSUE 2 (APRIL- JUNE 2019)

speed and reduce costs, we device a special hybrid decimal adder with two TCSD inputs and a BCD output.

Recoding of Multiplier's Digits: Original BCD digits of multiplier require[0,9]×X precom-puted multiples, which include hard multiples {3,6,7,9}×X that unlike {2,4,5,8}×X are not derivable without carry propagation. On the other hand, BCD-to-redundant [-5,5] SMSD recoding of multiplier's digits with dynamic negation of IPPs reduces the required X multiples to [0,5]×X that include only one hard multiple (i.e., 3X). However, this recoding produces a carry as the (n+1)th digit of multiplier, which increases the number of IPPs by 1. This is especially not desirable for n = 16 (i.e., the recommended IEEE754-2008 word size for decimal operands [12]).

Partial Product Reduction The overall PPR for n = 16 is illustrated by Fig. 2, where a bar, triangle, square, and diamond represent a BCD, [-6, 6] SMSD, [-7, 7] TCSD, and

#### ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE)

binary signed digit (BSD), respectively. The choice of SMSD representation for the first level IPPs, while facilitating the PPG, bears no extra complexity for PPR, since all reduction levels use TCSD adders, except for the first one that requires a special SMSD+SMSD-to-TCSD added.

We reduce the matrix depth to n (e.g.,  $5 \rightarrow 4$  forn = 4, and 17  $\rightarrow$  16 for n = 16), with no delay between the termination of PPG and start of PPR. Here is how it works: we compute sum of the two gray digits (see Fig. 3) independent of (and in parallel with) normal PPG as follows. If  $Yn-1 \le 4$ , the value of 10n-weighted carry of recoded multiplier is zero, so the bottom gray digit has to be zero. Therefore, no addition is required. For Yn-1 > 4, let H denote the most significant digit of  $Xn-1 \times Y0$  (e.g., the top gray digit in Fig. 3), where Xn-1 and Y0 represent the most significant BCD digit of multiplicand[8] and the least significant recoded digit of multiplier, respectively.



Fig. 2:Partial Product Reduction.

Special 4-in-1 SMSD Adder: A digit slice of the aforementioned SMSD+SMSD-toTCSD adder for four different cases corresponding to all possible combinations of the input signs is depicted by Fig. 3. (A posibit is a normal bit whose arithmetic value equals its logical status, and the arithmetic value of a nega bit with logical status x equals x - 1 [9].) The sum of two [-6, 6] SMSD digits (e.g., P = sp p2 p1 p0 and Q = sqq2q1q0), and a signed carry in (e.g., Cin) is produced as one [-7, 7] TCSD digit (e.g., S = s3s2s1s0), and a

signed carry out (e.g., Cout). This is a two-stage process. In the stage I, the sign bits are applied to the magnitudes, such that a negative sign changes the polarity of magnitude posibits to negabits and inverts their logical states. Subsequently, in the same stage, the bit collection U is decomposed, and the bit collection V is recoded. In the second stage, however, as will be explained shortly, only one 4-b adder takes care of all the four cases, which explains the rationale for designation of the adder.



*Fig. 3: Digit slice of the 4-in-1 SMSD+SMSD \rightarrow TCSD adder.* 

To speed up the latter two steps (i.e., 2:1 reduction and TCSD-to-BCD conversion), the actual BCD product generation of Fig. 5 uses a more efficient method which is described below. The final 2:1 reduction level that is required for positions 8 to 22 and the subsequent TCSD-to-BCD conversion can be actually augmented as a TCSD + TCSD addition with BCD result. The 4-in-1 Design: Other encodings are also possible for Z and V values. For example, an alternative encoding for both  $Z \in [-2,3]$  and  $V \in [-3,1]$  of Fig. 3 is  $\circ \bullet \bullet \in [-4,3]$  which covers the latter two intervals. However, the proposed encodings are so chosen to allow for unified treatment of the bit collections that are obtained after the decomposition and recoding. That is, a simplified 4-b adder can take care of all the four cases. This is actually possible via the standard full adders that are capable of handling all the 3-b posibit/negabit collections of inputs [10]. Note that the normally required leftmost Half Adder is reduced to an OR gate since no carry out is expected.

The aforementioned decomposition and recoding can be further justified by close examination of the content, where the range of P + O determines the possible values for Cout, which

always lead to  $S = 2Z + V + Cin \in [-7,7]$ , as shown in the rightmost column. The (cin,cin) pair represents the incoming signed carry Cin from the less significant position. Representations of Z, V, and Cin are so determined as to lead to two's complement representation for S in all the four cases (see below for more explanations, and the following numerical example). Example 1: (Fig. 3 by numerical values): where two SMSDs P = sp101 (|P|=5) and Q = sq100 (|Q|=4) are added. Fig. 3 with numerical values, where signs (i.e., sp and sq) are explicitly shown as was the case in Fig. 3, and negabits are inversely encoded as 1-(0-), which represent the arithmetic value 0(-1). The incoming signed carry Cin = 0 is represented by the posibit cin = 0 and inversely encoded negabit c in = 1-. Therefore, the Full Adder in position 0 receives two negabits and one posibit and produces a posibit sum 1 and a negabit carry 0-, such that  $2 \times (-1)+1 = -1$ , as there was only one arithmetically nonzero input 0- (i.e., -1). The 4-in-1 adder is slightly more efficient than [-7,7] TCSD adder (i.e., less latency with no area overhead), as can be verified by inspecting (4) and (5) for the preprocessing logic boxes in 4-in-1 adder and that of TCSD adder [i.e., (6)].

## V. SIMULATION RESULTS 5.1 Existing Method Results

| Device Utilization Summary (estimated values) |      |           |             |  |  |
|-----------------------------------------------|------|-----------|-------------|--|--|
| Logic Utilization                             | Used | Available | Utilization |  |  |
| Number of Slice Registers                     | 493  | 126800    | 0%          |  |  |
| Number of Slice LUTs                          | 8427 | 63400     | 13%         |  |  |
| Number of fully used LUT-FF pairs             | 455  | 8465      | 5%          |  |  |
| Number of bonded IOBs                         | 275  | 210       | 130%        |  |  |
| Number of BUFG/BUFGCTRLs                      | 2    | 32        | 6%          |  |  |

Fig,5.1: Design summary



Fig.5.2: RTL schematic



Fig. 5.3: Simulation results

x<0> (PAD) Source: p<7> (PAD) Destination: Data Path: x<0> to p<7> Net Gate Cell:in->out fanout Delay Delay Logical Name (Net Name) IBUF: I->0 13 1.106 0.905 x 0 IBUF (Madd inv x lut<0>) 3 0.612 0.520 Madd inv x xor<3>11 (inv x<3>) LUT4: I1->0 LUT3:I1->0 1 0.612 0.000 spp\_0\_mux0000<3>1 (spp\_0\_mux0000<3>1) MUXF5: I1->0 3 0.278 0.603 spp 0 mux0000<3> f5 (spp 0 mux0000<3>) LUT4:10->0 1 0.612 0.509 Madd prod cy<4>1 SWO (N30) LUT4:I0->0 3 0.612 0.520 Madd prod cy<4>1 (Madd prod cy<4>) LUT2: I1->0 1 0.612 0.357 Madd prod xor<5>11 (p 5 OBUF) 3.169 p 5 OBUF (p<5>) OBUF: I->O

Total 11.028ns (7.613ns logic, 3.415ns route) (69.0% logic, 31.0% route)

Fig. 5.4: Time Summary 5.2 Proposed Results



Fig. 5.5:one hot recoder output





Fig. 5.8: partial products reduction output



Figure 5.9:- double sd to bcd converter output



Fig.5.10: multiplier output 5.3 Extension Results



Fig. 5.11: RTL schematic output

| Device Utilization Summary (estimated values) |      |           |      |             | [-] |
|-----------------------------------------------|------|-----------|------|-------------|-----|
| Logic Utilization                             | Used | Available |      | Utilization |     |
| Number of Slices                              |      | 21        | 4656 |             | 0%  |
| Number of 4 input LUTs                        |      | 37        | 9312 |             | 0%  |
| Number of bonded IOBs                         |      | 16        | 232  |             | 6%  |

Fig.5.12: design summary output

Data Path: b2/r BCD 6 to b2/r BCD 15 Gate Net Delay Logical Name (Net Name) Cell:in->out fanout Delay FDE:C->Q 5 0.361 0.575 b2/r BCD 6 (b2/r BCD 6) 1 0.097 0.295 b2/ n0185 inv3 (b2/ n0185 inv3) LUT4:I0->0 1 0.097 0.379 b2/ n0185 inv4 (b2/ n0185 inv4) LUT6:I5->0 16 0.097 0.348 b2/ n0185 inv5 (b2/ n0185 inv5) LUT6:I4->0 FDE:CE 0.095 b2/r BCD 0 2.345ns (0.747ns logic, 1.598ns route) Total (31.9% logic, 68.1% route)

Fig. 5.13: time summary output



Fig.5.14: extension simultiuon output

#### VI. CONCLUSION

In this paper we have presented several techniques to implement decimal parallel multiplication in hardware. We propose three different SD encodings for the multiplier that lead to fast parallel and simple generation of partial products. For partial product reduction we have developed a decimal carry-save algorithm based on a BCD-4221 representation of decimal digit operands. It makes possible the construction ofp:2decimal CSA trees that outperform the area-delay figures of existing proposals. Moreover, proposed techniques also allow the computation of combined binary/decimal multiplications with a moderate overhead. We have proposed architecture for decimal SD radix-10 parallel multiplication and two combined architectures for binary/decimal SD radix-4 and binary SD radix-4/decimal SD radix-5 multiplication. The area-delay figures from a comparative study including conventional binary parallel multipliers and other representative decimal proposals show that our decimal SD radix-10 multiplier is an interesting option for high performance with moderate area.

#### **REFERENCES**

- [1]. IEEE standard for floating-point arithmetic. IEEE Standards Committee, Oct. 2006.
- [2]. F. Y. Busaba, T. Slegel, S. Carlough, C. Krygowski, and J. G. Rell. The design of the fixed point unit for the z990 microprocessor. InProc. ACM Great Lakes14th Symposium on VLSI, pages 364–367, Apr. 2004.
- [3]. M. F. Cowlishaw. Decimal floating-point: Algorism for computers. InProc. IEEE16th Symposium on Computer Arithmetic, pages 104–111, July 2003.
- [4]. M. A. Erle and M. J. Schulte. Decimal multiplication via carry-save addition. InProc. IEEE Int'l Conference on Application-Specific Systems, Architectures, and Processors, pages 348–358, June 2003.
- [5]. M. A. Erle, E. M. Schwarz, and M. J. Schulte. Decimal multiplication with efficient partial product generation. In Proc. IEEE17th Symposium on Computer Arithmetic, pages 21–28, June 2005.

- [6]. R. D. Kenney and M. J. Schulte. High-speed multioperand decimal adders. IEEE Trans. on Computers, 54(8):953–963, Aug. 2005.
- [7]. R. D. Kenney, M. J. Schulte, and M. A. Erle. High-frequency decimal multiplier. InProc. IEEE Int'l Conference on Computer Design: VLSI in Computers and Processors, pages 26–29, Oct. 2004.
- [8]. T. Lang and A. Nannarelli. A radix-10 combinational multiplier. InProc.40th Asilomar Conference on Signals, Systems, and Computers, pages 313–317, Oct. 2006.
- [9]. R. H. Larson. High-speed multiply using four input carrysave adder. IBM Tech. Disclosure Bulletin, 16(7):2053–2054, Dec. 1973.
- [10]. N. Ohkubo and M. Suzuki. A 4.4 ns CMOS 54x54-bit multiplier using pass-transistor multiplexer. IEEE Journal of Solid State Circuits, 30(3):251-256, Mar. 1995.