# Implementation of WFTA Structure in FFT processor based on memory

E.PRAVALLIKA<sup>1</sup>, M.KIRAN<sup>2</sup>, K.JAMAL<sup>3</sup>, MANCHALLA.O.V.P.KUMAR<sup>4</sup>

<sup>1</sup>PG Scholar, Dept of ECE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana.

<sup>2</sup>Associate Professor, Dept of ECE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana.

<sup>3</sup>Associate Professor, Dept of ECE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana.

<sup>4</sup>Assistant Professor, Dept of ECE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana.

**ABSTRACT:** This paper demonstrates the designing and implementation of the WFTA Structure in FFT processor design based on memory. A High-Radix Small Butterfly unit(HRSB) technique is utilized in the decomposition calculation, for eliminating the complexity of Processor engine and also to diminish the computation cycles. These FFT cases are considered in Long term evolution systems. When considering the WFTA the area is reduced comparing with the previous method. In this 12-point FFT processor is implemented by utilizing the proposed WFTA structure.

**Keywords:** LTE systems, FFT processor structure based on memory, Winogard Algorithm, Butterfly Unit.

# I. INTRODUCTION

The Fast Fourier Transform is normally utilized in DSP technologies like Communication systems, Spectrum analyzing, Image process applications. The lengths of 128-2048 in 3 GPP Long Term Evolution systems[1] is applied in communication systems and the lengths as high as  $2^{22}$  [2] is required in Image Process applications.

The FFT Processors are partitioned into two structures: Memory-based and Pipeline structures. In Memorybased FFT the information is passed through single butterfly PEs many times, with more number of memory-banks. Memory-banks are utilized to hold the intermediate results and the information is processed recursively irrespective of their sizes. The pipeline structure is utilized for high information rates, more area and power compared to memory-based. Therefore, FFT based on memory achieves more hardware efficiency.

A novel in-place strategy and a CFMR FFT processor is illustrated which uses mixed-radix algorithms. This in-place technique is supported only for fixed-radix algorithms [3]. In conventional FFTs based memory [4] and [5], the complex PEs are replaced with the Multi-path Delay Commutator structures with high radix. In this addressing schemes it does not extend the principle of decomposition algorithm and arbitrary length FFTs. A traditional  $2^n$ -point FFT and prime-sized FFTs are explained with generalised mixed-radix algorithms in [6] and [7]. The architecture of 2,3,4 or 5 DFT is designed based on WFTA[8] for FFT processor designing. The complexity of the WFTA is more in terms of adders/subtractors and multipliers. So to reduce the complexity of the design a new reconfigurable WFTA structure is presented in this paper.

The rest of this paper is sorted as follows. In phase II, the related work is associated with this paper. The FFT processor structures for LTE systems are illustrated in phase III. In phase IV, the results and overall performance comparisons are analyzed. In phase V, the conclusion is given.

#### II BACKGROUND

The definition of DFT is characterised by the equation  $X_k = \sum_{n=0}^{N-1} x_n e^{-i2\pi kn/N}$  k = 0,1,...N-1 (1)

In [9] and [10], the strategy for disintegrating one-spatial DFTs into multi-spatial DFTs is illustrated. If the particular size N is deteriorated in to  $N_1N_2$ , the mapping index can be explained as accompanied:

$$n = \langle K_1 n_1 + K_2 n_2 \rangle N, n_1, k_1 = 0, 1, 2, \dots, N_1 - 1$$
  

$$k = \langle K_3 k_1 + K_4 k_2 \rangle N, n_2, k_2 = 0, 1, 2, \dots, N_2 - 1$$
(2)

where  $\langle . \rangle$  denotes modulo-*N* operation. If  $N_1$  and  $N_2$  are not generally prime, we can acquire

$$K_1 = N_2, K_2 = 1, K_3 = 1, K_4 = N_1$$
 (3)

Thus, the DFT is given as

$$\begin{split} C(k_{1},k_{2}) = & \sum_{n=0}^{N-1} x(n1,n2) W_{N}^{N_{1}n2k2} W_{N}^{N_{2}n1k1} W_{N_{1}}^{n2k1} \\ = & \sum_{n2=0}^{N_{2}-1} \{ \sum_{n1=0}^{N-1} x(n1,n2) W_{N_{1}}^{n1k1} W_{N}^{n2k1} \} W_{N_{2}}^{n2k2} \end{split}$$

If  $N_1$  and  $N_2$  are generally prime, then the index mapping for PFA is obtained as

Then the meaning of the DFT can be

$$\begin{split} C(K_1, K_2) = & \sum_{n=0}^{N-1} x(n1, n2) W_N^{N_1 N_1 N_1^{-1} n2k2} W_N^{N_2 N_2 N_2^{-1} n1k1} \\ &= \sum_{n=0}^{N-1} x(n1, n2) W_{N_2}^{N_1 N_1^{-1} n2k2} W_{N_1}^{N_2 N_2^{-1} n1k1} \\ &= \sum_{n2=0}^{N_2 - 1} \{ \sum_{n1=0}^{N-1} x(n1, n2) W_{N_1}^{n1k1} \} W_{N_2}^{n2k2} \end{split}$$
(6)

The N-point DFT can be decomposed into  $N_1$  and  $N_2$  points respectively for both CFA and PFA. In this algorithms for computation input index mapping is utilized and for

reordering the output index mapping is utilized. The mapping technique of PFA is complicated than that of CFA.

# III FAST FOURIER TRANSFORM PROCESSOR STRUCTURE

In the LTE system it exists two types of FFT modules: type1 is the  $2^n$  mode (n = 7-11) FFT and the other one is the 35 distinct-size Non-singular power point DFT. In that the range of transform sizes is from 12 to 1296. The  $2^n$ -point FFT and the NSPP DFT both structures are same. But only the few details are distinct. The figure.1 shows the structure of whole FFT processor.



Figure.1.Blockade illustration of FFT processor

The FFT processor structure consists of major parts : as follows the input and output indexes of vector generator, the address generator of a computation, and there are consisting of two distinct bank groups of memory. The HRSB schemes is included with a PE unit, and in between the memory and the PE unit there exists few more commutator, some cache twiddle factors, and parameters which are under the FFT sizes, and the scaling exponent unit. In the following figure.1 the control signals represent with the dashed lines while the flowing data represent the real lines. The index of vector generator input serves the input information without any data conflicts with distinct memory banks, and the natural sequence of the output data is recorded by the output index of vector generator. The concurrent information of every cycles are collected by computation address generator and intermediate results will be restored. The 1st and 2nd Memory groups in the mode of ping-pong in that the input sampling will hold the two continuous information symbols, and computed output information. In these memory groups are commonly existing in Non-singular power point FFT. In between the memories and HRSB unit there will be commutator that grants accurate data routing mechanism, the computation address generator will controls these routing mechanism. The parameters of a twiddle factors and the FFT size are comes under cache register files, where we configure working modes of FFT. The operations of floating-point will control the exponent scaling unit, which is responsible for degrading storage of memory and improving the signal-to-quantization noise ratio.

#### ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE)

#### A. INDEX VECTOR GENERATOR

To distribute the input information to particular memory locations, an index vector generator is utilized. The output information of common sequence is utilized for reordering the information. The index counter counts the number of inputs and all the inputs are transferred to memory-bank. For this FFT processor the index counter ranges from 0 to 1295. For this FFT sizes we use two memory-banks, and the address and memory-bank are generated through trivial factors. The elements of PFA map for distinct FFTs lengths are prestored in register documents of configurable information which consists of bank numbers.

According to the generation of address calculation the P concurrent information index of the PE are acquired in the computation mode. The memory addresses of each data and bank indices are generated through address generator. If the information is read from distinct memory-banks, so information is transferred to right path of processing element via commutator as proven in figure.2. The bank indexing is controlled by means of switching sample of commutator. The information is given back to the location from where the data is read, when the information is computed inside the PE. Therefore, the commutator output has reverse switch manner of commutator input.



Figure.2. Commutator structure

#### B. HRSB ARCHITECTURE FOR RADIX-8/9/12/15/16/25 BUTTERFLIES

According to the deterioration calculation, which includes base-8/9/12/15/16/25 is executed by utilizing the HRSB schemes. The HRSB structure consists of 2-stage combined base-2/3/4/5 Butterfly core, Trivial twiddle factors between two butterflies, a switch box, with some distinct delay length elements & MUXs. To assemble a one-level calculation stages, the 2 small Butterfly units are linked in pipeline way.



Figure.3 Two stage MDC unit architecture



Figure.4 Base-25 BU of Butterfly operation.

The 2-stage base-25 butterfly unit of timing operation is proven in figure.4. The 5 concurrent information is given to the starting butterfly unit, in the first butterfly unit level. The information is eventually be given to the element of delay & then to the switch unit to modify the intervals of the information for the 2-BU. After re-ordering the calculated information will go to the memory-bank places. For base-9,16,25 butterfly unit, the approaches are same. When HRSB structure is utilized for base-8 butterfly unit, base-4 is utilized within the starting level, and in the next level the base-2 is utilized for processing speed. After completing the calculations of the starting level, the next level is calculated immediately. The base-8,9,12,15,16,25 DFT is completed within the 2,3,4,5,4 & 5 cycles by utilizing the HRSB schemes.

# C. UNIFIED WFTA BUTTERFLY CORE FOR BASE 2, 3, 4, 5 POINT BUTTERFLIES:

A new WFTA core implements a small base- 2/3/4/ 5 butterflies, in butterfly mechanism. The prime factor DFT of WFTA is given as

$$[X(0), \dots \dots X(N-1)]^{T} = O \times M \times I [x(0), \dots x(N-1)]^{T}$$
(2)

where I resembles the pre-accessions between the information sources, M resembles a diagonal matrix of collusive multiplications and O resembles the post-accessions of the results after multiplications. In [8], a reconfigurable WFTA core structure two, three, four or five point DFT based on memory is designed as shown in figure.5. In reconfigurable mode, to implement 12-point FFT processor the butterfly core should complete with two stages. Base-3 is the first stage and other stage is base-4.

When the number of adders and multipliers are more, so the area occupied in this design is more. A new design is proposed using DFFs and eliminating some adders. When these DFFs are used then the area is diminished. The proposed diagram for WFTA is explained below. ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE)



Figure.5 Base-2,3,4,5 point DFT using WFTA structure.

## D. PROPOSED COMBINED WFTA CALCULATION FOR THE BASE- 2/3/4/5 POINT BUTTERFLY

Consider a direct mapping of the 5-point DFT onto which all the different sizes are mapped. The main challenge is to reduce the number of multiplexers for 5-point DFT the number of operation is largest, so there is enough computational resources available. The signal flow graphs are mapped without using multiplexers to find out the common parts. By setting the inputs of the circuit to zero or by setting the coefficient of multiplexer to zero, the multiplexer can be avoided. So unnecessary connections are removed in this circuit. The multiplexer can be bypassed by setting the coefficient of multiplexer value to one. By using this technique, the multiplexers are controlled with control signals.



Figure.6 WFTA diagram for base-2/3/4/ or 5-point FFT

The reconfigurable architecture is shown in figure.6 and it has three single-gate multiplexers and one 3:1 multiplexer. The Multi-radix compositions in [6], are complex compared to combined WFTA core, and it also utilizes more number of MUXes. The required coefficients & control signals of the MUXes, the input and output relations, are all proven in Table I. In Table I the dashed lines indicates that it is an irrelevant constrain. The constant multipliers are constructed using adder and shifters to implement the middle coefficient multiplications.

#### ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE)

| Size           | 1    | Input Mapping |      |      |      | Output Mapping |      |      | Multiplication Coefficients |       |      |      | control          |      |          |          |
|----------------|------|---------------|------|------|------|----------------|------|------|-----------------------------|-------|------|------|------------------|------|----------|----------|
|                | y0   | y1            | y2   | y3   | y4   | YO             | Y1   | Y2   | Y3                          | ¥4    | c0   | c1   | c2               | c3   | c4       | s0s1s2s3 |
| 2 <sup>b</sup> | 0    | x(0)          | x(1) | 0    | 0    | X(0)           | X(1) |      | $X^{'}(0)$                  | X (1) | C 30 | 0    | -C <sub>31</sub> | 0    | 0        | 1021     |
| 3              | x(0) | x(1)          | x(2) | 0    | 0    | X(0)           | X(1) | X(2) | •                           | -     | 0    | 0    | -1               | 0    | 0        | 00       |
| 4              | 0    | x(0)          | x(2) | x(1) | x(3) | X(0)           | X(2) | -    | X(1)                        | X(3)  | 0    | -1   | 0                | 1    | $C_{41}$ | 1110     |
| 5              | x(0) | x(2)          | x(3) | x(1) | x(4) | X(0)           | X(1) | X(4) | X(2)                        | X(3)  | C 50 | C 51 | C 52             | C 53 | C 54     | 0000     |

Table 1. Information about the I/O mappings in WFTA Core

# IV RESULTS AND ANALYSIS

The simulation results of the FFT processor design with 12-point FFT length is shown in figure.7, using reconfigurable WFTA structure. This method is synthesized in Xilinx ISE 14.7 targeting for Spartan3E xc3s1600e device 5fg484 package with a speed grade of -5.

A. RTL Schematic.



Figure.7. Synthesis Result of the proposed FFT Processor design

# **B.** Simulation Results



Figure.8. Simulation Results of the proposed FFT Processor design

#### C. Calculation of Computation Cycles.

When considering a, 972-point DFT it is deteriorated into  $4 \times 3 \times 9 \times 9$ , in [6]. The base-3 delay elements are utilized for computing the 9-point DFT, within 3 cycles, and base-4 and base-3 DFTs is computed within a single cycle. Consequently, the overall computing time is 243+324+ 324+324 = 1215, that is greater than the computing size. So there is more complex for information routing mechanism. To obtain only base-2/3/4/5 butterflies, the HRSB structure is configured to bypass the first BU. The required computation butterflies into one processing elements are combined by utilizing the HRSB structure. So we are combining the elements 3 & 4, to diminish the calculation time, so the achieved texture is 9 x 9 x 12. The base-9 and base-12 is implemented. The total computation cycle for 972-point FFT is 2x3x(972/9)+4x(972/12) = 972. Thus the computation cycles of different FFTs which applies the HRSB structure can be obtained by  $T = \sum_{i=1}^{m} T_i$ .  $N/N_i$ . Where n represents the computation stage,  $N_i$  represents the one's stage butterfly radix, and  $T_i$  represents the butterflies computation cycles.

#### D. Comparison Results

| Table II | Comparison | of different | power-point | FFTs |
|----------|------------|--------------|-------------|------|
|          |            |              | P           |      |

|                      | Algorith<br>m        | parallelism | Process<br>cycle                  | Mem<br>size | Mem<br>Bank | Com<br>Mul. | Com<br>Add. |
|----------------------|----------------------|-------------|-----------------------------------|-------------|-------------|-------------|-------------|
| Huang[11]            | Radix-4 <sup>2</sup> | 16          | N(log <sub>2</sub> N)/64<br>(Yes) | 2N          | 16          | 16*0.<br>76 | 112         |
| Xiao[12]             | Radix-2              | 2           | N(log <sub>2</sub> N)/2<br>(No)   | N           | 2           | 1           | 2           |
| Luo[13]              | Radix-2              | 2           | N(log <sub>2</sub> N)/2<br>(No)   | N           | 1           | 1           | 2           |
| Garrido[14]          | Radix-4              | 4           | N(log 2N)/8<br>(No)               | N           | 4           | 3           | 8           |
| Kia-Feng,<br>Xia[15] | Radix-4              | 4           | N(log <sub>2</sub> N)/2<br>(Yes)  | 2N          | 2           | 4           | 12          |
| This work            | Radix-4              | 4           | N(log <sub>2</sub> N)/2<br>(Yes)  | 2N          | 2           | 4           | 8           |

Table III Design Summary for existing and proposed Results

| Device utilization         |           |                              |                           |                                                 |             |  |
|----------------------------|-----------|------------------------------|---------------------------|-------------------------------------------------|-------------|--|
| Logic Utilization          | Available | 12-point<br>existing<br>Stru | FFT with<br>WFTA<br>cture | 12-point FFT with<br>proposed WFTA<br>Structure |             |  |
|                            |           | Used                         | Utilization               | Used                                            | Utilization |  |
| Number of slices           | 14752     | 74843                        | 507%                      | 77169                                           | 523%        |  |
| Number of slice Flip-flops | 29504     | 89800                        | 304%                      | 94448                                           | 320%        |  |
| Number of 4-input LUTS     | 29504     | 47621                        | 161%                      | 47626                                           | 161%        |  |
| Number of Bonded IOBs      | 376       | 404                          | 107%                      | 406                                             | 107%        |  |
| Number of<br>MULTI8X18SIOs | 36        | 28                           | 77%                       | 16                                              | 44%         |  |
| Number of GCLKs            | 24        | 1                            | 4%                        | 1                                               | 4%          |  |

Table III Comparison Results of the 12-point FFT with existing and proposed results.

| Timing Analysis                             | 12-point FFT with existing<br>WFTA Structure | 12-point FFT with<br>Proposed WFTA Structure |
|---------------------------------------------|----------------------------------------------|----------------------------------------------|
| Minimum Period                              | 22.562ns                                     | 5.865ns                                      |
| Minimum Input arrival<br>time before clock  | 4.995ns                                      | 5.039ns                                      |
| Maximum output required<br>time after clock | 4.040ns                                      | 4.040ns                                      |

To calculate the area, power and delay, the FFT processor is also implemented in cadence tool using the slow\_normal 1.0 technology library. The simulation results of the proposed system is compared as shown in table IV. Form the table, we say that the area is diminished up to 25%, power is reduced up to 0.9% and the delay is same when compared with the existing WFTA design.

| Table IV comparison results of | WFTA structure with |  |  |  |  |
|--------------------------------|---------------------|--|--|--|--|
| different parameters.          |                     |  |  |  |  |

| Report | 12-point FFT with existing<br>WFTA Structure | 12-point FFT with Proposed<br>WFTA Structure |
|--------|----------------------------------------------|----------------------------------------------|
| Area   | 16761485.97nm                                | 4249180.00nm                                 |
| Power  | 7876371.89nw                                 | 7831951.10nw                                 |
| Delay  | 1769.90ps                                    | 1769.90ps                                    |

# V CONCLUSION

The 12-point FFT processor design based on memory with WFTA structure is implemented. Additionally, a deterioration technique, called High-radix small butterfly-unit algorithm is utilized for the design of FFT processor. The WFTA design is intended to decrease complexity and to diminish the area. The WFTA design is executed in Xilinx ISE 14.7, the timing analysis and device utilization are compared. The WFTA structure of FFT processor delay has been reduced up to 5.865ns. This is also supported for distinct arbitrary lengths(0-2048) to accomplish more throughput.

#### ISSN: 2393-9028 (PRINT) | ISSN: 2348-2281 (ONLINE)

The area, power and delay of FFT processor are calculated by using the cadence tool. The technology library used is slow\_normal 1.0. The simulation results of WFTA structure in FFT processor shows that the area has diminished to 25%, the power has diminished to 0.9% and the delay has same when compared with the existing system. The WFTA structure for 7-point FFT and for supporting lengths from 0 to 4096 in LTE systems can be extended in future.

#### REFERENCES

- C.-H. Yang, T.-H. Yu, and D. Markovic, "Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012.
- [2] C.-L. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional DFT IP generator for FPGA platforms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 4, pp. 755–764, Apr. 2011.
- [3] B. G. Jo and M. H. Sunwoo, "New continuous-flow mixed radix (CFMR) FFT processor using novel in-place strategy," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 5, pp. 911–919, May 2005.
- [4] S.-Y. Lee, C.-C. Chen, C.-C. Lee, and C.-J. Cheng, "A low-power VLSI architecture for a shared-memory FFT processor with a mixed-radix algorithm and a simple memory control scheme," in Proc. IEEE Int. Symp. Circuits Systems., pp. 157–160, May 2006.
- [5] P.-Y. Tsai, T.-H. Lee, and T.-D. Chiueh, "Power-efficient continuous flow memory-based FFT processor for WiMax OFDM mode," in Proc. Int. Symp. Intell. Signal Process. Communication., pp. 622–625, Dec. 2006.
- [6] C.-F. Hsiao, Y. Chen, and C.-Y. Lee, "A generalized mixed-radix algorithm for memory-based FFT processors," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 57, no. 1, pp. 26–30, Jan. 2010.
- [7] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, "Hardware efficient mixed radix-25/16/9 FFT for LTE systems," IEEE Trans. Very Large Scale Integration. (VLSI) Syst., vol. 23, no. 2, pp. 221–229, Feb. 2015.
- [8] F. Qureshi, M. Garrido, and O. Gustafsson, "Unified Architecture for 2,3,4,5 and 7-point DFTs based on Winograd Fourier Transform Algorithm," Electron Lett., vol. 49, no. 5, pp. 348-349, May 2013.
- [9] C. Burrus, "Index mapping for multidimensional formulation of the DFT and convolution," IEEE Trans. Acoust, Speech, Signal Process., vol. 25, no. 3, pp.239-242, Jun. 1977.
- [10] D. Kolba and T. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. Acoust, Speech, Signal Process., vol. 25, no. 4, pp.281-294, Apr.1977.
- [11] S.-J Huang and S. G Chen, "A high-throughput radix-16 FFT Processor with parallel and normal input/output ordering for IEEE 802.15.3c systems," IEEE Trans. Circuits syst. I, Reg. Papers, vol. 50, no. 8, pp. 1752-1765, Aug. 2012.
- [12] X. Xiao, E. Oruklu, and J. Saniie, "An efficient FFT engine with reduced addressing logic," IEEE Trans. Circuits System. II, vol. 55, no. 11, pp 1149-1153, Nov. 2008.
- [13] H,F. Luo, Y.J. Liu and M.D. Shieh, "Efficient memory-addressing algorithms for FFT Processor design," IEEE Trans. Very Large Scale Integration. (VLSI) Syst., vol. 23, no. 10, pp. 2162–2172, oct. 2015.
- [14] M. Garrido, M.A. Sanchez, M. L. Lopez-vallejo, and J. Grajal, "A 4096-point radix-4 memory-based FFT using DSP slices," IEEE Trans. Very Large Scale Integration. (VLSI) Syst., vol. 25, no. 1, pp. 375-379, Jan. 2015.
- [15] kai-Feng Xia, Bin Wu, Tao Xiong, and Tian-Chun ye, "A memorybased FFT processor design with generalised efficient conflict free address schemes," IEEE Trans. VLSI Systems., vol.25, no. 6, 1919-1929, June 2017.