Using parallel FFT for multi-gigahertz FPGA signal processing
High-speed fast Fourier transform (FFT) cores are an essential requirement for any real-time spectral-monitoring system. As the demand for monitoring bandwidth grows in pace with the proliferation of wireless devices in different parts of the spectrum, these systems must convert time-domain to frequency-domain ever more rapidly, necessitating faster FFT operations. Indeed, in most modern monitoring systems, it is often necessary to use parallel FFTs to run at sample throughputs of multiple times the pace of the highest clock rate achievable in state-of-the-art FPGAs, such as the Xilinx Virtex-7, taking advantage of wideband A/D converters that can easily attain sample rates of 12.5 Gigasamples/second and more. 
At the same time, as communications protocols become increasingly packetized, the duty cycles of signals that need to be monitored are decreasing. This phenomenon requires a dramatic decrease in scan repeat time, which necessitates low-latency FFT cores. Parallel FFTs can help in this regard as well, since the latency scales down almost proportionally to the ratio of sample rate to clock speed.
For all of these reasons, let’s delve into the design of a parallel FFT (PFFT) with runtime-configurable transform length, taking note of the throughput and utilization numbers that are achievable when using parallel FFT.
Hardware parallelism for FFTs
Due to the complexity of implementing FFTs directly in logic, many hardware designers use off-the-shelf FFT cores from various vendors.  However, most off-the-shelf FFT cores use “streaming” or “block” architectures that process only one or fewer samples per clock, which limits the throughput to the maximum clock speed achievable by the FPGA or ASIC device. A PFFT offers a faster alternative. A PFFT can accept multiple samples per clock and process them in parallel, to deliver multiple output samples per clock. This architecture multiplies the throughput beyond the achievable device clock speed, but comes at an additional cost in area and complexity. Thus, to use a PFFT you will have to make trade-offs in throughput vs. area. The trade-offs for a typical Virtex-7 FPGA design are outlined in Figure 1 and Table 1.
synchronization and dynamic length programmability
Click on image to enlarge
Looking at the table, a few general features can be seen in the trade-off curve:
1. As parallel throughput increases, multiplier (area) utilization increases, with a slightly lower multiple (better than linear).
2. Slower system clocks and timing closure yield sublinear throughput growth as parallelism increases. However, on modern FPGAs this degradation is diminishing.
3. Overall better-than-linear throughput/area growth is realized due to No. 1 and No. 2 above.
4. Latency decreases as parallelism increases.
Note that the specific numbers measured in Table 1 are valid only for a given target and configuration of the FFT. In this case, that is a length of 1024, with 16-bit input, dynamic length programmability (4 through 1024) and flow control. Flow control is very important for applications such as spectral monitoring, where side-channel information is often utilized to change the FFT size (in order to change the resolution bandwidth) or to temporarily stall the FFT while other operations, such as acquisition, are going on. In theory, you can accomplish flow control by inserting buffers before the transform operation. But for acquisition-driven operations like spectral monitoring, it’s not easy to precompute the size of the buffer required, resulting in the need to maintain large, fast and expensive memory banks.
While there are a number of ways to implement FFTs, a parallelized version of the Radix2 Multi-Path Delay Commutator kernel (Radix2-MDC)  works very well as a modular method to create configurable parallel-FFT cores that scale well in advanced FPGA devices. The Radix2-MDC is a classical approach to building pipelined FFTs of varying lengths, as shown in Figure 2a for a 16-length FFT. It breaks the input sequence into two parallel data streams flowing forward with the correct “distance” between data elements that are entering the butterfly (a subelement of FFT algorithms) and that are scheduled by proper delays. The Radix2-MDC is relatively easy to parallelize using a wider data path and vector operations, as shown in Figure 2b. MDC structures also lend themselves easily to flow control and dynamic length reconfiguration, as opposed to single-path delay feedback (SDF) structures, where the incorporation of flow control (stall) signals typically reduces maximum throughput considerably.
Click on image to enlarge
Another choice that can affect scalability is the complex-multiplier implementation—that is, either 4multiply (4M) or 3multiply (3M) structures. Choosing a 3M complex multiply often leads to lower area usage in your design, but at the expense of slower clock speeds.  This trade-off is also very dependent on the DSP hardware of the FPGA device. Below are the most important parameters and the choices we made in the case study that we are about to present:
• Length = 1024
• Input precision = 16 bits
• Radix2-MDC architecture using 4Mult-5add complex multipliers
• Data path precision = 1-bit growth per stage (10 stages / bits for L=1024)
• Dynamic length programmability included
• Optional flow control and synchronization turned on
Case study: A 1024-length parallel FFT
For our case study, we used Synopsys’ Synphony Model Compiler (Synphony MC or SMC) high-level synthesis tool  to explore a set of parallel-FFT results for Xilinx Virtex-7 and Virtex-5 FPGAs.
Synphony MC has a parallel-FFT IP block that is designed specifically to achieve high-quality-of-results (QoR) DSP mapping for advanced FPGA devices like the Virtex-7. This custom IP block instantiates arithmetic primitives in the subsystem underneath to create an architecture based on user-specified options such as length, precision, flow control and dynamic programmability.
Using the vector support and a custom block methodology in Synphony MC, we were able to create a concise, parameterizable parallel Radix2-MDC block that enabled quick instantiation of parallel FFTs across different configurations. Figure 3 shows a subset of a 1024×16 PFFT data path using these custom blocks (called PR2MDC). Each block inputs and outputs a length-32 vector representing 16 complex samples of real and imaginary values, and a length and twiddle-factor address parameter. The fixed-point word length grows 1 bit in each stage, but is parameterized for easy adjustment. Underneath this block is the implementation of Figure 2b using Synphony MC arithmetic operators (multiply, add, shift registers, etc.).
Click on image to enlarge
Although SMC is a high-level design environment, you can use it to achieve specific mapping for a variety of target technologies. In this case, we chose a 4M complex multiply that maps into one DSP48E unit using 18×25 multiply mode. The block is designed to daisychain the twiddle-factor lookup for performance and flow control capability (note that Figure 3 does not show flow control). The SMC parallel-FFT core does not show any significant timing degradation between turning flow control on or off, or using dynamic length programmability; the throughput scalability it offers on advanced FPGA devices is a useful feature when building real-time spectral-monitoring applications.
We used Synphony MC to generate RTL optimized for Virtex-5 and Virtex-7 targets, and then used the 2012.09 release of Synphony Model Compiler and Synplify Pro with Xilinx’s ISE 14.2 tool suite for place-and-route. Table 2 shows the post-place-and-route area and timing results. The observed scalability matches very well to theoretical expectations and the FPGA family’s capabilities, achieving better than 7 Gsamples/s using an X16 parallel configuration with only approximately 6.3 times the resources of the 1X configuration. The better-than-linear area growth is due to multiply operations that become fixed-coefficient as parallelism increases, and thus get implemented more efficiently in logic than if you were using a full multiply in a DSP unit. This happens mostly in the last K-point FFT (see Figure 2b).
Parallel FFTs are required for high-throughput gigasample applications such as spectral monitoring. Our PFFT hardware implementation offers some insight and guidance on the throughput scalability for advanced Xilinx FPGA devices. A case study demonstrates that the parallel FFTs using the parallel Radix2-MDC architecture can effectively achieve throughput of more than 7 GHz on FPGA devices in an X16 parallel configuration. We developed the PFFT block in a high-level design flow using Synphony Model Compiler, with the ability to fine-tune the implementation for DSP hardware mapping across different Xilinx devices.  Synopsys is making this block freely available to all Synphony Model Compiler users.
For more information on Synphony Model Compiler, click here.
3. Shousheng He; Torkelson, M.; “A new approach to pipeline FFT processor,” Parallel Processing Symposium, 1996, Proceedings of IPPS ’96, The 10th International, April 1996
About the authors
Chris Eddington, Senior Technical Marketing Manager, Synopsys Inc.
Baijayanta Ray, Corporate Application Engineer for Synphony Model Compiler Synopsys Inc.
This case study piece was originally published in the first quarter edition 2013 of the Xcell Journal.