A Bit-Parallel Unrolled CORDIC

Center for New Music and Audio Technologies



Instead of buffering the output of one iteration and reusing the same resources, one can simply cascade the iterative CORDIC, that is, rebuild the basic CORDIC structure for each iteration. The output of one stage then becomes the input of the next, as shown in Figure 1.5, and having separate stages makes two simplifications possible. First, the shift operation of each step can be realized by wiring the connections between stages appropriately. Second, the constants no longer change within a stage and can therefore be hardwired as well.

**Figure 1.5:** *Unrolled CORDIC*
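The cascade can be modelled in software to make the two simplifications concrete. The following is a minimal behavioural sketch, not the design described here: the fixed-point format (`FRAC_BITS`), the stage count (`N_STAGES`), and all function names are assumptions chosen for illustration. Each call to `cordic_stage` uses a fixed shift amount `i` and a fixed constant `ATAN_TABLE[i]`, which correspond to the wired shifts and hardwired constants in the text.

```python
import math

WORD_BITS = 16   # word length named in the text
FRAC_BITS = 14   # assumed format: sign bit, 1 integer bit, 14 fraction bits
N_STAGES = 13    # assumed number of cascaded stages
SCALE = 1 << FRAC_BITS

# One rotation angle per stage; in hardware each would be a hardwired constant.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * SCALE) for i in range(N_STAGES)]

# Pre-scale the start vector by 1/gain so the result needs no final multiply.
GAIN = 1.0
for i in range(N_STAGES):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
X0 = round(SCALE / GAIN)


def cordic_stage(i, x, y, z):
    """Stage i of the cascade: the shift by i models fixed wiring."""
    if z >= 0:
        return x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
    return x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]


def unrolled_sin_cos(angle):
    """Return (sin, cos) for an angle in radians within [-pi/2, pi/2]."""
    x, y, z = X0, 0, round(angle * SCALE)
    for i in range(N_STAGES):      # in hardware: N_STAGES separate stages
        x, y, z = cordic_stage(i, x, y, z)
    return y / SCALE, x / SCALE
```

The loop only stands in for the chain of stages; no value is fed back, so every stage sees each input exactly once. The sketch does not mask intermediate values to `WORD_BITS`, which is harmless here because all values stay within the signed 16-bit range for the stated input interval.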

The purely unrolled design consists only of combinatorial components and computes one sine value per clock cycle. Input values propagate through the architecture on their own and need no control logic.
The resources of an FPGA are not well suited to this kind of architecture. For a bit-parallel unrolled design with a 16-bit word length, each stage has 48 inputs and outputs plus a large number of cross-connections between stages. These cross-connections, which run from the x-path through the shift components to the y-path and vice versa, make the design difficult to route in an FPGA and add delay. Table 1.1 shows how performance and resource usage change with the number of iterations when the design is implemented in a XILINX XC4010E FPGA. The area, and with it the maximum path delay, increases as stages are added; the maximum path delay determines the speed at which the application can run.

Table 1.1: Performance and CLB usage in an XC4010E

| No. of iterations | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|
| Complexity [CLB] | 184 | 208 | 232 | 256 | 280 | 304 |
| Max. path delay [ns] | 163.75 | 177.17 | 206.9 | 225.72 | 263.86 | 256.87 |
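Two quantities can be read off Table 1.1 with simple arithmetic: the CLB cost of each added stage, and the clock rate implied by each path delay. The short sketch below does this with the table values; the variable names are illustrative, the frequency is simply the reciprocal of the delay, and the reading of the 48 inputs and outputs as three 16-bit paths (x, y, z) is an assumption.

```python
# Values copied from Table 1.1.
iterations = [8, 9, 10, 11, 12, 13]
clbs = [184, 208, 232, 256, 280, 304]
delay_ns = [163.75, 177.17, 206.9, 225.72, 263.86, 256.87]

# CLBs added by each additional stage.
clb_per_stage = [b - a for a, b in zip(clbs, clbs[1:])]

# Upper bound on the clock rate: f [MHz] = 1000 / delay [ns].
max_freq_mhz = [1000.0 / d for d in delay_ns]

# Assumption: 48 signals per stage = three 16-bit paths (x, y, z).
io_per_stage = 3 * 16

for n, f in zip(iterations, max_freq_mhz):
    print(f"{n:2d} iterations: {f:.2f} MHz")
print("CLBs per added stage:", clb_per_stage)
```

Every added stage costs the same number of CLBs, while the delay does not grow uniformly, which is consistent with routing contributing to the delay.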

As described earlier, area in FPGAs can be measured in CLBs, each of which consists of two lookup tables as well as storage cells with additional control components [12]. In the purely combinatorial design, the CLBs' function generators perform the add and shift operations and no storage cells are used. Registers can therefore be inserted easily without significantly increasing the area. Pipelining adds some latency, of course, but the application needs to output values at 48 kHz, and the latency for 14 iterations equals 312.5 $\mu$s, which is known to be imperceptible. More importantly, inserting registers between stages also reduces the maximum path delay, so a correspondingly higher maximum clock frequency can be achieved. Table 1.2 shows how the area versus speed trade-off is affected by different pipelining methods.

Table 1.2: Performance and CLB usage for various methods of pipelining in an XC4010E

| No. of iterations between registers | 1 | 2 | 3 | 4 | 8 | 13 |
|---|---|---|---|---|---|---|
| Complexity [CLB] | 313 | 308 | 304 | 304 | 304 | 304 |
| Max. frequency [MHz] | 24.4 | 18.3 | 14.2 | 9.7 | 6.2 | 3.7 |

The values are taken from the report files generated by the XILINX Foundation Series software when implementing the unrolled designs. The number of CLBs stays almost the same while the maximum frequency increases as registers are inserted, because the amount of combinatorial logic between sequential cells decreases. The gain in speed from inserting registers thus outweighs the cost in area, which makes the fully pipelined CORDIC a suitable solution for generating a sine wave in FPGAs. This type of architecture becomes increasingly attractive when a sufficient number of CLBs is available, as is the case in high-density devices such as XILINX's Virtex or ALTERA's FLEX families.
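The behaviour of the fully pipelined variant (one register set after every stage) can be illustrated with a small cycle-based model. This is a sketch under assumptions, not the implemented design: the fixed-point format, the stage count, and the names `stage` and `run_pipeline` are hypothetical. On every simulated clock edge all registers update at once, so a new angle can enter each cycle and, once the pipeline is full, one sine value leaves each cycle.

```python
import math

FRAC_BITS = 14   # assumed fixed-point format
SCALE = 1 << FRAC_BITS
N_STAGES = 13    # assumed number of stages, one register set after each

ATAN_TABLE = [round(math.atan(2.0 ** -i) * SCALE) for i in range(N_STAGES)]
GAIN = 1.0
for i in range(N_STAGES):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
X0 = round(SCALE / GAIN)


def stage(i, x, y, z):
    """Combinatorial logic of stage i (fixed shift, fixed constant)."""
    d = 1 if z >= 0 else -1
    return x - d * (y >> i), y + d * (x >> i), z - d * ATAN_TABLE[i]


def run_pipeline(angles):
    """Clock the pipeline; return (cycle, sine) for every valid output."""
    regs = [None] * N_STAGES                  # regs[i]: register after stage i
    outputs = []
    # Extra idle cycles let the last inputs drain out of the pipeline.
    stream = list(angles) + [None] * (N_STAGES - 1)
    for cycle, a in enumerate(stream):
        new_input = None if a is None else (X0, 0, round(a * SCALE))
        stage_inputs = [new_input] + regs[:-1]    # values seen before the edge
        regs = [None if v is None else stage(i, *v)
                for i, v in enumerate(stage_inputs)]
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1][1] / SCALE))
    return outputs


for cycle, s in run_pipeline([0.0, 0.5, 1.0]):
    print(f"cycle {cycle}: sin = {s:.4f}")
```

In this model the first result appears after `N_STAGES` clock edges and the following results arrive on consecutive cycles. Between two register sets there is only one stage of logic, which is why the shortest register spacing in Table 1.2 reaches the highest clock frequency.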


Norbert Lindlbauer
2000-01-19