# Cooley-Tukey FFT algorithm (C)

This program is under development.
is complete, remove the {{develop}} tag.

The Cooley-Tukey FFT algorithm is a popular fast Fourier transform algorithm for rapidly computing the discrete fourier transform of a sampled digital signal. It applies best to signal vectors whose lengths are highly composite, usually a power of 2.

Here we describe a C implementation of Cooley-Tukey. We begin with a straightforward implementation and then describe various performance improvements. Like most practical implementations, we will concentrate on the case where the vector length is a power of 2.

##  Complex arithmetic

Although the newest version of C, C99, has library support for complex number datatypes and arithmetic, it is not widely supported. We write simple implementations of a few basic operations we need here, with little explanation; for a more detailed discussion, see Category:Complex numbers.

<<complex number datatype>>=
typedef struct complex_t {
double re;
double im;
} complex;

<<complex number prototypes>>=
double  complex_magnitude(complex c);
complex complex_sub(complex left, complex right);
complex complex_mult(complex left, complex right);

<<complex number function definitions>>=
complex complex_from_polar(double r, double theta_radians) {
complex result;
return result;
}

double complex_magnitude(complex c) {
return sqrt(c.re*c.re + c.im*c.im);
}

complex complex_add(complex left, complex right) {
complex result;
result.re = left.re + right.re;
result.im = left.im + right.im;
return result;
}

complex complex_sub(complex left, complex right) {
complex result;
result.re = left.re - right.re;
result.im = left.im - right.im;
return result;
}

complex complex_mult(complex left, complex right) {
complex result;
result.re = left.re*right.re - left.im*right.im;
result.im = left.re*right.im + left.im*right.re;
return result;
}


The use of sin(), cos(), and sqrt() requires the header math.h:

<<complex number headers>>=
#include <math.h>


##  Naïve evaluation of definition

First we discuss the simplest possible DFT solution in order to address some of the issues in a simpler setting. Suppose that our input vector is x and our output vector is X. Then the discrete Fourier transform is defined by the formula:

For each $0 \le k < N$, $X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk}$

The quantity $e^{-\frac{2\pi i}{N} nk}$ is most easily described in terms of its polar complex coordinates: it is on the unit circle (distance r=1 from the origin) and forms an angle θ of $-\frac{2\pi}{N} nk$ radians with the positive real axis. This enables us to evaluate the definition as follows:

<<FFT prototypes>>=
complex* DFT_naive_1(complex* x, int N);
<<Naive DFT>>=
complex* DFT_naive_1(complex* x, int N) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
int k, n;
for(k=0; k<N; k++) {
X[k].re = X[k].im = 0.0;
for(n=0; n<N; n++) {
X[k] = complex_add(X[k], complex_mult(x[n], complex_from_polar(1, -2*PI*n*k/N)));
}
}
return X;
}


We require headers for memory management and need to define PI:

<<FFT headers>>=
#include <stdlib.h>

<<FFT constants>>=
#define PI  3.1415926535897932


This works well for very small inputs and does not place any requirements on N. At the expense of additional space, we can avoid repeatedly recomputing the factors $e^{-\frac{2\pi i}{N} nk}$. Instead, we compute only the N values $e^{-\frac{2\pi i}{N} k}$ for $0 \leq k < N$ and use the fact that eik = 1 to reduce to one of these:

$e^{-\frac{2\pi i}{N} nk} = e^{-\frac{2\pi i}{N} K}$ where $K \equiv nk\pmod N$ and $0 \leq K < N$.

In code:

<<FFT prototypes>>=
complex* DFT_naive_2(complex* x, int N);
<<Naive DFT>>=
complex* DFT_naive_2(complex* x, int N) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
complex* Nth_root = (complex*) malloc(sizeof(struct complex_t) * N);
int k, n;
for(k=0; k<N; k++) {
Nth_root[k] = complex_from_polar(1, -2*PI*k/N);
}
for(k=0; k<N; k++) {
X[k].re = X[k].im = 0.0;
for(n=0; n<N; n++) {
X[k] = complex_add(X[k], complex_mult(x[n], Nth_root[(n*k) % N]));
}
}
free(Nth_root);
return X;
}


This considerably speeds the algorithm in practice. Unfortunately, even with this change, directly evaluating this sum for each k requires O(N2) time overall, which is prohibitive for large N.

##  Radix-2 decimation in time

The radix-2 decimation in time algorithm uses a divide-and-conquer approach to improve efficiency. It begins by separating the input vector x into two smaller vectors d and e, with d containing the odd-numbered elements of the original vector and e containing the even-numbered elements. For example:

x = (1, 2, 3, 4, 5, 6)
d = (2, 4, 6)
e = (1, 3, 5)

(Keep in mind that the indexing starts at zero). We then recursively compute the FFTs of d and e, call them D and E. We use them to construct X according to the following formula, which can be shown directly from the definition:

$\begin{matrix} X_k & = & E_k + e^{-\frac{2\pi i}{N}k} D_k & \mbox{ } 0 \le k < N/2 \\ X_{k+N/2} & = & E_k - e^{-\frac{2\pi i}{N}k} D_k & \mbox{ } 0 \le k < N/2 \end{matrix}$

Thus, all we need to do to compute X is to compute D, E, and the N/2 complex roots of unity specified by $e^{-\frac{2\pi i}{N}k}$ for $0 \le k < N/2$ (called twiddle factors), then do some basic arithmetic. The additional work after computing D and E is linear, so the analysis is just like that of mergesort, requiring O(NlogN) worst-case time. Following is a straightforward implementation based on recursion using the above formulas:

<<FFT prototypes>>=
complex* FFT_simple(complex* x, int N /* must be a power of 2 */);
<<Simple FFT>>=
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
complex * d, * e, * D, * E;
int k;

if (N == 1) {
X[0] = x[0];
return X;
}

e = (complex*) malloc(sizeof(struct complex_t) * N/2);
d = (complex*) malloc(sizeof(struct complex_t) * N/2);
for(k = 0; k < N/2; k++) {
e[k] = x[2*k];
d[k] = x[2*k + 1];
}

E = FFT_simple(e, N/2);
D = FFT_simple(d, N/2);

for(k = 0; k < N/2; k++) {
/* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
D[k] = complex_mult(complex_from_polar(1, -2.0*PI*k/N), D[k]);
}

for(k = 0; k < N/2; k++) {
X[k + N/2] = complex_sub(E[k], D[k]);
}

free(D);
free(E);
return X;
}


Although this works, is simple, and has good asymptotic performance, in practice the large number of allocations above degrade speed and triple memory use, which is particularly harmful on large inputs. Additionally, many of the complex_from_polar() calls are redundant, since only Nth roots of unity are used anywhere in the recursion tree. Finally, this implementation has considerable function call overhead and would ideally be replaced by a simpler algorithm on small subvectors.

 TODO Discuss optimizations

##  Files

We create a few separate modules: one for our basic complex number support, one for the FFT implementations, and one for the test driver and main(), described in the next section. We include the complex number header in the FFT header for the complex type.

<<complex_simple.h>>=
#ifndef _COMPLEX_SIMPLE_H_
#define _COMPLEX_SIMPLE_H_

complex number datatype
complex number prototypes

#endif /* #ifndef _COMPLEX_SIMPLE_H_ */

<<complex_simple.c>>=
#include "complex_simple.h"

complex number function definitions

<<fft.h>>=
#ifndef _FFT_H_
#define _FFT_H_

#include "complex_simple.h"

FFT prototypes

#endif /* #ifndef _FFT_H_ */

<<fft.c>>=
#include "fft.h"

FFT constants

Naive DFT

Simple FFT