The popularity of the C programming language and its derivatives in signal processing applications is increasing. Two books specifically about signal processing in C have recently become available [Reid, David]. A broad collection of numerical algorithms with implementations in C, Fortran and Pascal has been available for some years [Press]. Efforts are underway to develop an ANSI standard for Digital Signal Processing extensions to C [Leary]. A discussion of the advantages and disadvantages of C for numerical computing may be found in [MacDonald].

When a new reduced instruction set computer (RISC) or digital signal processor (DSP) is introduced a C compiler is one of the first high level development tools available. The wide availability of compilers and the ease with which portable applications may be written has prompted the use of C as a target language for other programming languages, such as C++, Pascal and Fortran and for higher level tools, such as the Ptolemy system[Buck].

C is expressive enough to allow the careful programmer to obtain the numerical performance potential of a processor so important for many signal processing applications. This paper presents recommendations for efficiently implementing signal processing algorithms in C without sacrificing clarity or portability. These recommendations are derived from the author's experience implementing signal processing functions for musical applications [Freed 1992] with the MIPS R4000 [Kane] C compiler on a Silicon Graphics Indigo workstation. There is therefor a bias in the example code towards audio signal processing, but the recommendations apply to numerical computing in general.

Most of the recommendations can be interpreted for traditional C [Kernighan], ANSI C [ANSI] and C++ [Ellis]. Although the ANSI standardization of C did not greatly change the language, some of these changes directly concern implementation of numerical algorithms. These and issues relating to C++ will be discussed throughout.

Recommendations will look like this

If you are in a hurry just read recommendations in italics

The choice of data types for variables in signal processing applications relates directly to algorithm performance. The following sections discuss the basic data types available in C.

Integers in C come in four potentially different sizes: `char`,
`short`, `int` and `long`, and two flavors:
`signed` and `unsigned`. Long integers are at least 32-bit wide
in most modern machines and are a good choice for most integer calculations in
signal processing. Short integers are often 16-bit wide and are commonly used to
represent sound samples for storage on disks and for D/A and A/D conversions.
Although `int` is usually 32 bits, it has been customary for many
compilers to provide 16-bit wide `int`'s for the Motorola 68000 Family
of processors. Many low-cost digital signal processors provide 16 or 24-bit
operations and addressing.

Floating point numbers in C can be single (`float`), double (`double
`) or extended precision (`long double`). Most processors
that provide floating point operations support a 32-bit single precision format
with a 24-bit signed magnitude mantissa. Most use the IEEE 754 standard for such
numbers, although popular low-cost DSP chips, the TMS320c3x series for example,
do not. Most RISC machines and floating point coprocessors provide direct
support for double precision floating point numbers with a 53-bit signed
magnitude mantissa. `Long double`**'**s are the same as
`double`'s in the current generation of processors.

The following table is for the MIPS C compiler:

Name Type Size Range Character char 8 0 to 255 Integer int (long) 32 -27147483648 to 27147483647 short 16 -32768 to 32767 unsigned 32 0 to 4294967295 Floating Point float 32 +/-10+/-38 double (long double) 64 +/-10+/-380

Example: Musical Signals

Audio signals for musical applications will serve as an example of how to choose appropriate data types.

Most RISC processors have faster floating point multiply operations than integer ones. Integer additions on the other hand are usually as fast as or faster than floating point additions, because of the normalization step required in floating point arithmetic. Though the choice of data type for signals appears at first glance to depend on the instruction timing of a particular processor and the particular mix of multiplications and additions of a given algorithm, many other factors support the choice of floating point:

Floating point operations allow much greater dynamic range*

Applications driving future processor designs require good floating point performance.*

Use **float** for calculations on sound samples

*Many processors implement integer arithmetic using the floating point ALU, e.g. PA-RISC [Hansen]

`typedef` can be used to parameterize types to allow for future changes,
e.g.

typedef long Stype; /* sample */ typedef float Ftype; /* frequency */ typedef double Ttype; /* time */

In practice this parameterization is not as useful as might be hoped. Consider the filter:

#ifdef FLOAT typedef float Stype; #else typedef long Stype; /* sample */ #endif Stype simple_filter(Stype input, float k, float j) { static Stype output= 0; return output = output*k +input*j; }

This code will be slower when Stype is `long` than when it is `float
`, because variables of integer type in the same expression as variables of
floating point types are promoted to floating point. Not only will these
conversions slow the filter down, but the critical arithmetic operations will be
done in floating point anyway.

Here is the filter using integers:

#define SCALE 65536 long integer_filter(long input, long k, long j) { static long output = 0; return output = (output*k+input*j)/SCALE; }

Allowing for 16-bits of dynamic range in the input and output and to avoid integer overflow, the above function only allows for 15-bits of precision in the specification of the filter coefficients.

The above example illustrates a major difficulty for signal processing with C
integers : there is no fixed point arithmetic interpretation of those integers
such as that of fixed point signal processor chips like the DSP56001 [Lee].
Although multiplication of two 32-bit integers on RISC processors such as the
R3000 yields the expected 64 bit resolution result, the multiplication operator
in C yields an integer the same size as its operands: 32 bits. The top 32 bits
are unavailable to the C programmer and it is therefor impossible to maintain
higher resolution for subsequent additions within an expression. Although there
are important applications where 16 bit signals and filter coefficients suffice,
most audio applications demand at least 24-bit signal resolution and better than
48-bit resolution for additions. This difficulty is addressed in DSP/C by
adding a `fract` data type to C [Leary].

How many bits are needed to represent time to within a sample at 48kHz sample rate over the closed interval of one day? All the integers between 0 and 24*60*60*48000 = 4147200000 have to be represented. 32-bit floating point numbers and 24-bit or smaller integers can be eliminated as candidates since 2^23 = 8388352. These types cannot represent time to the required precision over the interval of one hour. This leaves the 32-bit integer or double precision floating point types. The floating point representation has the advantage that time could be easily represented in seconds and be independent of sample rate. The addition operation on 32-bit integers is faster than double precision addition on many processors and may be implemented concurrently with other instructions.

Use `long` for counting samples

Use `double
` or `long` for measuring elapsed time

Delay line lengths for waveguide filters, for example, [Laakso et al] need to be specified to a better precision than the sample interval. A 100th of a sample interval precision over the range 0 to a maximum delay of a few minutes covers the broad range of needs from, for example, accurately tuning high frequencies for a simulated flute to the long delays typical of large reverberant spaces.

Use `float` or `long` integers for fractional sample delays

For most applications the range 0 to 48Khz suffices. However it can be
useful to represent negative frequencies and it is now commonplace to sample at
a rate higher than required by the Nyquist criterion and the range of human
hearing. Let's allow for -1Mhz to 1Mhz. Although human pitch perception has
limited precision, it is a common mistake to use this to decide on the required
frequency precision. In additive synthesis, for example, sounds are synthesized
for more than their pitch percept. Their timbre and loudness is also important
and is a function of the phase and amplitude of a **set** of sinusoidal
components. Two sinusoidal components close in frequency are perceived as one
component with an amplitude modulation whose rate is a function of the
difference in frequency. Differences between these beating sounds are
perceptible when the frequency difference is less than 1 Hz.

A usable range of values is easily covered by a single precision floating point number or 32-bit integer. The 24-bit fractional format of the DSP56000 family suffices, but 16-bits is not enough.

Use `float` or an integer type with more than 16-bits for
frequency parameters

An important use of the frequency parameter is in the representation of a phase increment for synthesis of periodic signals using linear oscillators. If the signal to be synthesized is read from a table that has a length which is a power of 2, a very efficient oscillator implementation is possible by accumulating integer phase increments [Hartmann].

Use an unsigned integer type with more than 16-bits for phase increments and accumulators.

16-bit PCM (the standard used in Compact Discs) provides enough dynamic range for most listening contexts. However, when recording it is common to encounter situations where more dynamic range is needed. It is also useful to have more resolution to avoid the effects of round off errors during operations on signals. Although most high-quality audio is stored on tape and disk as 16-bit integers, many vendors now offer the option of 24-bits. Most already do at least 24-bit processing and many offer 20-bit D/A conversion [Sony Hi-bit reference].

Use signed and unsigned char for low-quality sound samples

Use short (16-bit) integers for sound samples on tape and disk

Use** float** or better than 16-bit integers for calculations on sounds

Consider the following vector function from UDI [Depalle]:

sun_vaddosb(vect_in1,incr_in1,vect_in2,incr_in2, vect_out,incr_out,vect_length) float *vect_in1, *vect_in2, *vect_out; unsigned long vect_length, incr_in1, incr_in2, incr_out; { register float tmp, tmp_1, tmp_2; if((incr_in1 == 1) && (incr_in2 == 1) && (incr_out == 1)) { while(vect_length--) { tmp_1 = *vect_in1++; tmp_2 = *vect_in2++; tmp = tmp_1 - tmp_2; *vect_out++ = (tmp_1 + tmp_2) / tmp; } } else { while(vect_length--) { tmp_1 = *vect_in1; tmp_2 = *vect_in2; tmp = tmp_1 - tmp_2; *vect_out = (tmp_1 + tmp_2) / tmp; vect_in1 += incr_in1; vect_in2 += incr_in2; vect_out += incr_out; } } }

This is typical of coding styles which have evolved since the earliest applications of C. Programmers have taken advantage of the ease with which C can be used to express low-level operations to optimize the performance of the code using the compiler for the machine at hand.

The following version might have been the starting point for these optimizations:

void vaddosb(vect_in1,incr_in1,vect_in2,incr_in2, vect_out,incr_out,vect_length) float *vect_in1, *vect_in2, *vect_out; unsigned long vect_length,incr_in1,incr_in2,incr_out; { unsigned int i; for(i=0;i<vect_length;++i) vect_out[i*incr_out] = (vect_in1[i*incr_in1] + vect_in2[i*incr_in2]) / (vect_in1[i*incr_in1] - vect_in2[i*incr_in2]); }

No register variables, pointers or temporary variables are used and no special case is made for vectors of contiguous elements. Will this smaller, clearer function be much slower because of the 5 multiplies introduced per iteration? Will it be slower without the special case which allows compilers to optimize the post increment? With the MIPS R3000 processor on the SGI Indigo, the above code is faster than the original hand optimized code! The following measurements are the sum of the user time of ten runs of UNIX time(1) command executing a program which called the aforementioned functions one million times with 32 element vectors:

vector stride 1

sun_vaddosb 225.9

vaddosb 225

vector stride 2

sun_vaddosb 225.4

vaddosb 225

The MIPS C compiler did all the optimizations the original programmer did by hand and was able to generate more efficient loop testing code, which accounts for the small advantage over the original function. A conclusion from these measurements is that it is better to write simple, clear programs and examine their performance in a particular application than to write in a machine specific optimized style. This section is devoted to recommendations which are likely to yield faster code with most machines and compilers without leading to hard to understand code. The last section will discuss machine and compiler specific optimizations. But first here are some general recommendations on changing code to make it more efficient.

It can be difficult to predict performance by inspecting C code. Now that many processors have optimizing assemblers it is not easy to predict performance by looking at the output of the C compiler either. The only way to tell that changes to code are improving performance is to measure performance. Most modern compilers come with tools for execution time profiling. Even when these are not available, the UNIX time(1) command can be used to good effect.

Measure real performance to changes in code

It is very easy to break working code while changing it to make it more efficient. Most programmers have experienced the punishment for not following these recommendations from [Kernighan] :

Make it right before you make it faster

Keep it right when you make it faster

Make it
clear before you make it faster

Arguments to functions defined with "old style" declarations are
promoted to larger types. `Char`'s and `short`'s are promoted to
`int`'s. `Float`'s are promoted to `double`'s. So
consider this version of the simple filter:

typedef float Stype Stype simple_filter(input, k,j) Stype input; /* will be promoted to double */ float k; /* will be promoted to double */ float j; /* will be promoted to double */ { static Stype output= 0; return output = output*k +input*j; }

On machines where the `float` and `double` types are
different sizes this version will be considerably slower than one with the new
declaration style introduced with ANSI C and C++. Not only will all the
arithmetic be performed in double precision, but two conversions between
`float` and `double` will be required to use and assign to the
output variable. Most RISC processors have efficient double precision ALU
operations. However the additional costs of conversion instructions cannot be
avoided. Bus bandwidth and storage requirements of `double`'s is usually
twice that of `float'`s. The size penalty may have an unpredictable and
negative impact on performance by wasting precious cache memory or floating
point registers.

Use new style declarations

Use conversion utilities to
change old-style declarations in traditional C code

Note that the constant 3.141 is of type `double` and its use in an
expression will cause other types to be promoted to `double`.

Use the ANSI C **f **or **F** suffix for single precision
floating point constants,

e.g. 3.141f

or when the suffix is
unavailable use a caste:

e.g. (float) 3.141

In this example, most compilers will generate code to multiply `k` by t,
`PI`, and `2`.

#define PI 3.14159265358979 a = r*sin(2*PI*k*t);

So this constant folding will be more efficient:

#define TWOPI 6.2831853072 /* 2[[pi]] */ a = r*sin(TWOPI*k*t);

Fold constant expressions into single constants

Division is usually considerably more expensive than multiplication and
compilers often do not do strength reduction, i.e. `x/2.0` can be
written as `x*0.5`. Strength reduction of `2.0*x `to `x+x
`is often not done by RISC compilers because in some instruction schedules
the addition may be a slower choice.

Consider strength reduction when it is measurably faster

Standard C library functions such as `sin()` and `cos()`return
double precision results to a double precision argument. ANSI C standardized
single precision floating point versions of these routines.

When single precision suffices use faster library routines, e.g. fsin() and fcos() instead of sin() and cos()

Mathematical library routines are designed to achieve good performance without
compromising accuracy for a wide range of input values. This means that they
work over a much larger domain than an application may require. They are also
required to maintain the `errno` variable for exception handling.

Consider restrictive and faster implementations of C library functions

Remember thought that considerable effort may have been expended on efficient implementations of the C library functions. The hard work to do better ones for specific applications may be difficult to justify.

Certain arithmetic operations such as division by zero may require special additional exception processing. The C standard does not define this additional work. It is usually processor specific but may be well defined in the case of IEEE 754 floating point exceptions such as overflow and underflow. In any event many of these exceptions result in a potentially expensive processor context switch to handle the exception.

Avoid computations which may lead to expensive exception processing

Conversions from one data type to another may be expensive. Conversions from floating point numbers to signed and unsigned integers are commonly slow operations. On the R3000 it is cheaper to convert a floating point number to a signed integer and then that integer to an unsigned integer than directly from floating point to an unsigned integer. The results are not necessarily the same.

Avoid expensive conversions from one type to another

In traditional C, compilers are completely free to a change the order the evaluation of commutative and associative operators. ANSI C requires that compilers honor the order of expression evaluation specified by parentheses. Unless numerical precision requirements dictate otherwise (since floating point addition is not associative) it therefor is best to minimize use of parentheses and give compilers the most options for optimization. This unfortunately may compromize the readability of programs.

With ANSI C consider parentheses to trade numerical precision for optimization opportunity

Modern computer systems are designed to take advantage of locality of memory references, whether these systems be based on RISC, CISC, or DSP architecture. Register variables and data cache memory are used to take advantage of temporal locality - the tendency for most variable references to be to variables which were most recently referenced. Instruction caches and queues are designed to take advantage of temporal and spatial locality of instruction streams - the tendency for instructions streams to contain contiguous instructions or repeats of short sequences from loops. Cache lines and page mode memory chips are used to take advantage of spatial locality of data and instructions.

It is important therefor to consider algorithms and implementations which enhance the temporal and spatial locality of instruction and data references. This will become even more significant as systems evolve to include more than one processor, because of the high cost of references to variables shared between processors. Although the recommendations to enhance locality are machine independent, the payoff in their application will vary considerably among processors and memory subsystem designs.

The cost of access to a variable may vary considerably according to its address in memory.

Use a memory allocator which aligns data to memory page, and cache line boundaries - the vendor supplied library function for memory allocation may do this automatically

Register files and cache memory are much smaller in size than main memory, so it
is important to be economical in the use of space. For numerical applications
this means not unnecessarily using large data types such as
`double` when `float` will do.

`double` variables usually use twice the register and cache space of
float variables, use them only when the precision they offer is required

In C it is usually easy to control spatial locality of variables, since arrays
are implemented in contiguous memory locations and successive elements in
structures are required to be implemented contiguously. For example on the SGI
indigo, a reference to the real element of a complex number consisting of two
`float` structure elements results in the complex component being
brought into cache memory concurrently, speeding up a subsequent access to the
complex element of the structure.

Place variables that are commonly used together into contiguous memory
locations,

e.g. typedef struct { float r, c; } complex ;

Variables declared with local scope, are not necessarily implemented on the stack contiguously, so it may pay to create a structure to group together such variables, unless of course the compiler is expected to place the variables in registers anyway.

Modern C compilers offer a range of optimization and other controls over the code generation process. They are usually described concisely, perhaps in the UNIX manual format.

Consult compiler manual for control over code generation

The MIPS C compiler for SGI workstations, like many ANSI C compliant compilers,
has an option to revert to traditional C definition (-cckr). This option is
useful for compiling old code, but it should be used with the -float option,
which overrides the old type promotion rules and allows the compiler to generate
faster single precision floating point operations for expression with no types
larger than `float`.

use cc -cckr -float for old code

use cc -O2 -c or cc -O3 for ANSI code

The -g[0-3] argument which is used to specify generation of information for symbolic debugging suppresses optimization.

Use compiler optimization after code has been debugged

Profiling code is another potential source of inefficiency.

Turn off profiling options if it effects performance

In addition to documentation on the compiler itself, a reference manual for
the language it implements is usually provided. It is now standard practice to
include in such a manual a description of implementation-defined behavior. One
of these implementation choices is the interpretation of the `register`
storage-class specifier. The MIPS Compiler uses up to 6 register storage-class
specifiers when not optimizing and ignores them when optimizing. It always
ignores register storage-class specifiers in new style function definitions and
declarations.

The MIPS compiler like most modern C compilers make very good choices
automatically of which variables to put into registers. On the rare occasions
when it can be shown, by examining the output of the compiler and measuring
performance, that its use of registers can be improved on, a better way is
needed to guide the compiler. Unfortunately it is better to compile most modules
with optimization so `register` storage-class specifications will be
ignored. We can however use the `volatile` type qualifier, introduced
with ANSI C. Variables qualified this way will not usually be placed in
registers.

Use volatile to defeat a compilers unfruitful register assignments

The MIPS processors have an overflow exception on signed integer additions and subtractions but not on unsigned integer operations. When using integers for phase accumulators in digital oscillators it is common practice to allow the accumulator to wrap by simply overflowing.

Use unsigned integers for phase accumulators to avoid exceptions

Recursive digital filters (IIR) with impulsive inputs may cause floating point underflows. MIPS floating point has an unmaskable and expensive exception on underflow. This data dependent instruction execution time is disastrous for hard real-time applications such as sound synthesis. One solution to this is to clamp the state variables of the recursive filter to 0 as they approach 0. Recent designs such as the Motorola 88110 allow for exceptions to be completely disabled.

The following functions all perform the addition of two non zero length
vectors:

void add(int count, float *a, float *b, float *result) /* fortran */ { int i; for(i=0;i<count;++i) result[i] = a[i] +b[i]; } void add(int count, float *a, float *b, float *result) /* UDI */ { while(count--) *result++ = *a++ + *b++; } void add(int count, float *a, float *b, float *result) /* Csound */ { do { *result++ = *a++ + *b++; }while(--count); }

void add(int count, float *a, float *b, float *result) /* Resound */ { float *end = result+count; while(result<end) { *result++ = *a++ + *b++; } } void add(int count, float *a, float *b, float *result) /* Ptolemy */ { int i; for(i=count-1;i >=0;--i) result[i] = a[i] +b[i]; }

For the interesting values of `count` greater than 0, these
functions all perform the same operation. For values less than or equal to 0
they do not all perform the same way. It is unreasonable to expect a compiler to
generate the same code for each function. Which function will be the fastest?
For many compilers for RISC machines the first example will be fastest. Compiler
developers cannot optimize every way to express loops. Perhaps because of a
tradition of optimizing Fortran `do` loops, the MIPS compiler generates
good code for loops with array indices which are functions of loop iteration
count variables. It unrolls loops thereby increasing the chance for scheduling
optimizations, and minimizing loop control between blocks. It can also generate
fast test instructions for iteration variables with an upper and lower bound.

Use **for** loops and an iteration variable instead of
**while**

Reference arrays with iteration variables
instead of pointers

Compilers do not generate code to place the variable `count` in the
following function into a register:

int count; loopindex() { int i; for(i=0; i < count; ++i) x(i*i); }

Since the function `x(int i)` may modify `count` the
compiler cannot generate code to put a copy of `count` in a register at
the start of the loop. If it is known that there are no such side effects it is
possible to to help the compiler to come to the same conclusion by assigning all
the variables in the loop to registers:

int count; loopindex() { int i; int n = count; for(i=0; i < n; ++i) x(i*i); }

Increase the probability of fruitful register assignments by avoiding side effects and narrowing scope of variables so side effects may be checked for by the compiler

The recommendations in this paper have been used by the author to create an efficient vector and function library for real-time audio signal processing on MIPS processors. Preliminary experiences porting this work to other processors is very encouraging.

Silicon Graphics

PacBell

*ANSI C language specification,* American National Standard
X3.159-1989

Buck, J., Ha, S., Lee, E. A., Messserschmitt, D. G., *Ptolemy: A Platform for
Heterogeneous Simulation and Prototyping*, 1991, Proc. of the 1991 European
Simulation Conference, Copenhagen, Denmark, June 17-19

David, Ruth A., Stearns, Samuel D., *Signal Processing Algorithms Using
Fortran and C*, 1992, Prentice-Hall

Depalle, Philippe Rodet, Xavier,* UDI: A Unified DSP Interface for Sound
Signal Analysis and Synthesis*, Proceedings of ICMC, 1990, Computer Music
Association

Ellis, Margaret A., Stroustrup, Bjarne, *The Annotated C++ Reference Mnaual
*, 1990, Addison-Wesley

Freed, A., *MacMix: Mixing Music with a Mouse*, Proceedings of the 12th
International Computer Music Conference, The Hague, 1986, Computer Music
Association

Freed A., *New Tools for Rapid Prototyping of Music Sound Synthesis
Algorithms and Control Strategies*, Proc. International Computer Music
Conference, San Jose, Oct. 1992.

Freed, A., *Recording, Mixing, and Signal Processing on a Personal Computer
*, Proceedings of the AES 5th International Conference on Music and Digital
Technology, 1987

Hansen, Robert C., 1992, New Optimizations for PA-RISC Compilers, Hewlett-Packard Journal, June 1992.

Hartmann, W. M., *Digital Waveform Generation by Fractional Addressing*,
J. Acoust. Soc. Amer. 82, 1987, p. 1183-1891

Kane, Gerry, Heinrich, Joe, *MIPS RISC Architecture*, 1992, Prentice Hall

Kastens, U., *Compilation for Instruction Parallel Processors*, 1990,
Proceedings 3rd Compiler Compilers Conference 1990, Springer-Verlag

Kernighan, B. W., Ritchie, D. M., *The C Programming Language*, 1978,
1988, Prentice-Hall

Kernighan, B. W., Plauger, P. J., *The Elements of Programming Style*,
1978, McGraw-Hill

Laakso, T. I., Välimäki, V., Karjalainen, M., Laine, U. K.,
*Real-Time Implementation Techniques for a Continuously Variable Digital
Delay in Modeling Musical Instruments*, Proceedings of ICMC 1992, CMA, San
Francisco.

Leary, Kevin W., Waddington, William, *DSP/C: A Standard High Level Language
for DSP and Numeric Processing*, Proceedings of the ICASSP, 1990, p.
1065-1068

Lee, Edward A., *Programmable DSP Architectures*, October 1988, IEEE ASSP
Magazine

Moore, F. Richard, *Elements of Computer Music*, 1990, Prentice-Hall

Press, William H. et al, *Numerical Recipes in C - 2nd Edition*, 1992,
Cambridge University Press

Reid, Chistopher E., Passin, Thomas B., *Signal Processing in C*, 1992,
John Wiley & Sons

Macdonald, T., 1991, *C for Numeric Computing,* The journal of
Supercomputing, 5, 31-48, Kluwer Academic Pulishers, Boston.