Clear, Efficient Musical Signal Processing in ANSI C

Clear, Efficient Audio Signal Processing in ANSI C

Adrian Freed

Introduction

The popularity of the C programming language and its derivatives in signal processing applications is increasing. Two books specifically about signal processing in C have recently become available [Reid, David]. A broad collection of numerical algorithms with implementations in C, Fortran and Pascal has been available for some years [Press]. Efforts are underway to develop an ANSI standard for Digital Signal Processing extensions to C [Leary]. A discussion of the advantages and disadvantages of C for numerical computing may be found in [MacDonald].

When a new reduced instruction set computer (RISC) or digital signal processor (DSP) is introduced a C compiler is one of the first high level development tools available. The wide availability of compilers and the ease with which portable applications may be written has prompted the use of C as a target language for other programming languages, such as C++, Pascal and Fortran and for higher level tools, such as the Ptolemy system[Buck].

C is expressive enough to allow the careful programmer to obtain the numerical performance potential of a processor so important for many signal processing applications. This paper presents recommendations for efficiently implementing signal processing algorithms in C without sacrificing clarity or portability. These recommendations are derived from the author's experience implementing signal processing functions for musical applications [Freed 1992] with the MIPS R4000 [Kane] C compiler on a Silicon Graphics Indigo workstation. There is therefor a bias in the example code towards audio signal processing, but the recommendations apply to numerical computing in general.

Most of the recommendations can be interpreted for traditional C [Kernighan], ANSI C [ANSI] and C++ [Ellis]. Although the ANSI standardization of C did not greatly change the language, some of these changes directly concern implementation of numerical algorithms. These and issues relating to C++ will be discussed throughout.

Recommendations will look like this

If you are in a hurry just read recommendations in italics

Data Types for Signals

Introduction

The choice of data types for variables in signal processing applications relates directly to algorithm performance. The following sections discuss the basic data types available in C.

Integers

Integers in C come in four potentially different sizes: char, short, int and long, and two flavors: signed and unsigned. Long integers are at least 32-bit wide in most modern machines and are a good choice for most integer calculations in signal processing. Short integers are often 16-bit wide and are commonly used to represent sound samples for storage on disks and for D/A and A/D conversions. Although int is usually 32 bits, it has been customary for many compilers to provide 16-bit wide int's for the Motorola 68000 Family of processors. Many low-cost digital signal processors provide 16 or 24-bit operations and addressing.

Floating point

Floating point numbers in C can be single (float), double (double) or extended precision (long double). Most processors that provide floating point operations support a 32-bit single precision format with a 24-bit signed magnitude mantissa. Most use the IEEE 754 standard for such numbers, although popular low-cost DSP chips, the TMS320c3x series for example, do not. Most RISC machines and floating point coprocessors provide direct support for double precision floating point numbers with a 53-bit signed magnitude mantissa. Long double's are the same as double's in the current generation of processors.

The following table is for the MIPS C compiler:

Name              Type                          Size    Range                              
Character         char                          8       0 to 255                           
Integer           int (long)                    32      -27147483648 to 27147483647        
                  short                         16      -32768 to 32767                    
                  unsigned                      32      0 to 4294967295                    
Floating Point    float                         32      +/-10+/-38                         
                  double (long double)          64      +/-10+/-380

Example: Musical Signals

Audio signals for musical applications will serve as an example of how to choose appropriate data types.

Samples: Integers or Floating Point?

Most RISC processors have faster floating point multiply operations than integer ones. Integer additions on the other hand are usually as fast as or faster than floating point additions, because of the normalization step required in floating point arithmetic. Though the choice of data type for signals appears at first glance to depend on the instruction timing of a particular processor and the particular mix of multiplications and additions of a given algorithm, many other factors support the choice of floating point:

Floating point operations allow much greater dynamic range*

Applications driving future processor designs require good floating point performance.*

Use float for calculations on sound samples

*Many processors implement integer arithmetic using the floating point ALU, e.g. PA-RISC [Hansen]

typedef can be used to parameterize types to allow for future changes, e.g.

typedef long Stype;	/* sample */
typedef float Ftype;	/* frequency */
typedef double Ttype;	/* time */

In practice this parameterization is not as useful as might be hoped. Consider the filter:

#ifdef FLOAT
typedef float Stype;
#else
typedef long Stype;	/* sample */
#endif
Stype simple_filter(Stype input, float k, float j)
{
	static Stype output= 0; 
	
	return output = output*k +input*j;
}

This code will be slower when Stype is long than when it is float, because variables of integer type in the same expression as variables of floating point types are promoted to floating point. Not only will these conversions slow the filter down, but the critical arithmetic operations will be done in floating point anyway.

Here is the filter using integers:

#define SCALE 65536
long integer_filter(long input, long k, long j)
{
	static long output = 0;

	return output = (output*k+input*j)/SCALE;
}

Allowing for 16-bits of dynamic range in the input and output and to avoid integer overflow, the above function only allows for 15-bits of precision in the specification of the filter coefficients.

The above example illustrates a major difficulty for signal processing with C integers : there is no fixed point arithmetic interpretation of those integers such as that of fixed point signal processor chips like the DSP56001 [Lee]. Although multiplication of two 32-bit integers on RISC processors such as the R3000 yields the expected 64 bit resolution result, the multiplication operator in C yields an integer the same size as its operands: 32 bits. The top 32 bits are unavailable to the C programmer and it is therefor impossible to maintain higher resolution for subsequent additions within an expression. Although there are important applications where 16 bit signals and filter coefficients suffice, most audio applications demand at least 24-bit signal resolution and better than 48-bit resolution for additions. This difficulty is addressed in DSP/C by adding a fract data type to C [Leary].

Time

How many bits are needed to represent time to within a sample at 48kHz sample rate over the closed interval of one day? All the integers between 0 and 24*60*60*48000 = 4147200000 have to be represented. 32-bit floating point numbers and 24-bit or smaller integers can be eliminated as candidates since 2^23 = 8388352. These types cannot represent time to the required precision over the interval of one hour. This leaves the 32-bit integer or double precision floating point types. The floating point representation has the advantage that time could be easily represented in seconds and be independent of sample rate. The addition operation on 32-bit integers is faster than double precision addition on many processors and may be implemented concurrently with other instructions.

Use long for counting samples
Use double or long for measuring elapsed time

Delay

Delay line lengths for waveguide filters, for example, [Laakso et al] need to be specified to a better precision than the sample interval. A 100th of a sample interval precision over the range 0 to a maximum delay of a few minutes covers the broad range of needs from, for example, accurately tuning high frequencies for a simulated flute to the long delays typical of large reverberant spaces.

Use float or long integers for fractional sample delays

Frequency

For most applications the range 0 to 48Khz suffices. However it can be useful to represent negative frequencies and it is now commonplace to sample at a rate higher than required by the Nyquist criterion and the range of human hearing. Let's allow for -1Mhz to 1Mhz. Although human pitch perception has limited precision, it is a common mistake to use this to decide on the required frequency precision. In additive synthesis, for example, sounds are synthesized for more than their pitch percept. Their timbre and loudness is also important and is a function of the phase and amplitude of a set of sinusoidal components. Two sinusoidal components close in frequency are perceived as one component with an amplitude modulation whose rate is a function of the difference in frequency. Differences between these beating sounds are perceptible when the frequency difference is less than 1 Hz.

A usable range of values is easily covered by a single precision floating point number or 32-bit integer. The 24-bit fractional format of the DSP56000 family suffices, but 16-bits is not enough.

Use float or an integer type with more than 16-bits for frequency parameters

Phase Increments

An important use of the frequency parameter is in the representation of a phase increment for synthesis of periodic signals using linear oscillators. If the signal to be synthesized is read from a table that has a length which is a power of 2, a very efficient oscillator implementation is possible by accumulating integer phase increments [Hartmann].

Use an unsigned integer type with more than 16-bits for phase increments and accumulators.

Amplitude

16-bit PCM (the standard used in Compact Discs) provides enough dynamic range for most listening contexts. However, when recording it is common to encounter situations where more dynamic range is needed. It is also useful to have more resolution to avoid the effects of round off errors during operations on signals. Although most high-quality audio is stored on tape and disk as 16-bit integers, many vendors now offer the option of 24-bits. Most already do at least 24-bit processing and many offer 20-bit D/A conversion [Sony Hi-bit reference].

Use signed and unsigned char for low-quality sound samples
Use short (16-bit) integers for sound samples on tape and disk
Use float or better than 16-bit integers for calculations on sounds

Machine and Compiler Independent Optimizations

Consider the following vector function from UDI [Depalle]:

sun_vaddosb(vect_in1,incr_in1,vect_in2,incr_in2,
		vect_out,incr_out,vect_length)
float *vect_in1, *vect_in2, *vect_out;
unsigned long vect_length, incr_in1, incr_in2, incr_out;

{

	register float tmp,
		 tmp_1,
		 tmp_2;

	if((incr_in1 == 1) && (incr_in2 == 1) && (incr_out == 1))
	{
		while(vect_length--)
		{
			tmp_1 = *vect_in1++;
			tmp_2 = *vect_in2++;
			tmp   = tmp_1 - tmp_2;
	 
        			*vect_out++ =  (tmp_1 + tmp_2) / tmp;
       		}
	}
   	else
	{
       	while(vect_length--)
        	{
	    	tmp_1 = *vect_in1;
	    	tmp_2 = *vect_in2;
	    	tmp   = tmp_1 - tmp_2;
	     
          		*vect_out =  (tmp_1 + tmp_2) / tmp;

          		vect_in1 += incr_in1;
          		vect_in2 += incr_in2;
          		vect_out += incr_out;
  	}
      }
 }

This is typical of coding styles which have evolved since the earliest applications of C. Programmers have taken advantage of the ease with which C can be used to express low-level operations to optimize the performance of the code using the compiler for the machine at hand.

The following version might have been the starting point for these optimizations:

void vaddosb(vect_in1,incr_in1,vect_in2,incr_in2,
				vect_out,incr_out,vect_length)
float *vect_in1, *vect_in2, *vect_out;
unsigned long vect_length,incr_in1,incr_in2,incr_out;

{
	unsigned int i;

    	for(i=0;i<vect_length;++i)
		vect_out[i*incr_out] = 
		(vect_in1[i*incr_in1] + vect_in2[i*incr_in2]) 
            	/ (vect_in1[i*incr_in1] - vect_in2[i*incr_in2]); 
}

No register variables, pointers or temporary variables are used and no special case is made for vectors of contiguous elements. Will this smaller, clearer function be much slower because of the 5 multiplies introduced per iteration? Will it be slower without the special case which allows compilers to optimize the post increment? With the MIPS R3000 processor on the SGI Indigo, the above code is faster than the original hand optimized code! The following measurements are the sum of the user time of ten runs of UNIX time(1) command executing a program which called the aforementioned functions one million times with 32 element vectors:

vector stride 1
sun_vaddosb 225.9
vaddosb 225

vector stride 2
sun_vaddosb 225.4
vaddosb 225

The MIPS C compiler did all the optimizations the original programmer did by hand and was able to generate more efficient loop testing code, which accounts for the small advantage over the original function. A conclusion from these measurements is that it is better to write simple, clear programs and examine their performance in a particular application than to write in a machine specific optimized style. This section is devoted to recommendations which are likely to yield faster code with most machines and compilers without leading to hard to understand code. The last section will discuss machine and compiler specific optimizations. But first here are some general recommendations on changing code to make it more efficient.

It can be difficult to predict performance by inspecting C code. Now that many processors have optimizing assemblers it is not easy to predict performance by looking at the output of the C compiler either. The only way to tell that changes to code are improving performance is to measure performance. Most modern compilers come with tools for execution time profiling. Even when these are not available, the UNIX time(1) command can be used to good effect.

Measure real performance to changes in code

It is very easy to break working code while changing it to make it more efficient. Most programmers have experienced the punishment for not following these recommendations from [Kernighan] :

Make it right before you make it faster
Keep it right when you make it faster
Make it clear before you make it faster

Function Argument Promotion

Arguments to functions defined with "old style" declarations are promoted to larger types. Char's and short's are promoted to int's. Float's are promoted to double's. So consider this version of the simple filter:

typedef float Stype
Stype simple_filter(input, k,j)
Stype input;	/* will be promoted to double */
float k;		/* will be promoted to double */
 float j;		/* will be promoted to double */
{
	static Stype output= 0; 
	
	return output = output*k +input*j;
}

On machines where the float and double types are different sizes this version will be considerably slower than one with the new declaration style introduced with ANSI C and C++. Not only will all the arithmetic be performed in double precision, but two conversions between float and double will be required to use and assign to the output variable. Most RISC processors have efficient double precision ALU operations. However the additional costs of conversion instructions cannot be avoided. Bus bandwidth and storage requirements of double's is usually twice that of float's. The size penalty may have an unpredictable and negative impact on performance by wasting precious cache memory or floating point registers.

Use new style declarations
Use conversion utilities to change old-style declarations in traditional C code

Floating Point Constants and Expressions

Note that the constant 3.141 is of type double and its use in an expression will cause other types to be promoted to double.

Use the ANSI C f or F suffix for single precision floating point constants,
e.g. 3.141f
or when the suffix is unavailable use a caste:
e.g. (float) 3.141

In this example, most compilers will generate code to multiply k by t, PI, and 2.

#define PI 3.14159265358979
a = r*sin(2*PI*k*t);

So this constant folding will be more efficient:

#define TWOPI 	6.2831853072 /* 2[[pi]] */
a = r*sin(TWOPI*k*t);

Fold constant expressions into single constants

Division is usually considerably more expensive than multiplication and compilers often do not do strength reduction, i.e. x/2.0 can be written as x*0.5. Strength reduction of 2.0*x to x+xis often not done by RISC compilers because in some instruction schedules the addition may be a slower choice.

Consider strength reduction when it is measurably faster

C Standard Library Functions

Standard C library functions such as sin() and cos()return double precision results to a double precision argument. ANSI C standardized single precision floating point versions of these routines.

When single precision suffices use faster library routines, e.g. fsin() and fcos() instead of sin() and cos()

Mathematical library routines are designed to achieve good performance without compromising accuracy for a wide range of input values. This means that they work over a much larger domain than an application may require. They are also required to maintain the errno variable for exception handling.

Consider restrictive and faster implementations of C library functions

Remember thought that considerable effort may have been expended on efficient implementations of the C library functions. The hard work to do better ones for specific applications may be difficult to justify.

Exceptions

Certain arithmetic operations such as division by zero may require special additional exception processing. The C standard does not define this additional work. It is usually processor specific but may be well defined in the case of IEEE 754 floating point exceptions such as overflow and underflow. In any event many of these exceptions result in a potentially expensive processor context switch to handle the exception.

Avoid computations which may lead to expensive exception processing

Expensive Conversions

Conversions from one data type to another may be expensive. Conversions from floating point numbers to signed and unsigned integers are commonly slow operations. On the R3000 it is cheaper to convert a floating point number to a signed integer and then that integer to an unsigned integer than directly from floating point to an unsigned integer. The results are not necessarily the same.

Avoid expensive conversions from one type to another

Parentheses

In traditional C, compilers are completely free to a change the order the evaluation of commutative and associative operators. ANSI C requires that compilers honor the order of expression evaluation specified by parentheses. Unless numerical precision requirements dictate otherwise (since floating point addition is not associative) it therefor is best to minimize use of parentheses and give compilers the most options for optimization. This unfortunately may compromize the readability of programs.

With ANSI C consider parentheses to trade numerical precision for optimization opportunity

Increase Locality of References

Modern computer systems are designed to take advantage of locality of memory references, whether these systems be based on RISC, CISC, or DSP architecture. Register variables and data cache memory are used to take advantage of temporal locality - the tendency for most variable references to be to variables which were most recently referenced. Instruction caches and queues are designed to take advantage of temporal and spatial locality of instruction streams - the tendency for instructions streams to contain contiguous instructions or repeats of short sequences from loops. Cache lines and page mode memory chips are used to take advantage of spatial locality of data and instructions.

It is important therefor to consider algorithms and implementations which enhance the temporal and spatial locality of instruction and data references. This will become even more significant as systems evolve to include more than one processor, because of the high cost of references to variables shared between processors. Although the recommendations to enhance locality are machine independent, the payoff in their application will vary considerably among processors and memory subsystem designs.

The cost of access to a variable may vary considerably according to its address in memory.

Use a memory allocator which aligns data to memory page, and cache line boundaries - the vendor supplied library function for memory allocation may do this automatically

Register files and cache memory are much smaller in size than main memory, so it is important to be economical in the use of space. For numerical applications this means not unnecessarily using large data types such as double when float will do.

double variables usually use twice the register and cache space of float variables, use them only when the precision they offer is required

In C it is usually easy to control spatial locality of variables, since arrays are implemented in contiguous memory locations and successive elements in structures are required to be implemented contiguously. For example on the SGI indigo, a reference to the real element of a complex number consisting of two float structure elements results in the complex component being brought into cache memory concurrently, speeding up a subsequent access to the complex element of the structure.

Place variables that are commonly used together into contiguous memory locations,
e.g. typedef struct { float r, c; } complex ;

Variables declared with local scope, are not necessarily implemented on the stack contiguously, so it may pay to create a structure to group together such variables, unless of course the compiler is expected to place the variables in registers anyway.

Machine and Compiler Dependent Optimizations

Compiler Options

Modern C compilers offer a range of optimization and other controls over the code generation process. They are usually described concisely, perhaps in the UNIX manual format.

Consult compiler manual for control over code generation

The MIPS C compiler for SGI workstations, like many ANSI C compliant compilers, has an option to revert to traditional C definition (-cckr). This option is useful for compiling old code, but it should be used with the -float option, which overrides the old type promotion rules and allows the compiler to generate faster single precision floating point operations for expression with no types larger than float.

use cc -cckr -float for old code
use cc -O2 -c or cc -O3 for ANSI code

The -g[0-3] argument which is used to specify generation of information for symbolic debugging suppresses optimization.

Use compiler optimization after code has been debugged

Profiling code is another potential source of inefficiency.

Turn off profiling options if it effects performance

Register Variables

In addition to documentation on the compiler itself, a reference manual for the language it implements is usually provided. It is now standard practice to include in such a manual a description of implementation-defined behavior. One of these implementation choices is the interpretation of the register storage-class specifier. The MIPS Compiler uses up to 6 register storage-class specifiers when not optimizing and ignores them when optimizing. It always ignores register storage-class specifiers in new style function definitions and declarations.

The MIPS compiler like most modern C compilers make very good choices automatically of which variables to put into registers. On the rare occasions when it can be shown, by examining the output of the compiler and measuring performance, that its use of registers can be improved on, a better way is needed to guide the compiler. Unfortunately it is better to compile most modules with optimization so register storage-class specifications will be ignored. We can however use the volatile type qualifier, introduced with ANSI C. Variables qualified this way will not usually be placed in registers.

Use volatile to defeat a compilers unfruitful register assignments

Exceptions

The MIPS processors have an overflow exception on signed integer additions and subtractions but not on unsigned integer operations. When using integers for phase accumulators in digital oscillators it is common practice to allow the accumulator to wrap by simply overflowing.

Use unsigned integers for phase accumulators to avoid exceptions

Recursive digital filters (IIR) with impulsive inputs may cause floating point underflows. MIPS floating point has an unmaskable and expensive exception on underflow. This data dependent instruction execution time is disastrous for hard real-time applications such as sound synthesis. One solution to this is to clamp the state variables of the recursive filter to 0 as they approach 0. Recent designs such as the Motorola 88110 allow for exceptions to be completely disabled.

Loops

The following functions all perform the addition of two non zero length vectors:

void add(int count, float *a, float *b, float *result)	/* fortran */
{
	int i;
	for(i=0;i<count;++i)
		result[i] = a[i] +b[i];
}

void add(int count, float *a, float *b, float *result)	/* UDI */
{
	while(count--)
		*result++ = *a++ + *b++;
}

void add(int count, float *a, float *b, float *result)	/* Csound */
{
	do
	{
		*result++ = *a++ + *b++;
	}while(--count);
}

void add(int count, float *a, float *b, float *result)	/* Resound */
{
	float *end = result+count;
	while(result<end)
	{
		*result++ = *a++ + *b++;
	}
}

void add(int count, float *a, float *b, float *result)	/* Ptolemy */
{
	int i;
	for(i=count-1;i >=0;--i)
		result[i] = a[i] +b[i];
}

For the interesting values of count greater than 0, these functions all perform the same operation. For values less than or equal to 0 they do not all perform the same way. It is unreasonable to expect a compiler to generate the same code for each function. Which function will be the fastest? For many compilers for RISC machines the first example will be fastest. Compiler developers cannot optimize every way to express loops. Perhaps because of a tradition of optimizing Fortran do loops, the MIPS compiler generates good code for loops with array indices which are functions of loop iteration count variables. It unrolls loops thereby increasing the chance for scheduling optimizations, and minimizing loop control between blocks. It can also generate fast test instructions for iteration variables with an upper and lower bound.

Use for loops and an iteration variable instead of while
Reference arrays with iteration variables instead of pointers

Compilers do not generate code to place the variable count in the following function into a register:

int count; loopindex() { int i; for(i=0; i < count; ++i) x(i*i); }

Since the function x(int i) may modify count the compiler cannot generate code to put a copy of count in a register at the start of the loop. If it is known that there are no such side effects it is possible to to help the compiler to come to the same conclusion by assigning all the variables in the loop to registers:

int count; loopindex() { int i; int n = count; for(i=0; i < n; ++i) x(i*i); }

Increase the probability of fruitful register assignments by avoiding side effects and narrowing scope of variables so side effects may be checked for by the compiler

Conclusion

The recommendations in this paper have been used by the author to create an efficient vector and function library for real-time audio signal processing on MIPS processors. Preliminary experiences porting this work to other processors is very encouraging.

Acknowledgments

Silicon Graphics
PacBell

References

ANSI C language specification, American National Standard X3.159-1989

Buck, J., Ha, S., Lee, E. A., Messserschmitt, D. G., Ptolemy: A Platform for Heterogeneous Simulation and Prototyping, 1991, Proc. of the 1991 European Simulation Conference, Copenhagen, Denmark, June 17-19

David, Ruth A., Stearns, Samuel D., Signal Processing Algorithms Using Fortran and C, 1992, Prentice-Hall

Depalle, Philippe Rodet, Xavier, UDI: A Unified DSP Interface for Sound Signal Analysis and Synthesis, Proceedings of ICMC, 1990, Computer Music Association

Ellis, Margaret A., Stroustrup, Bjarne, The Annotated C++ Reference Mnaual , 1990, Addison-Wesley

Freed, A., MacMix: Mixing Music with a Mouse, Proceedings of the 12th International Computer Music Conference, The Hague, 1986, Computer Music Association

Freed A., New Tools for Rapid Prototyping of Music Sound Synthesis Algorithms and Control Strategies, Proc. International Computer Music Conference, San Jose, Oct. 1992.

Freed, A., Recording, Mixing, and Signal Processing on a Personal Computer , Proceedings of the AES 5th International Conference on Music and Digital Technology, 1987

Hansen, Robert C., 1992, New Optimizations for PA-RISC Compilers, Hewlett-Packard Journal, June 1992.

Hartmann, W. M., Digital Waveform Generation by Fractional Addressing, J. Acoust. Soc. Amer. 82, 1987, p. 1183-1891

Kane, Gerry, Heinrich, Joe, MIPS RISC Architecture, 1992, Prentice Hall

Kastens, U., Compilation for Instruction Parallel Processors, 1990, Proceedings 3rd Compiler Compilers Conference 1990, Springer-Verlag

Kernighan, B. W., Ritchie, D. M., The C Programming Language, 1978, 1988, Prentice-Hall

Kernighan, B. W., Plauger, P. J., The Elements of Programming Style, 1978, McGraw-Hill

Laakso, T. I., Välimäki, V., Karjalainen, M., Laine, U. K., Real-Time Implementation Techniques for a Continuously Variable Digital Delay in Modeling Musical Instruments, Proceedings of ICMC 1992, CMA, San Francisco.

Leary, Kevin W., Waddington, William, DSP/C: A Standard High Level Language for DSP and Numeric Processing, Proceedings of the ICASSP, 1990, p. 1065-1068

Lee, Edward A., Programmable DSP Architectures, October 1988, IEEE ASSP Magazine

Moore, F. Richard, Elements of Computer Music, 1990, Prentice-Hall

Press, William H. et al, Numerical Recipes in C - 2nd Edition, 1992, Cambridge University Press

Reid, Chistopher E., Passin, Thomas B., Signal Processing in C, 1992, John Wiley & Sons

Macdonald, T., 1991, C for Numeric Computing, The journal of Supercomputing, 5, 31-48, Kluwer Academic Pulishers, Boston.