Siggraph 1996 Course Notes - Adrian Freed

1e) Integrating Sound & Graphics: Systems Issues

Introduction

Users of interactive applications that include sound and graphics reasonably expect them to:

To satisfy these needs, the application programmer works to orchestrate services provided by the operating system for their target platform(s). Operating systems implement an interface between applications and hardware. Being the carpets of the computer business, operating systems have a lot of dirt, political and technical, swept under them. Their design and especially their implementation defines what is easy, possible and almost impossible for application programmers. Unfortunately, truly interactive graphics and sound applications fall into the almost impossible category on personal computer operating systems.

Operating System Support

Good support for real-time interactive multimedia requires a significant overhaul of currently available operating systems. It is unlikely that we will see these upgrades until mid-1997. As a stop-gap measure, special services have been patched onto existing operating systems to support sound and graphics, but the focus has been the delivery of streams of compressed video and sound, not interactivity. Buffering, the main strategy introduced to support continuous media, is the enemy of interactivity: it adds delay between the user's controlling gestures and the resulting display and audition. A 4096-sample output queue at 44.1 kHz, for example, adds nearly 100 ms before a gesture can be heard. These delays are principally responsible for the disorienting effect of many virtual reality environments and for the scarcity of interactivity beyond the click-and-wander paradigm.

SGI IRIX

Over the last few years, SGI has been quietly and smoothly integrating media support into their platform with real-time applications in mind. In section 6 we will examine the services provided in SGI IRIX to support sound synthesis scheduling that simultaneously satisfies all of the above user requirements. These techniques have been implemented in the author's HTM system, demonstrated in the afternoon session. Challenges associated with the sound and MIDI services of Windows 95 and Macintosh System 7.5 will also be covered.

5) Controlling and Scripting Sound Synthesis and Processing

5c) Firewire, LAN's and Buses

Why won't MIDI go away?

MIDI is the dominant standard network platform for commercial computer music. Its limitations have been long felt and documented. Since faster networking hardware than MIDI has been around for years, why is MIDI not obsolete? One reason is that there are a few special requirements of audio and musical applications that readily available LAN technologies don't simultaneously satisfy:

Local Area Networks

FDDI and 10BaseT Ethernet

FDDI over optical fibre provides good electrical isolation, but is too expensive. The closest LAN technology to satisfying these requirements is 10BaseT Ethernet: it is cheap and readily embedded.

Unfortunately, when more than two nodes are to be connected, a hub is required: another piece of gear to carry around, with its associated wall transformer.

There is also the superstition that real-time requirements are hard to meet using Ethernet protocols. CNMAT successfully uses Ethernet for musical control applications by careful use of the UDP component of the TCP/IP suite, as sketched below.
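For illustration, here is a minimal sketch of sending a single control message to a synthesis server over UDP with BSD sockets. The port number, server address and one-line message format are invented for the example; they are not CNMAT's actual protocol.

/*
 * Minimal sketch: one musical control message over Ethernet via UDP.
 * The destination port, address and message format are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);   /* connectionless datagram socket */
    struct sockaddr_in synth;
    const char *msg = "freq 440.0";           /* hypothetical control message */

    if (s < 0) { perror("socket"); exit(1); }

    memset(&synth, 0, sizeof(synth));
    synth.sin_family = AF_INET;
    synth.sin_port = htons(7000);                      /* arbitrary example port */
    synth.sin_addr.s_addr = inet_addr("192.168.1.10"); /* synthesis server */

    /* UDP gives no delivery guarantee, but also no retransmission delays:
       a lost control message is simply superseded by the next one. */
    if (sendto(s, msg, strlen(msg), 0,
               (struct sockaddr *)&synth, sizeof(synth)) < 0)
        perror("sendto");

    close(s);
    return 0;
}

UDP is preferred over TCP here because a late control message is worthless: it is better to drop it than to stall fresh gestures behind retransmissions of stale ones.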

Fast Ethernet

The development of Fast Ethernet means that computer LANs represent an attractive solution for audio control networking, especially in institutional or studio environments.

There are two kinds of Fast Ethernet: 100BaseVG (http://www.io.com/~richardr/vg/) and 100BaseT (http://cs.wpi.edu/~mmurray/fast_ethernet/fetech.html).

100BaseVG has special support for real-time media applications. As Richard Russel explains:

"When a node has one or more packets to transmit, it sends a signal (a demand) to the hub that says `hey! let me transmit a packet!' When the hub decides it is time for the node to transmit, the hub sends the node a signal that means `OK! Transmit one packet right now.' A node can send a hub two types of transmit requests--normal priority and high priority. A hub cycles through each of the requesting nodes, in port order, allowing each to transmit one packet. A hub will service each of the nodes asserting a high priority request before servicing any nodes with normal priority requests. It is important to note that the hub pays no attention to nodes that don't need to transmit; they are skipped and do not take time in the hub's round robin algorithm. It is important to note that the round robin scanning is extremely fast and is implemented in hardware by the RMAC chips in the hub.

There are two types of transmit demands--normal priority and high priority. This feature is designed to support multi-media video streams or other time-critical applications that require low transmit latency. "
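The scheduling rule Russel describes is easy to model in software. The sketch below is only an illustration of the demand-priority idea (high-priority requests served before normal ones, in port order, skipping idle ports); it is not the RMAC hardware implementation.

/* Illustrative model of 100BaseVG demand-priority arbitration. */
#include <stdio.h>

#define NPORTS 8
enum request { IDLE, NORMAL, HIGH };

/* Return the next port allowed to transmit one packet, or -1 if all are idle. */
static int next_to_transmit(const enum request req[NPORTS])
{
    int p;
    for (p = 0; p < NPORTS; ++p)        /* high-priority requests first, in port order */
        if (req[p] == HIGH)
            return p;
    for (p = 0; p < NPORTS; ++p)        /* then normal-priority requests */
        if (req[p] == NORMAL)
            return p;
    return -1;                          /* nobody needs to transmit */
}

int main(void)
{
    enum request req[NPORTS] = { IDLE, NORMAL, IDLE, HIGH, NORMAL, IDLE, HIGH, IDLE };
    int p;

    while ((p = next_to_transmit(req)) >= 0) {
        printf("port %d transmits one packet\n", p);
        req[p] = IDLE;                  /* request satisfied, port skipped from now on */
    }
    return 0;
}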

Bus solutions

MIDI was developed when computers, music synthesizers and gestural controllers were physically separate and possibly distant components. Most computers are now capable of good sound synthesis and contain slots that can be filled with specialized sound synthesis hardware. Computers are now small enough to be integrated into gestural controllers. These trends indicate that much musical control information may flow across fast bus interfaces rather than over wires between separate boxes.

IEEE 1394: Firewire

A very attractive new option will be widely available next year: Firewire (http://www.firewire.org). As well as satisfying the requirements of current MIDI users, it will be widely used to interconnect professional equipment, consumer electronics and computers.

Firewire features:

5d) Coherent, Multiplatform Control

Introduction

The music industry has focused its efforts on networks of loosely and slowly coupled, cheap, specialized boxes. This was the state of the computing industry decades ago. The key to lower costs and user convenience in the long term is greater integration. We expect future platforms for unified control of sound synthesis to be quite different from the MIDI model.

OpenSynth

In a world where control parameters cross a high-speed bus (motherboard), an operating system interface (software synthesis) or a medium-speed serial LAN (Firewire, Fast Ethernet), the cost per transaction is far more important than the size of the data transferred. This observation, and the need for a protocol that is easy to understand and use, drives the design of OpenSynth, a sound synthesis control protocol being developed at CNMAT. OpenSynth evolved from HTM, a software synthesis system based on a client/server model:
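Because the fixed cost of each transaction (packet headers, system calls, interrupts, rescheduling) dwarfs the cost of the bytes themselves, it pays to bundle many parameter updates into a single transaction. The layout below is a hypothetical illustration of that idea only; it is not the OpenSynth wire format.

/* Hypothetical bundle of parameter updates: many small control changes are
   gathered and shipped in one write()/sendto() so the fixed per-transaction
   cost is paid once.  NOT the OpenSynth format, just an illustration. */
#include <stdio.h>

#define MAX_UPDATES 64

struct update {               /* one parameter change */
    int   target;             /* which synthesis parameter (illustrative id) */
    float value;              /* its new value */
};

struct bundle {
    double timestamp;         /* when the whole bundle should take effect */
    int    count;             /* number of updates that follow */
    struct update u[MAX_UPDATES];
};

int main(void)
{
    struct bundle b = { 0.0, 0 };
    int i;

    /* gather, say, 16 partial amplitudes into one bundle ... */
    for (i = 0; i < 16; ++i) {
        b.u[i].target = i;
        b.u[i].value  = 1.0f / (i + 1);
    }
    b.count = 16;

    /* ... and ship them with ONE transaction instead of sixteen,
       e.g. a single sendto() of the filled portion of the bundle. */
    printf("one transaction carries %d updates (%lu bytes each)\n",
           b.count, (unsigned long)sizeof(struct update));
    return 0;
}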

Design Goals for Open Synth Control

Example Control Environments

Apple Newton

This sequence (Summertime) was called up by writing its name. Clicking on the score starts the sequence from that point. Faders at the bottom are for orchestration (Timbre Space) and tempo. Synthesis is on an SGI machine messaged through an RS422 serial port.

Resonance Editor

In this editor the horizontal axis is frequency and the vertical axis is amplitude in dB. Each vertical line represents an exponentially decaying sine wave. The top of the line represents the initial amplitude; the length of the line (on a perceptually motivated scale) represents the decay rate of the partial. Longer lines represent more slowly decaying sine waves.

Opcode MAX

MAX is an extensible, visual, data-flow language running on the Macintosh. Data flow is from top to bottom. The top object is a slider that sets pitch. The top right button starts an envelope function. Data from these gestures is bundled by the htm object into a buffer to be sent over the Ethernet with the UDPWrite object at the bottom.

6a) Platforms for Sound Synthesis

Comparison

                       MIDI Synths.  MIDI Modules  Sound Cards  Motherboard Audio  Software Synths.
Sound Quality          high          medium        medium       medium             high
Voices                 high          medium        medium       medium             variable
Palette                medium        poor          poor         poor               high
Platform consistency   medium        poor          poor         poor               high
Synthesis methods      fixed         fixed         fixed        fixed              flexible
O/S requirements       slight        slight        medium       medium             high
Standards              yes           yes           yes          yes                no

Graphics and Sound Applications

Situations when software synthesis or programmable sound cards are preferable include:

Non-event-oriented sounds such as sonification, speech and the sung voice

Behavioural animation where sound properties are derived from graphic element interactions

Special sound spatialization such as VR (VRML 2.0)

Cost or space budget doesn't allow for additional music modules

6c) Networked Audio

Section 5 covered control of sound synthesis processes in a networked environment. This section will look at delivery of sound itself over a network.

Introduction

A comparison of typical networking data rates with synthesized sound data rates illustrates that sound delivery over the most common interconnect (the modem) is not straightforward:

T3      ~45 Mbit/s
T1      ~1.5 Mbit/s
CD      ~1.4 Mbit/s
ISDN    128 kbit/s
Modem   ~30 kbit/s

Uncompressed CD-quality stereo, for example, requires 44,100 samples/s x 16 bits x 2 channels, or about 1.4 Mbit/s: roughly fifty times what a 28.8 kbit/s modem can deliver.



Successful network delivery of audio requires careful balancing of three resources: the server, the network and the client. We distinguish two cases: in-house, tightly specified networks and the Internet.

Intranets: In-house Networked Audio

Choosing the network hardware:

Internets: Internetworked Audio

Servers are usually optimized for I/O throughput and, because of economies of scale, can have large memories. Not surprisingly, successful Internet audio applications consist mainly of the delivery of canned, highly space-optimized sounds.

The Internet is a very diverse collection of computing resources. What ties it all together is the TCP/IP protocol suite, which can run over many kinds of wire. The problem with TCP/IP is that it was not designed for timely delivery of data. Real-time extensions have been developed under laboratory conditions, but deploying them widely on the Internet will be very challenging.

The diversity of client computers creates special challenges. Because the network is the bottleneck, highly space-efficient coding methods are favored for audio. These assume sufficient computing horsepower at the client for decoding. While this is generally available, very few machines are powerful enough to decode compressed video and audio simultaneously.

For real-time audio applications over the Internet, reliability is a primary concern. Reliability is suffering under the strains of the rapid growth of the Internet. According to Vern Paxson from the Network Research Group at Lawrence Berkeley National Laboratory: "the likelihood of encountering a major routing pathology more than doubled between the end of 1994 and the end of 1995, rising from 1.5% to 3.4%." Vern Paxson (http://ee.lbl.gov/) is one of the first to study end-to-end TCP dynamics.

Resources:

6b) Reliability, Latency and Real-Time Issues

A major advantage of using dedicated hardware such as sound cards and MIDI modules is that real-time performance and reliability issues are taken care of inside the boxes. However, the compromises in synthesis control and timbre are often too great, and for many applications we have to do our own synthesis. Delays are also introduced in controlling the external devices.

In this section we discuss issues associated with making sound synthesis and graphics coexist on the same computer.

First we review the operating system services required:

Resource Allocation and Management

Synchronization

Bounded Latency Input/Output

Resource Monitoring

simplesynth.c

Now we introduce a simple example program developed for sound synthesis on the SGI IRIX platform.

The IRIX audio library implements a FIFO queue for sound output to the D/A converters. To avoid audible clicks, this queue must always contain a minimum number of samples; we refer to this as the low water mark. It is tempting to prevent buffer underrun by stuffing as many sound samples into the queue as we can, but this "greedy" strategy has an important failing: unbounded latency. What we want for interactive sound synthesis and synchronisation is bounded latency. Although application requirements vary, the 10±1 ms rule of thumb is good to keep in mind. So we have to keep samples in the queue, but no more than a certain high water mark. When the synthesis process has filled the queue to the high water mark it should release the CPU so that another process (perhaps a graphics animation) can run. There is now a real danger that other processes will occupy the CPU so long that the FIFO drains below the low water mark. This is deftly handled in the SGI audio library by providing a semaphore for the FIFO queue: the sound driver takes note when the queue drains below a defined fill point. By careful use of UNIX's group semaphore mechanism, the select(2) system call, it is possible to request the CPU when the low water mark event occurs.

For this to work reliably, however, the synthesis process priority has to be set so that it is guaranteed to be the next process to run when the operating system reschedules.

These ideas have been embodied in a simple sound synthesiser, an extract of which appears below:

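        /* The extract assumes an audio output port "alp" already opened with
           ALopenport(3A), and constants PI, SRATE (the sampling rate) and
           OUTPUTQUEUESIZE (the size of the output FIFO) defined earlier in
           the full program. */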
        /* obtain a file descriptor associated with the sound output port */
        dacfd = ALgetfd(alp);

/* set process priority to high and non-degrading */
        if (schedctl(NDPRI, getpid(), NDPHIMIN) < 0)
                perror("schedctl NDPHIMIN");
/* lock memory to avoid paging delays */
         plock(PROCLOCK);
        /* schedctl requires set user id root, so
           we need to return to regular user id to avoid security problems */
        setuid(getuid());
/* synthesize */
        {
                /* time */
                double t=0.0;
                /* sine wave frequency */
                double f = 440.0;
                /* high and low water marks */
                int hwm =300, lwm=256;
                fd_set read_fds, write_fds;
                /* largest file descriptor to search for */
                int nfds=dacfd+1;

#define VSIZE 32 /* vector size */
                short samplebuffer[VSIZE];
                for(;;)
                {
                /*      compute sine wave samples while the sound output buffer is
                        below the high water mark */
                while(ALgetfilled(alp)<hwm)
                        {
                                int i;
                                for(i=0;i<VSIZE;++i)
                                {
                                        /* appropriately scaled sine wave */
                                        samplebuffer[i] = 32767.0f *sin(2.0*PI*f*t);

                                        t += 1.0/SRATE; /* the march of time */
                                }
                                /* send samples out the door */
                                ALwritesamps(alp, samplebuffer, VSIZE);
                        }
                /* set the low water mark, i.e. the point at which
                        we want control back from select(2) */
                        ALsetfillpoint(alp, OUTPUTQUEUESIZE-lwm);

                        /* set up select */
                        FD_ZERO(&read_fds);     /* clear read_fds        */
                        FD_ZERO(&write_fds);    /* clear write_fds        */
                        FD_SET(dacfd, &write_fds);
                        FD_SET(0, &read_fds);
                /* give control back to OS scheduler to put us to sleep
                        until the Dac queue drains and/or a
                        character is available from standard input */

                         if (  select(nfds, &read_fds, &write_fds, (fd_set *)0,
                                        (struct timeval *)0) < 0)
                        { /* select reported an error */
                                        ALcloseport(alp); perror("bad select"); exit(1);
                        }
                        /* is there a character in the queue? */
                        if(FD_ISSET(0, &read_fds))
                        {
                                /* this will never block */
                                char c =  getchar();

                                if(c=='q') /* quit */
                                        break;
                                else    /* tweak frequency */
                                        if((c<='9') && (c>='0'))
                                                f = 440.0+100.0*(c-'0');
                        }
                }
        }

The select(2) call blocks on two devices: the DAC FIFO and standard input. This illustrates how synthesis control may be integrated into this user-level real-time scheduling. Note that most I/O on UNIX systems is coordinated through file descriptors that may be used in such a select(2) call. The SGI MIDI system works this way, so it is simple to extend the above example into a MIDI-controlled software synthesizer.
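A sketch of that extension is given below. It assumes the IRIX MIDI library (libmd) calls mdInit, mdOpenInPort, mdGetFd and mdReceive behave as described in the md reference pages; check those pages for the exact declarations before relying on it. The fragments slot into the extract above (math.h is already needed for sin()).

        /* additions for MIDI control -- assumed libmd interface */
        #include <dmedia/midi.h>

        MDport inport;
        int mdfd;

        /* during initialisation, alongside ALgetfd(): */
        if (mdInit() <= 0) { fprintf(stderr, "no MIDI interfaces\n"); exit(1); }
        inport = mdOpenInPort(NULL);            /* default MIDI input */
        mdfd = mdGetFd(inport);

        /* in the synthesis loop, add the MIDI descriptor to the select set: */
        FD_SET(mdfd, &read_fds);
        if (mdfd >= nfds) nfds = mdfd + 1;

        /* after select(2) returns, consume a pending MIDI event: */
        if (FD_ISSET(mdfd, &read_fds))
        {
                MDevent ev;
                if (mdReceive(inport, &ev, 1) == 1
                    && (ev.msg[0] & 0xf0) == 0x90   /* note on */
                    && ev.msg[2] > 0)               /* non-zero velocity */
                        f = 440.0 * pow(2.0, (ev.msg[1] - 69) / 12.0); /* note number -> Hz */
        }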

It Still Clicks

If you try the simple synthesizer above on your own SGI machine, you may be disappointed to still hear clicks from time to time. This is usually due to interference from the many daemons running alongside your program.

The following things commonly disturb real-time performance:


Many of these daemons and services are controlled on SGI machines through /etc/config. For critical real-time performance we reboot our machines with as few daemons as possible, e.g. in single-user mode.

Macintosh and PC Software Synthesis

The Mac OS, like Windows, does not have pre-emptive scheduling. Programs explicitly pass control to each other through a coroutine mechanism; the only preemption that occurs is through I/O interrupts. So the only way to achieve real-time behaviour is to run code at interrupt time.

On SGI machines, samples are "pushed" by the user process into a FIFO. On Macintoshes and PCs, interrupt-level code "pulls" samples that a user-supplied function provides. The complication of the pull scheme is that the user-supplied callback function is constrained because it runs at interrupt level: on the Macintosh, for example, it cannot allocate memory. It is also unwise to spend too much time computing in this routine, otherwise pending I/O may fail.

In the pull scheme, latency is controlled by the operating system, not by the user process; on the Power Macintosh, for example, it is hard to achieve latencies better than 30 ms. A sketch of a pull-model callback appears below.
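The shape of the pull scheme can be illustrated with a hypothetical callback. The function name and signature below are invented for the example and do not correspond to the actual Sound Manager or Windows wave APIs.

/*
 * Hypothetical pull-model audio callback.  The registration mechanism and
 * signature are invented for illustration; real APIs differ in detail.
 */
#include <math.h>

#define SRATE 44100.0
#define PI    3.14159265358979

static double phase = 0.0;      /* persistent synthesis state: no allocation
                                   and no blocking calls allowed at interrupt level */

/* Called by the OS interrupt handler when the output buffer needs refilling;
   it must return quickly or pending I/O may fail. */
void fill_audio(short *buffer, int nframes)
{
    int i;
    for (i = 0; i < nframes; ++i) {
        buffer[i] = (short)(32000.0 * sin(phase));   /* 440 Hz test tone */
        phase += 2.0 * PI * 440.0 / SRATE;
        if (phase > 2.0 * PI)
            phase -= 2.0 * PI;
    }
}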

6d) Synchronization with Gesture and Graphics

Synchronization on a single machine

Synchronization itself depends on the ability to time accurately when gestures occur, when samples are heard and when images are seen. It is amazing that most computers and operating systems don't provide any way to time these three things against a single clock source. Again the culprit is buffering: we may be able to time accurately when we have finished computing an image or sound, but we cannot achieve synchronization if we don't know how long the OS and hardware will take to deliver the media to the user. Again we turn to SGI systems to see how to do this properly. The basic idea is to reference everything to a highly accurate, dependable hardware clock source.

Here is how SGI describes the clock:

"dmGetUST(2) returns a high-resolution, unsigned 64-bit number to processes using the digital media subsystem. The value of UST is the number of nanoseconds since the system was booted. Though the resolution is 1 nanosecond, the actual accuracy may be somewhat lower and varies from system to system. Unlike other representations of time under UNIX, the value from which this timestamp derives is never adjusted; therefore it is guaranteed to be monotonically increasing."

Then there is a synchronization primitive for each media type. For audio it works as follows:

"ALgetframetime(2) returns an atomic pair of (fnum, time). For an input port, the time returned is the time at which the returned sample frame arrived at the electrical input on the machine. For an output port, the time returned is the time at which the returned sample frame will arrive at the electrical output on the machine. Algetframetime therefore accounts for all the latency within the machine, including the group delay of the A/D and D/A converters."

The SGI video subsystem provides the analogous primitive: vlGetUSTMSCPair(2).

Gestures communicated using MIDI events are also tagged with the same UST clock.

With these primitives in place, the application programmer's job is reduced to scheduling the requisite delay between the various media types so that they are perceived together by the user.
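As a sketch of how these primitives combine, the fragment below estimates how far in the future a sample frame written now will actually be heard. It assumes dmGetUST and ALgetframetime fill in 64-bit UST values in nanoseconds, as the manual excerpts above describe; consult the IRIX reference pages for the exact declarations.

/* Sketch: estimate the delay between "now" and the electrical output of the
   sample frame referenced by ALgetframetime.  Types and headers assumed from
   the IRIX digital media and audio libraries. */
#include <stdio.h>
#include <dmedia/dmedia.h>   /* dmGetUST */
#include <audio.h>           /* ALport, ALgetframetime */

double output_delay_seconds(ALport outport)
{
    unsigned long long now, frame_num, frame_ust;

    dmGetUST(&now);                                  /* current UST, nanoseconds */
    ALgetframetime(outport, &frame_num, &frame_ust); /* when that frame reaches
                                                        the electrical output */

    /* positive for an output port: the referenced frame is still in the future */
    return (double)(long long)(frame_ust - now) * 1.0e-9;
}

An analogous computation with vlGetUSTMSCPair on the video side yields the delay to insert so that image and sound reach the user together.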

Synchronization between machines

In the common situation that there is not enough horsepower in a single machine to do both graphics and audio, we have to achieve synchronization between machines without a common hardware clock. The key is to have the machines on the same LAN and use a time daemon (e.g. timed(1)) to synchronize their clocks.

On a very local area network, this can be achieved to within 1 ms. Although SGI appears to have omitted a system call that provides an atomic (system clock, UST) pair, a reasonable pair can be found on a quiet system by comparing a few dozen repeated requests for the individual times, as sketched below. These pairs then have to be communicated amongst the cooperating machines (since each machine booted at a different time). Each machine can then coordinate media streams against a common and fairly accurate clock. There are many applications, however, where 1 ms of slop is too long. The human ear can easily discern relative delays of a single sample between audio streams, so if any correlation is expected between audio streams (as in 3D audio, spatialization and stereo), all such streams should be computed on the same machine.
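A sketch of that pairing trick, assuming dmGetUST as quoted earlier: bracket each UST reading with two gettimeofday(2) calls and keep the pair with the tightest bracket.

/* Sketch: pair the network-synchronized system clock with the local UST clock
   by keeping the (system time, UST) sample with the smallest measurement gap. */
#include <stdio.h>
#include <sys/time.h>
#include <dmedia/dmedia.h>   /* dmGetUST, assumed as quoted above */

#define TRIES 50

/* Return system time (microseconds since the epoch) and UST (nanoseconds since
   boot) measured as nearly as possible at the same instant. */
void pair_clocks(long long *sys_usec, unsigned long long *ust_nsec)
{
    long long best_gap = -1;
    int i;

    for (i = 0; i < TRIES; ++i) {
        struct timeval before, after;
        unsigned long long ust;
        long long t0, t1, gap;

        gettimeofday(&before, 0);
        dmGetUST(&ust);
        gettimeofday(&after, 0);

        t0 = (long long)before.tv_sec * 1000000 + before.tv_usec;
        t1 = (long long)after.tv_sec  * 1000000 + after.tv_usec;
        gap = t1 - t0;

        if (best_gap < 0 || gap < best_gap) {   /* tightest bracket so far */
            best_gap  = gap;
            *sys_usec = t0 + gap / 2;           /* midpoint of the bracket */
            *ust_nsec = ust;
        }
    }
}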

8b) CAST

CAST is an acronym for CNMAT Additive Synthesis Tools. Here are some reasons we have bet the farm on additive synthesis:

Softcast is an extensible software additive synthesiser, currently tuned for the SGI platform. Extensibility is provided through a plug-in mechanism for defining and transforming spectral descriptions of sounds.

Softcast will be used to demonstrate:

This example demonstrates how time can be stretched and moved forwards and backwards.

In this example of fine timbral control, odd and even partials are modified independently:

This is an exploration of the noise residual from an analyzed saxophone.

A simple physical model of the singing voice:

In the editor below, the horizontal axis is frequency and the vertical axis is amplitude in dB. Each vertical line represents an exponentially decaying sine wave. The top of the line represents the initial amplitude; the length of the line (on a perceptually motivated scale) represents the decay rate of the partial. Longer lines represent more slowly decaying sine waves.

Replacing the glottis with a piano:

9c) Sound Connected to Web applications

Webcast

Webcast is the integration of the CAST software synthesizer as a web browser plug-in. Using specialized Java user interfaces, the following applications will be demonstrated:

Appendix

Links

References

Freed, A. "Bring Your Own Control to Additive Synthesis," Proceedings of the International Computer Music Conference, Banff, Canada, 1995.

Freed, A. "Codevelopment of User Interface, Control and Digital Signal Processing with the HTM Environment," Proceedings of the International Conference on Signal Processing Applications & Technology, Dallas, Texas, 1994.

Freed, A. "Guidelines for Signal Processing Applications in C," C Users Journal, v11 n9 (Sept. 1993): 85.

Freed, A., Rodet, X., Depalle, P. "Synthesis and Control of Hundreds of Sinusoidal Partials on a Desktop Computer without Custom Hardware," Proceedings of the International Conference on Signal Processing Applications & Technology, Santa Clara, CA, 1993.

Freed, A., Rodet, X., Depalle, P. "Performance, Synthesis and Control of Additive Synthesis on a Desktop Computer Using FFT-1," Proceedings of the 19th International Computer Music Conference, Waseda University Center for Scholarly Information, 1993, International Computer Music Association.

Freed, A. "New Tools for Rapid Prototyping of Musical Sound Synthesis Algorithms and Control Strategies," Proceedings of the 18th International Computer Music Conference, San Jose State University, 1992, International Computer Music Association.