1. Introduction
The emergence of real-time audio synthesis on desktop computer systems is providing musicians and sound designers with richer and more complex control over sounds. These sounds are specified by different representations, including time-domain waveforms, frequency-domain sinusoidal components, physical models and resonance models. OpenSoundEdit is a sound editing system that provides a three-dimensional user interface to visualize and edit complex sounds composed from these different sound representations.
Sound in most popular digital audio systems is represented as a time-dependent sequence of audio samples, called a waveform. Waveform representations of sound are similar to image representations for visual scenes, allowing transformations over sections of the array, but very little control over individual perceptual components. In graphics, geometric modeling is preferred over images when such control is required. Similarly, sound representations based on synthesis use a function, called a sound model, and a template, called a timbral prototype, which together are used to produce a sound. Just as complex graphical scenes can be built by transforming geometric models, sounds in synthesizers are produced by performing transformations on timbral prototypes.
Traditional hardware music synthesizers [A85, K89] motivated the development of synthesis models based on timbral prototypes and transformations. Early synthesizers were constructed using custom-designed hardware. Today, modern computers are fast enough to implement a synthesis engine in software. Software synthesizers offer increased flexibility and extensibility compared to hardware synthesizers. A variety of model and algorithm combinations have been developed on these software instruments, including large-scale additive synthesis [FRD93], waveguide physical modeling [S96] and resonance models [BBF89].
A sound visualization and editing system is needed that provides control
over different sound representations including synthesis models. This editor
is a client of a synthesis server. A synthesis server is an application
that accepts sound representations and control messages from clients and
generates waveforms for audio output in real time, as shown in Figure
1.1.
An effective client interface should satisfy several properties.
[Figure 1.1: A synthesis server and its clients]
OpenSoundEdit, or OSE, extends the ideas found in earlier sound editing and visualization systems.
2. An Overview of Sound Representations and Synthesis
Sounds are most commonly represented using a waveform signal,
a function x(t) describing the amplitude at time t.
Sounds can also be described using parameterized functions, called sound
models. The process of producing a waveform representation from a sound
model representation is called synthesis:
$x(t) = f(p_1, p_2, \ldots, p_n)(t)$   (Equation 2.1)

where $f$ is the sound model and the parameters $p_1, \ldots, p_n$ form the timbral prototype.
OSE currently supports three types of synthesis: sampling, additive synthesis and resonance modeling. It can be easily extended to support other types of synthesis. The remainder of this section describes the three synthesis models supported by OSE.
The first synthesis type, sampling synthesis, builds an output
waveform from input waveforms scaled by a supplied amplitude and frequency.
It can be described using the following function:
$y(t) = a\,x(ft)$   (Equation 2.2)

where $x$ is the input waveform, $a$ is the amplitude scale factor and $f$ is the frequency scale factor.
Although the equations presented for these synthesis models are continuous
functions of time, the actual implementations in the synthesis server use
discrete formulae that replace time with samples. For example, the discrete
case of sampling synthesis uses the following equation:
$y[n] = a\,x[fn]$   (Equation 2.3)

where $n$ is the sample index.
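As a concrete illustration, here is a minimal Tcl sketch of Equation 2.3. It assumes the input waveform is held in a Tcl list and uses linear interpolation for fractional input positions; the server's actual interpolation scheme is not specified here.

```tcl
# Minimal rendering of Equation 2.3 with linear interpolation for
# fractional input positions. "wave" is a Tcl list of input samples;
# a and f are the amplitude and frequency scale factors.
proc sample_synth {wave a f n} {
    set pos [expr {$f * $n}]        ;# (possibly fractional) input index
    set i [expr {int($pos)}]
    set frac [expr {$pos - $i}]
    set x0 [lindex $wave $i]
    set x1 [lindex $wave [expr {$i + 1}]]
    if {$x0 eq ""} { return 0.0 }   ;# past the end of the input
    if {$x1 eq ""} { set x1 $x0 }   ;# clamp at the final sample
    expr {$a * ((1.0 - $frac) * $x0 + $frac * $x1)}
}

# e.g., play the waveform at half speed and 80% amplitude:
# sample_synth {0.0 0.5 1.0 0.5 0.0} 0.8 0.5 3  ==> 0.6
```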
The second supported synthesis model is additive synthesis. In
additive synthesis, sounds are modeled using sinusoid functions whose amplitude,
frequency and phase change over time. Each sinusoid is described using
the following equation:
$x_i(t) = a_i(t)\sin\big(2\pi f_i(t)\,t + \phi_i(t)\big)$   (Equation 2.4)

where $a_i(t)$, $f_i(t)$ and $\phi_i(t)$ are the time-varying amplitude, frequency and phase of the $i$-th sinusoid.
The output waveform is the sum of the $N$ component sinusoids:

$x(t) = \sum_{i=1}^{N} x_i(t)$   (Equation 2.5)
The third type of synthesis supported is resonance modeling.
Sounds in which the frequencies of the partials remain constant and the
amplitudes are exponentially decaying functions of time are referred to as
resonances. Resonance models can be used to describe a wide variety of
sounds, including many musical instruments, such as piano strings and
percussion instruments, as well as the human vocal tract. Sounds based
on resonance models are represented
more efficiently using a special form of the additive synthesis model:
$x(t) = \sum_{i=1}^{N} a_i\,e^{-k_i t}\sin\big(2\pi f_i t + \phi_i\big)$   (Equation 2.6)

where $a_i$, $k_i$, $f_i$ and $\phi_i$ are the gain, decay rate, frequency and phase of the $i$-th resonance.
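To make the model concrete, the following Tcl sketch evaluates Equation 2.6 at a single point in time. The list-based prototype encoding is invented for illustration and is not OSE's internal representation.

```tcl
# Illustrative only: evaluate a resonance model (Equation 2.6) at time t.
# Each partial is a {gain decayRate frequency phase} list.
proc resonance_sample {partials t} {
    set pi 3.14159265358979
    set sum 0.0
    foreach p $partials {
        lassign $p a k f phi
        set sum [expr {$sum + $a * exp(-$k * $t) * sin(2*$pi*$f*$t + $phi)}]
    }
    return $sum
}

# e.g., two decaying partials at 220 Hz and 440 Hz, sampled at t = 0.1 s:
puts [resonance_sample {{1.0 3.0 220.0 0.0} {0.5 5.0 440.0 0.0}} 0.1]
```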
Table 2.1 summarizes the three types of synthesis
used to model sounds in OSE.
3. The OpenSoundEdit Interface

The OSE interface is illustrated in Figure 3.1.
[Figure 3.1: The OpenSoundEdit interface]
Each view contains a coordinate system drawn on the ground plane, a visualization of the data set (i.e., the yellow lines in Figure 3.1), one or more moveable selection frames (outlined with white bars), and 2D projections of the data set onto the frames (i.e., the black lines). The data set visualizations will be described in detail in the following subsections. The selection frames are used for marking a position on one of the axes. For example, the ResonanceEditor view in Figure 3.1 has selection frames in the frequency-amplitude and ground planes. The frame in the frequency-amplitude plane can be moved along the time axis, and the projected black lines will change to reflect the value of the resonance model at the selected time.
Editing operations are performed via direct manipulation in the 3D display window, or using external controls, as shown at the bottom of Figure 3.1. The behavior of these edit controls is specific to the type of view selected.
The following subsections describe the Track, Resonance and Waveform
views in greater detail.
3.1. Track Editor View
The TrackEditor view is used to display and edit additive synthesis
prototypes (i.e., sinusoidal tracks). Tracks are rendered as connected
line segments representing the change in amplitude and frequency over time.
Time is measured in seconds. Frequency and amplitude are measured in Hertz
and decibels, respectively, and plotted on logarithmic scales. Phase is
not shown. An example of an additive synthesis representation used in the
TrackEditor is illustrated in Figure 3.4.
[Figure 3.4: An additive synthesis representation in the TrackEditor]
The TrackEditor includes a time-selection frame parallel to the frequency-amplitude
plane. It can be moved along the time axis, as illustrated in Figure
3.5. The interpolated frequency and amplitude values of the tracks
at the selected time are projected onto the frame as black lines.
[Figure 3.5: Moving the time-selection frame along the time axis]
The time-selection frame can also be used to control time manually when
the prototype is being realized on a synthesis server. This technique is
known as scrubbing in the professional audio community. When scrubbing
is used, the values of the tracks are held constant at the current time
window position T:
$x(t) = \sum_{i=1}^{N} a_i(T)\sin\big(2\pi f_i(T)\,t + \phi_i(T)\big)$   (Equation 3.1)
Tracks can be selected by clicking on the white bar at the top of the
time-selection frame and dragging the pointer along the range of frequency
values to be selected, as shown in Figure 3.6a.
The maximum and minimum frequencies of the selection are indicated in Hertz
and standard pitch notation (i.e., degree and octave). Selected tracks
can be copied into other TrackEditors using the traditional cut, copy and
paste operations. Tracks can also be scaled by a constant factor along
the amplitude axis or the frequency axis (i.e., transposed), as illustrated
in Figures 3.6b and 3.6c.
Selection can also be used to play or scrub a subset of tracks on a synthesis server instead of the entire prototype.
In addition to scaling in the amplitude and frequency dimensions, the
user can reshape a track. This operation is performed by selecting and
scaling individual data points within specified tracks. Only those points
of the specified tracks that intersect the time window are scaled.
3.2. Resonance Editor View
The ResonanceEditor is used to view and edit resonance timbral prototypes.
Recall the resonance model from Equation 2.6:
$x(t) = \sum_{i=1}^{N} a_i\,e^{-k_i t}\sin\big(2\pi f_i t + \phi_i\big)$   (Equation 3.2)

Setting the amplitude envelope $a_i e^{-k_i t}$ equal to a threshold $A_{\min}$ and solving for $t$ yields the time at which the $i$-th resonance decays to that threshold:

$t_i = \frac{1}{k_i}\ln\frac{a_i}{A_{\min}}$   (Equation 3.3)
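For example, a resonance with gain $a_i = 1.0$ and decay rate $k_i = 5$ decays to a threshold of $A_{\min} = 0.001$ at $t_i = \ln(1000)/5 \approx 1.38$ seconds.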
The ResonanceEditor view includes moveable time-selection and amplitude-threshold
frames parallel to the frequency-amplitude and time-frequency (i.e., ground)
planes, respectively. As in TrackEditor views, the time-selection frame
can be moved along the time axis. Lines representing the amplitude values
at the selected time are projected on the frame, as shown in Figure
3.8a. The threshold frame can be moved along the amplitude axis, changing
the value of $A_{\min}$. Lines indicating how long each resonance lasts
until its energy reaches the selected threshold
are drawn on the frame, as shown in Figure 3.8b.
The time is calculated using Equation 3.3.
[Figure 3.8: (a) the time-selection frame; (b) the amplitude-threshold frame]
As in the TrackEditor view, the time-selection frame can be used for
scrubbing. During scrubbing, the energy values of the sinusoids are held
constant at time position T:
$x(t) = \sum_{i=1}^{N} a_i\,e^{-k_i T}\sin\big(2\pi f_i t + \phi_i\big)$   (Equation 3.4)
The user can play or scrub a group of selected resonances on a synthesis server, as well as the entire prototype.
3.3. Waveform View
The Waveform view displays sampled sounds as time-domain waveforms. Waveforms have only two dimensions, time and amplitude. However, waveforms are often separated into channels, representing different audio outputs. For example, a stereo sound has two channels. A Waveform view contains sequences of samples separated into one or more channels, as illustrated in Figure 3.10. The x-axis represents time, the y-axis represents amplitude on a scale from -1 to 1, and the z-axis represents the channel as an integer value.
Waveform views do not have any editing controls. They are primarily used to view the output waveform of a synthesis server, or sampled sounds to be loaded by a server.
4. Implementation Issues
OSE is a client process that connects to a real-time synthesis server, as shown in Figure 4.1. It communicates with the server using the Open Sound Control (OSC) [WF97] protocol and uses the Sound Description Interchange Format (SDIF) [SDIF97] to read and write sound representations in files and share them with other users and applications.
The remainder of this section describes the implementation in greater detail. Section 4.1 describes the synthesis server and the OSC protocol, section 4.2 describes SDIF, and section 4.3 discusses the implementation of graphics and user-interface objects.
4.1. Synthesis Server Communication
A synthesis server is an application that accepts sound representations and control messages from clients, and generates waveforms for audio output in real time [F94]. OSE uses a synthesis server to play the sounds being edited.
The synthesis server can write samples to an audio device, disk file or network device. It can run on the same machine as OSE, or on a separate machine. If the server is on a separate machine, it can write samples to its own audio device, or send them back to the client over a network.
The following examples illustrate the Tcl commands used by OSE to open a synthesis server connection and execute various operations (e.g., load and play a sound file, and define and manipulate sound representations). Note that users do not normally enter these commands; rather, OSE executes them in response to user actions (e.g., pressing buttons or direct manipulation of the views).
OSE uses an Otcl object, called SynthesisServer, to access these services. A new SynthesisServer object is instantiated by supplying the address of the server host and the port on which the server listens, as in the example below.
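The following command is a sketch; the object name, host and argument syntax are illustrative rather than OSE's exact interface.

```tcl
# Hypothetical syntax: create a SynthesisServer object bound to a
# synthesis server listening on port 7005 of the local machine.
SynthesisServer synth localhost 7005
```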
The SynthesisServer object binds these high-level methods (e.g., LoadResonance, PlayAt, etc.) to messages sent to the server using the OSC protocol. This protocol requires reliable delivery and bounded delay. The bounded delay is needed for commands that change play parameters during real-time synthesis. Reliable delivery is required for commands and data sets that are larger than one packet.
A synthesis server is implemented using an object hierarchy that represents the various transformations it uses to shape the sound output. These transformations include scaling operations, control of time and conversion from one sound representation to another. OSC uses a hierarchical address to identify a synthesizer object and the messages being sent to it. The address is followed by the message arguments. An example message is:
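The address below is invented for illustration; each server defines its own address hierarchy [WF97].

```
/voices/1/amplitude/scale 0.5
```

This message addresses the amplitude-scaling transformation of a voice object in the hypothetical hierarchy and passes it a single argument, the scale factor 0.5.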
4.2. Sound Description Interchange Format
The Sound Description Interchange Format (SDIF) is a data format for
storing and exchanging sound representations. An SDIF stream is a sequence
of frames arranged in time-ascending order, as illustrated in Figure
4.2. Each frame has a time stamp and a tag corresponding to one of
the registered frame types listed in Table 4.1,
followed by the actual sound representation for the frame.
[Figure 4.2: An SDIF stream]

[Table 4.1: Registered SDIF frame types]
Because the sound representation type is determined on a frame-by-frame basis, it is possible to mix different sound representations in a single SDIF stream. Each sound representation that is part of a stream is given a unique identification number.
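To make the frame structure concrete, the following Tcl sketch reads a stream of frames from a file. The byte layout used here (a four-character type tag, a big-endian double time stamp and a four-byte body size) is a simplification for illustration only; the normative encoding is defined in [SDIF97].

```tcl
# Read SDIF-like frames from a file. The 16-byte header layout below
# (4-byte type tag, 8-byte big-endian double time stamp, 4-byte
# big-endian body size) is illustrative, not the normative SDIF format.
proc read_frames {path} {
    set f [open $path rb]
    set frames {}
    while {[string length [set header [read $f 16]]] == 16} {
        binary scan $header a4QI type time size
        set body [read $f $size]
        lappend frames [list $type $time $size $body]
    }
    close $f
    return $frames
}
```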
More details on the SDIF representation are presented elsewhere [SDIF97].
4.3. Implementation of OSE
OSE is built on a suite of portable technologies, as shown in Figure 4.4. VTK, the Visualization Toolkit [SML96], is used for modeling 3D objects. Tcl/Tk and Otcl are used to bind user input (i.e., from direct manipulation of the 3D objects or from external Tk widgets) to editing operations and OSC commands for the synthesis server.
VTK is a portable C++ library for 3D graphics, and is particularly well
suited for data visualization. VTK provides three basic abstractions: datasets,
mappers and actors. Mappers convert datasets into graphics primitives.
Actors are geometric modeling objects that contain other actors or graphics
primitives produced by mappers. OSE uses special mappers for track and
resonance representations. Each OSE view is a complex actor that contains
the graphics output of a track or resonance mapper and actors representing
the selection frames. The selection frames are themselves composed of primitive
actors representing pipes and corners. Figure 4.5
illustrates the implementation of TrackEditor and ResonanceEditor views
using VTK objects.
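The sketch below illustrates the dataset-mapper-actor pattern using VTK's Tcl bindings and stock VTK classes; OSE substitutes its custom track and resonance mappers for the generic one shown here, and the snippet assumes a VTK release whose Tcl bindings accept SetInput.

```tcl
package require vtk    ;# VTK's Tcl bindings

# A dataset holding line geometry (e.g., sinusoidal tracks).
vtkPolyData tracks

# A mapper converts the dataset into graphics primitives; OSE would use
# its custom track or resonance mapper in place of this generic one.
vtkPolyDataMapper trackMapper
trackMapper SetInput tracks

# An actor places the mapper's output in the scene, and a renderer
# draws all of its actors.
vtkActor trackActor
trackActor SetMapper trackMapper
vtkRenderer ren
ren AddActor trackActor
```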
In addition to providing abstractions, VTK also includes useful graphics
optimizations, such as display lists, in which graphics primitives produced
by a mapper are cached until the associated data set is modified. These
optimizations are particularly important when visualizing larger prototypes.
5. Discussion
The 3D interface adopted in OSE provides the user with a more expressive and intuitive view of sound representations than can be provided on a 2D display. The varying third dimension can be used to express parameters that previously required special mappings onto two dimensions [BBF89, F97, F87]. The unified display allows the user to work with sounds across different representations without separate 2D windows or the clutter of 2D overlays. Changing the viewpoint and moving views on the ground plane make it easier for the user to locate components of a sound representation (e.g., resonance partials) that are responsible for a particular characteristic of a sound. Direct manipulation allows efficient editing of sound representations using familiar gestures, such as pulling, stretching and sweeping.
Early tests underscored the need for well-known 3D interface metaphors. For example, early versions of OSE displayed sounds on a completely black background and allowed the user unconstrained translation and rotation of the viewpoint. Such an interface presents the user with the metaphor of "flying through space," which is unfamiliar and disorienting. The addition of the ground and sky metaphors provides a familiar orientation, which is further enhanced by constraining movement to be parallel to the ground plane. Furthermore, properties of familiar 3D environments can be used to provide additional useful information. For example, the black lines projected onto the time and threshold frames use a familiar "shadow" metaphor for projecting the three-dimensional representations onto two dimensions.
User feedback for the current implementation has been encouraging. Experienced users of sound editing tools particularly enjoy the ResonanceEditor view. It is easier to identify and modify components of resonances, which have only one data point per partial, than components of sinusoidal tracks, which may have hundreds of points per partial. Modifying the frequency, gain and decay rate parameters of a resonance partial has an intuitive effect on the sound output. The compact representation also allows more tightly synchronized graphics and audio output when editing resonances.
Although the TrackEditor view is useful for visualizing and exploring track representations, the primitive editing capabilities for scaling individual points or entire tracks have proved inadequate. However, these primitive controls can be used as the basis for customized curve and surface deformations [HHK92, LV97]. Advanced users can add their own curve and surface methods to OSE using Tcl and VTK. Techniques suggested by users include a "pillow model," in which data points near a modification are raised or lowered in order to maintain a constant total amplitude [W97a].
The current implementation of OSE contains approximately 1500 lines of Tcl code, plus an additional 1000 lines of C++ to implement the custom VTK mapper classes. It has been run on an SGI O2 with Irix 6.3 and an Intel Pentium Pro 200 MHz with Windows NT Workstation 4.0. The O2 includes its own highly optimized OpenGL graphics engine, and the Intel machine uses a Matrox Millennium graphics accelerator card. Performance is adequate on both platforms when viewing small prototypes, but degrades as the size and number of views increase. In an effort to determine the primary performance bottlenecks in the current implementation, the following test was conducted on the O2 using SGI SpeedShop performance tools. A large timbral prototype was loaded and centered in the display. The time-selection frame was then scrubbed from 0 to 2 seconds in 20 steps. The animated centering operation uses display lists for redrawing the view, while scrubbing uses the mapper to draw the changing projection lines on the time-selection frame. The relative amount of time spent executing OpenGL, Tcl, VTK, the custom VTK mapper classes and standard C library routines was determined for each run of the test. The results on large track and resonance prototypes are presented in Figures 5.1a and 5.1b, respectively. It was discovered that the use of anti-aliased lines [OGL97] severely degrades performance, particularly for large resonance prototypes, so separate tests were done with anti-aliasing enabled and disabled.
Both TrackEditor and ResonanceEditor views spent a significant amount of time in OpenGL system routines. The TrackEditor spent proportionally more time in Tcl and libc routines during initialization because the sinusoidal track prototype was much larger than the resonance prototype (over 50,000 data points compared to 920). Likewise, the VTK TrackMapper requires much more time than the ResonanceMapper to traverse its dataset during redraw operations. Graphics optimizations that could improve the mapping process include culling, in which hidden objects are not rendered, and multi-resolution methods [HG97], in which distant models are approximated using fewer data points. Aside from the custom mapper classes, use of the VTK library incurred no significant execution costs.
Acknowledgements
I gratefully acknowledge the NSF Graduate Research Fellowship Program and Silicon Graphics, Inc. for their support of this research.
References
[A85] J. Allen. "Computer Architectures for Digital Signal Processing." Proceedings of the IEEE, 73(5), 1985.

[BBF89] J-P. Barrière, P-F. Baisnee, A. Freed. "A Digital Signal Multiprocessor and its Musical Application." Proceedings of the 15th International Computer Music Conference, Ohio State University, 1989.

[FRD93] A. Freed, X. Rodet, P. Depalle. "Synthesis and Control of Hundreds of Sinusoidal Partials on a Desktop Computer without Custom Hardware." Proceedings of the International Conference on Signal Processing Applications & Technology, Santa Clara, CA, 1993. http://www.cnmat.berkeley.edu/Research

[F97] http://cnmat.CNMAT.Berkeley.EDU/Research/Resonances/ This page discusses Adrian Freed's 2D representation of resonances in MacMix [F87], and the transition to the ResonanceEditor in OSE.

[F94] A. Freed. "Codevelopment of User Interface, Control and Digital Signal Processing with the HTM Environment." Proceedings of the International Conference on Signal Processing Applications & Technology, Dallas, TX, 1994. http://www.cnmat.berkeley.edu/Research

[F87] A. Freed. "Recording, Mixing, and Signal Processing on a Personal Computer." Proceedings of the AES 5th International Conference on Music and Digital Technology, 1987. http://www.cnmat.berkeley.edu/Research

[HG97] P. Heckbert, M. Garland. "Survey of Surface Simplification Algorithms." Notes from SIGGRAPH 97 Course on Multiresolution Surface Modeling, Los Angeles, CA, 1997.

[HHK92] W.M. Hsu, J.F. Hughes, H. Kaufman. "Direct Manipulation of Free-Form Deformations." Proceedings of SIGGRAPH 92, Chicago, IL, 1992.

[K89] S.J. Kaplan. "Developing a Commercial Digital Sound Synthesizer." The Music Machine. MIT Press, Cambridge, MA, 1989.

[LV97] J.C. Leon, P. Veron. "Semiglobal Deformation and Correction of Free-form Surfaces Using a Mechanical Alternative." Visual Computer, 13(3), Springer-Verlag, 109-126, 1997.

[OGL97] M. Woo, J. Neider, T. Davis. OpenGL Programming Guide, 2nd ed. Addison-Wesley, Reading, MA, 1997.

[P94] A.W. Peevers. "Real-time 3D Signal Analysis/Synthesis Tool Based on the Short Time Fourier Transform." Master's Degree Report, University of California at Berkeley, 1994.

[RFL96] X. Rodet, D. François, G. Levy. "Xspect: A New X/Motif Signal Visualisation, Analysis and Editing Program." Proceedings of the 22nd International Computer Music Conference, Hong Kong, 1996.

[RS92] L.A. Rowe, B.C. Smith. "A Continuous Media Player." Proceedings of the 3rd International Workshop on Network and Operating System Support for Digital Audio and Video, San Diego, CA, 1992. http://bmrc.berkeley.edu/papers

[SDIF97] Sound Description Interchange Format (SDIF), http://www.cnmat.berkeley.edu/SDIF

[SML96] W. Schroeder, K. Martin, B. Lorensen. The Visualization Toolkit: An Object-Oriented Approach to 3D Graphics. Prentice Hall, Upper Saddle River, NJ, 1996.

[S96] J.O. Smith. "Physical Modeling Synthesis Update." Computer Music Journal, 20(2), 44-56, 1996.

[S97] The softcast synthesizer is documented at http://cnmat.CNMAT.Berkeley.EDU/CAST/

[W97] B.B. Welch. Practical Programming in Tcl & Tk, 2nd Edition. Prentice Hall, Upper Saddle River, NJ, 1997.

[W97a] Personal communication with David Wessel, CNMAT, 1997.

[WL95] D. Wetherall, C.J. Lindblad. "Extending Tcl for Dynamic Object-Oriented Programming." Proceedings of the Tcl/Tk Workshop 1995, Toronto, Ontario, 1995.

[WF97] M. Wright, A. Freed. "OpenSound Control: A New Protocol for Communicating with Sound Synthesizers." Proceedings of the 23rd International Computer Music Conference, Thessaloniki, Greece, 1997. http://www.cnmat.berkeley.edu/Research