9/29/99 updated by Matthew Wright

Organization of This Document

The SDIF standard includes an extensible collection of standard frame and matrix types, listed in this document.

Each standard matrix type exists independently of the standard frame types that must include it; any matrix may appear in a frame of any type. However, for clarity, this document describes each matrix type in the context of the frame type for which it was invented, with a special section at the end for matrix types invented to be a part of any frame.

SDIF Standard Frame Types

The following frame types have been defined as part of the SDIF standard. Each of these frame types has one or more corresponding matrix types. To give a sense of what kind of data is in each frame type, this table also lists the columns of the main matrix type for each frame type.

Frame Type ID   Frame Type                                      Columns of Main Matrix
1FQ0            Fundamental Frequency Estimates                 Fundamental frequency, confidence
1STF            Discrete Short-Term Fourier Transform           Real & imaginary bin values
1PIC            Picked Spectral Peaks                           Freq, amp, phase, confidence
1TRC            Sinusoidal Tracks                               Index, freq, amp, phase
1HRM            Pseudo-harmonic Sinusoidal Tracks               Harmonic partial #, freq, amp, phase
1RES            Resonances / Exponentially Decaying Sinusoids   Freq, amp, decay rate, phase
1TDS            Time Domain Samples                             Channels of sample data

Frame Types to be Standardized

The following sound descriptions should eventually have standard SDIF frame types. We have decided to delay the definition of these types until the base SDIF standard has been accepted by the community. We welcome any ideas or proposals about how to represent this data in SDIF frames.

Conventions Followed By SDIF's Standard Frame and Matrix Types

Time-Domain Samples

Time-domain samples are the typical representation for digitally sampled sound, used by common sound file formats such as WAV and AIFF. The goal of SDIF's time-domain samples frame type is to provide a uniform representation and the convenience of having time-domain samples in the same SDIF file or stream as other sound descriptions, not to codify every ingenious scheme for representing audio in the minimum number of bits. Therefore we restrict this type to linearly quantized samples with no compression.

1TDS frames must contain a 1TDS matrix to hold the samples and a ITDS "time domain info" matrix that says how to interpret the samples:

1TDS matrix:

ITDS matrix:

More columns may be added to the ITDS matrix in the future, including the following:

Unlike most other SDIF frame types, a frame of 1TDS data represents an interval of time (equal to the number of rows in the 1TDS matrix divided by the sampling rate) rather than an instant of time. The time tag of a 1TDS frame represents the beginning of this interval.

Most SDIF streams containing 1TDS data will consist of a single large frame at time zero with all of the samples for the stream in a single matrix. The same data could be represented equivalently as a series of shorter frames: for example, frames containing one-second intervals of sample data at times 0, 1, 2, 3, 4, etc., or unequal-sized frames, e.g., 1.5 seconds at time 0, 2 seconds at time 1.5, 0.7 seconds at time 3.5, 1 second at time 4.2, etc. Note that at a 96 kHz sampling rate, the limit of 2^32 rows in a matrix imposes a limit of about 12.4 hours of sound in a single frame.
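The 12.4-hour figure follows directly from the row limit and the sampling rate. A minimal Python sketch of the arithmetic (the function name is hypothetical and not part of SDIF):

```python
def max_frame_duration_hours(max_rows: int, sampling_rate_hz: float) -> float:
    """Longest interval a single 1TDS matrix can span, in hours."""
    seconds = max_rows / sampling_rate_hz  # one matrix row per sample time
    return seconds / 3600.0


hours = max_frame_duration_hours(2**32, 96_000)
print(round(hours, 1))  # 12.4
```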

There is also the possibility of "gaps" in the time axis, for example, one second of sound in a frame at time 0 followed by more sound in a frame at time 10. In these cases, the stream implicitly contains zero-valued samples in any intervals of time not spanned by sample data in frames. So, in this example, there would be one second of sound, followed by 9 seconds of silence, followed by more sound.

There is also the possibility of 1TDS frames that overlap in the time axis, for example, a frame at time zero with 2 seconds of samples, followed by a frame at time 1 with more samples. In these cases, the semantics are that the sample values are added together.
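The gap and overlap rules can be sketched as a small mixing routine. This is only an illustration under stated assumptions: frames are given as hypothetical (time tag, samples) pairs for a single channel, time tags are non-negative, and each time tag is rounded to the nearest sample position.

```python
def render_1tds(frames, sampling_rate_hz):
    """Mix single-channel 1TDS frames into one list of samples.

    frames: iterable of (time_tag_seconds, samples) pairs (hypothetical layout).
    Gaps stay at zero; samples from overlapping frames are added together.
    """
    frames = list(frames)
    if not frames:
        return []
    # A frame spans len(samples) / sampling_rate seconds, starting at its time tag.
    starts = [round(t * sampling_rate_hz) for t, _ in frames]
    total = max(start + len(s) for start, (_, s) in zip(starts, frames))
    out = [0.0] * total  # implicit silence wherever no frame has data
    for start, (_, samples) in zip(starts, frames):
        for n, value in enumerate(samples):
            out[start + n] += value
    return out


# Toy rate of 4 Hz: a gap from 0.5 s to 1.0 s, then two frames overlapping at 1.5 s.
mixed = render_1tds(
    [(0.0, [1.0, 1.0]), (1.0, [2.0, 2.0, 2.0, 2.0]), (1.5, [0.5, 0.5])],
    sampling_rate_hz=4,
)
print(mixed)  # [1.0, 1.0, 0.0, 0.0, 2.0, 2.0, 2.5, 2.5]
```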

Separate Matrix Type for Annotating Multi-Channel Data

Rather than define some fixed interpretation of multi-channel data like "1 is front left, 2 is front right, 3 is rear left, 4 is rear right", we propose to invent an SDIF matrix type specifically for describing multi-channel data. This would allow simple textual labels like those above, but also precise geometric measurements about exact microphone placement, speaker placement, etc. It would also support textual annotations about the content of each channel, e.g., the name of an instrument on a particular channel of a multi-track recording.

This matrix type would be optional in 1TDS frames, or any other frame type with multi-channel data.

Fundamental Frequency Estimates

Not all sounds have a definite fundamental frequency; some have multiple possible fundamental frequencies. Note that we use the term "fundamental frequency" or "f0" rather than "pitch"; this is because pitch is a perceptual phenomenon while fundamental frequency is a signal processing quantity. We might invent a new SDIF frame and matrix type for pitch to represent the result of a true pitch estimator that applied a model based on human perception.

1FQ0 frames consist of a single 1FQ0 matrix:

Note that this format accommodates estimators that vote amongst fundamental frequency candidates. Each row in the data vector is an estimated fundamental frequency.
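As an illustration of how a reader of 1FQ0 data might use several candidate rows, the sketch below picks the candidate with the highest confidence. It assumes each row is a (fundamental frequency, confidence) pair and that a larger confidence value means a more trusted estimate; the function name is hypothetical.

```python
def most_confident_f0(rows):
    """rows: list of (frequency_hz, confidence) pairs, one per candidate."""
    if not rows:
        return None  # this frame carries no fundamental frequency estimate
    return max(rows, key=lambda row: row[1])[0]


frame = [(220.0, 0.7), (110.0, 0.2), (440.0, 0.1)]
print(most_confident_f0(frame))  # 220.0
```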

Note that this format does not support the notion of "tracking" various fundamental frequency estimates over time. In this respect it is more like the 1PIC frame type than the 1TRC frame type. We are considering adding another frame type for "tracked fundamental frequency estimates" that would include an index for each fundamental frequency.

Discrete Short-Term Fourier Transform/Phase Vocoder

1STF frames represent the data that come out of a discrete short-term time-domain to frequency-domain transform such as an FFT.

Here is a precise mathematical definition of this frame type:

We define the input to the transform, x(n), as follows. Note that the windowed signal is 'put' at the beginning of the vector x(n).

Let x(n) =   s(i+n) * w(n)  for  0 <= n <= M-1
    x(n) =   0              for  M <= n <= N-1

(This is slightly redundant, since we define w(m)=0 when m>=M.)

The 1STF matrix data is the Discrete Fourier Transform (DFT) of size N, i.e. the X(k) as follows.

The DFT is a length N vector X, with these elements:

              N-1
       X(k) = sum  x(n) * exp(-j * 2 * pi * k * n/N)
              n=0

       0 <= k <= N-1

The time tag in a 1STF frame is the time of the center of the window, i.e., (i + M/2)/SR, not the beginning.
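The definition above can be written out directly in Python. This is a sketch, not a reference implementation: it evaluates the DFT sum term by term (a real program would use an FFT), and the function and argument names are hypothetical. Here s is the signal, w the window of length M, i the index of the first sample under the window, N the DFT size, and sample_rate stands for SR.

```python
import cmath


def stf_frame(s, w, i, N, sample_rate):
    """Return (time_tag, X) for the window that starts at sample i."""
    M = len(w)
    # x(n) = s(i+n) * w(n) for 0 <= n <= M-1, then zeros up to N-1.
    x = [s[i + n] * w[n] for n in range(M)] + [0.0] * (N - M)
    # X(k) = sum over n of x(n) * exp(-j * 2 * pi * k * n / N), for 0 <= k <= N-1.
    X = [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]
    time_tag = (i + M / 2) / sample_rate  # center of the window, not its start
    return time_tag, X


time_tag, X = stf_frame(s=[1.0, 1.0, 1.0, 1.0], w=[1.0, 1.0], i=1, N=4, sample_rate=4.0)
print(time_tag)  # 0.5
```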

Notes:

1STF frames consist of an ISTF "info" matrix to record overall information about the transform, plus a 1STF matrix that contains the actual bin data.

STFT info matrix:

Each 1STF frame must also contain a 1WIN matrix specifying the window function.

STFT data matrix:

You can convert these complex numbers into polar form to get magnitude and phase.
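For example, with Python's standard cmath module the conversion is one call. The function name is hypothetical; the phase is returned in radians.

```python
import cmath


def bin_to_polar(real, imag):
    """Return (magnitude, phase_in_radians) for one complex bin value."""
    return cmath.polar(complex(real, imag))


magnitude, phase = bin_to_polar(0.0, -2.0)  # magnitude 2, phase -pi/2
```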

Picked Spectral Peaks

Picked spectral peaks represent peaks (local maxima) in a spectrum at a given time. Peak pickers typically fit some kind of curve to 1STF data, providing frequency, amplitude, and phase estimates that are more accurate than the bins themselves.

1PIC frames consist of a single 1PIC matrix:

The confidence factor might be used to indicate how much of the energy around this peak was from a sinusoid or how well the energy around this peak matches a sinusoid.

Sinusoidal Tracks

Sinusoidal tracks represent sinusoids that maintain their continuity over time as their frequencies, amplitudes, and phases evolve. Sinusoidal tracks are the standard data format used as the input to classical additive synthesis.

1TRC frames consist of a single 1TRC matrix:

Synthesizers of 1TRC frames are expected to match the data for each sinusoid from frame to frame using the index numbers. Values for amplitude and frequency should be interpolated so that they change smoothly from one frame to the next; the interpolation method is left to the synthesizer.
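One way to do this matching is sketched below. Linear interpolation is an assumption made for the example, not something this document prescribes, and the dictionary layout (index mapped to a frequency and amplitude pair) is hypothetical.

```python
def interpolate_partials(frame_a, frame_b, fraction):
    """Interpolate freq and amp for partials that appear in both frames.

    frame_a, frame_b: dicts mapping index -> (freq_hz, amp).
    fraction: 0.0 at frame_a's time tag, 1.0 at frame_b's time tag.
    """
    result = {}
    for index in frame_a.keys() & frame_b.keys():  # match sinusoids by index
        (freq_a, amp_a), (freq_b, amp_b) = frame_a[index], frame_b[index]
        result[index] = (
            freq_a + (freq_b - freq_a) * fraction,
            amp_a + (amp_b - amp_a) * fraction,
        )
    return result


a = {1: (440.0, 0.5), 2: (880.0, 0.2)}
b = {1: (450.0, 0.25), 3: (1320.0, 0.1)}
print(interpolate_partials(a, b, 0.5))  # {1: (445.0, 0.375)}
```

Partials 2 and 3 are skipped here because each appears in only one of the two frames.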

As phase is the integral of the instantaneous frequency over time, the phase values in each frame may not necessarily match a synthesizer's concept of what the phase should be based on the previous phase and the frequency trajectory since the previous phase. Some synthesizers will ignore the phase field or use it only for the initial phase. Others will take the phases into account when interpolating frequencies from frame to frame. Others will "cheat" the desired frequencies to produce the desired phases.

We imagine SDIF utilities that would check the "reasonableness" of phase values based on the frequencies.

There is no guarantee that a partial appearing in one frame will also appear in the next frame. The situation where a partial appears in one frame but not the next is called a "death", and when a partial does not appear in one frame but does appear in the next frame it's called a "birth". These cases are challenging when writing a synthesizer. It's recommended that partials appearing for the first or last time in a series of frames should have amplitudes of zero, so that the semantics of fading in and out are explicitly in the SDIF data rather than needing to be added on by the synthesizer.
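A synthesizer or checking utility could find births and deaths by comparing the index numbers of two consecutive frames. A minimal sketch, with a hypothetical function name and each frame reduced to its set of index numbers:

```python
def births_and_deaths(previous_frame, next_frame):
    """Each argument is a collection of partial index numbers for one frame."""
    previous_ids, next_ids = set(previous_frame), set(next_frame)
    births = sorted(next_ids - previous_ids)  # absent before, present now
    deaths = sorted(previous_ids - next_ids)  # present before, absent now
    return births, deaths


print(births_and_deaths({1, 2, 5}, {2, 5, 7}))  # ([7], [1])
```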

Pseudo-harmonic Sinusoidal Tracks

Pseudo-harmonic sinusoidal tracks frames are exactly like sinusoidal track frames except that the partials are understood to lie on or close to a harmonic series. Thus, the index column of the 1HRM matrix represents harmonic partial number rather than an arbitrary index. Partial numbers start from 1, so the frequency of each pseudo-harmonic sinusoid should be close to the partial number times the fundamental frequency.
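To illustrate "close to the partial number times the fundamental frequency", the sketch below computes, for each partial, the ratio of its frequency to the ideal harmonic frequency. The function name and the (partial number, frequency) row layout are hypothetical, and the fundamental is assumed to be known separately.

```python
def harmonic_ratios(rows, fundamental_hz):
    """rows: (partial_number, freq_hz) pairs, with partial numbers starting at 1.

    Returns freq / (partial_number * f0) per partial; 1.0 means exactly harmonic.
    """
    return {number: freq / (number * fundamental_hz) for number, freq in rows}


ratios = harmonic_ratios([(1, 220.0), (2, 441.0), (3, 659.0)], 220.0)
# ratios[1] is exactly 1.0; ratios[2] and ratios[3] are slightly above and below 1.
```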

Exponentially Decaying Sinusoids/Resonances

Resonance data can describe the characteristics of a resonant system like a group of tuned filter banks, or can specify parameters for a model of sinusoids with fixed frequencies and exponentially decaying amplitudes. (If you put an impulse into such a group of filter banks, the output should be a sum of sinusoids with fixed frequencies and exponentially decaying amplitudes, so these two situations are in a certain sense the same.)

1RES frames consist of a single 1RES matrix:

The decay curve of a resonance should be the same as that of a two-pole filter with bandwidth equal to decay rate divided by pi. This formula gives the amplitude of each sinusoid over time:

	amp(t) = initial_amp * e ^ (- decay_rate * t)

The phase of a resonance specifies the initial phase of each decaying sinusoid. (Thanks to Jean Laroche for suggesting that we include phase in this frame type.)
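The amplitude formula and the bandwidth relation can be sketched in Python as follows. Two details are assumptions made for the example rather than statements of the standard: the sinusoid is written as a cosine, and time is in seconds so that the decay rate is in units of 1/second. The function names are hypothetical.

```python
import math


def resonance_sample(freq_hz, initial_amp, decay_rate, phase, t):
    """Value at time t of one exponentially decaying sinusoid."""
    amp = initial_amp * math.exp(-decay_rate * t)  # amp(t) = initial_amp * e^(-decay_rate * t)
    return amp * math.cos(2 * math.pi * freq_hz * t + phase)  # cosine is an assumption


def bandwidth_hz(decay_rate):
    """Bandwidth of the equivalent two-pole filter: decay rate divided by pi."""
    return decay_rate / math.pi


# With decay_rate = ln 2, the amplitude halves every second.
print(resonance_sample(100.0, 1.0, math.log(2), 0.0, 0.0))  # 1.0
```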

The original SDIF spec included some interesting extra columns for resonances.

