
WAVAlign: Audio/MIDI alignment tool


The flow of data in my current methods of music visualization goes something like this:

That is to say: I (sometimes compose and) perform the music, recording my performance as MIDI data (and sometimes editing it in that form). Then, I run the MIDI data through my custom animation software (which generates individual movie frames) and into a synthesizer (and record the output as digital audio). Finally, I put it all together in movie editing software, add titles, and render the final movie.

This approach guarantees that the video is in sync with the audio because they're both generated from the same data. However, it limits me to MIDI-enabled instruments and to music I can perform myself.

To move beyond this, I need a way to use other people's audio recordings. The question is: how do I get the precise timing data to base the animation on? In the approach I'm currently working on, I will synchronize MIDI to existing audio. Here's what the data flow looks like:

The audio and MIDI data entering the alignment software have the same notes in the same order, but with different timings. The MIDI data exiting the alignment software has the same note timings as the audio. From there, the process is the same as it is now: write the animation software, generate the movie frames, combine them with the audio, add titles, render.

Progress (design & implementation notes)

I chose Tarrega's guitar piece Recuerdos de la Alhambra as my first example. There were two reasons for this choice: guitar music should be relatively simple to align, and my friend James Edwards gave me permission to use his recording of the piece. Here's what his recording sounds like: MP3 of opening eight bars.

I entered the score into the Sibelius notation software; here are the first eight bars ...

... and here's what the audio generated from the MIDI version of the score sounds like.

As a first reality check, I generated gammatone spectrograms of the two versions to verify that they were similar; here's the original ...

... and here's the audio generated from the MIDI...

As I expected, the MIDI version is much more regular, but a lot of the main landmarks are in the same places.

Next, because I'm mostly concerned about getting the starts of notes aligned, I calculated the positive spectral changes:

Again, the MIDI version is more regular, but the structure of the two is similar enough to give me confidence.

In fact, even the sums of these spectral deltas are pretty similar:


Now, the question is: how to align these?

In the last couple of years, I'd seen self-similarity plots for music (for an example, see Jonathan Foote's article Visualizing Music and Audio using Self-Similarity), and I realized that I could remove the 'self-' from this to create a similarity plot for my two audio streams.

This image shows the similarity between the two time series shown above (just the first 5 seconds):

In this image, the most similar cells (zero difference) are dark blue, with lighter colors indicating less similarity (greater difference). The light lines running vertically and horizontally are where a note onset (positive spectral delta) in one recording is being compared to a non-onset region in the other recording. Where these lines intersect, there's a note onset in both recordings; those points of intersection are darker because the difference is smaller. In this view, the inter-file alignment will be represented by a diagonal line running from the upper-left-hand corner to the lower-right-hand corner; most of the time, it will pass through the dark points at the intersections of lines.


Okay, now I need to get real: do I want to use a gammatone filter or an FFT filter? Here's what the gammatone looks like ...

... and here's a multi-resolution FFT for the same passage:

The FFT is "noisier," but in the context of a tool where I can disregard the glitches (since I'll be looking at the images myself, not expecting the software to do all the analysis), I think it's preferable.


What FFT should I use? I'm currently considering FFTW and Intel's Math Kernel Library. Intel says that theirs is faster, but it's not free, so I think I'll start with FFTW and see whether it's fast enough.


It occurred to me that I'd want to see both the spectrograms and the similarity matrix. Here's how that might look:

I tried arranging this so that the two spectrograms were more visually equal, but I couldn't figure out a way to do that without violating other constraints. I decided that since the audio was the more "real" of the two (and, as the one that wouldn't be altered, the "foundation"), it should be on the bottom, with time flowing from left to right.


This weekend I finally got my "FFT set" built in my C++ environment (previous tests were all done in Matlab). The FFT set is a software module that does FFTs of various lengths to get a reasonable compromise between good time resolution at high frequencies (which is only possible with short FFTs) and good frequency resolution at low frequencies (which is only possible with long FFTs). Here's what a speech sample looks like: