For those interested in the actual song key identification research, the source code for that is available here. The algorithm we used was based on hidden Markov model classifiers and beat-synchronous chromagram audio features.
Extracting annotations from a MIDI file is often much simpler than from other formats because the timing information, instruments, and notes are all encoded in the file itself. Unfortunately, the default MIDI soundbanks on most computers are pretty terrible and don't sound like real instruments or musicians, so you don't want to use them for evaluating or training MIR systems. However, there are many realistic soundfonts available online that can make a MIDI rendering sound close to real instruments. To avoid overfitting our system to specific sounds, we wanted real audio training sets with diverse instrumentation across genres (rock, jazz, classical, metal, etc.).
The following Python script takes a directory of MIDI files and a directory of SF2 soundfont files and generates corresponding audio files (wav, aiff, mp3, etc.). A soundfont is chosen at random for each MIDI file. To use this script, you need to have FluidSynth installed.
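A minimal sketch of such a script might look like the following. This is not the original listing; the function names are illustrative, and it assumes the `fluidsynth` binary is on your PATH and renders to wav only.

```python
import os
import random
import subprocess

def build_fluidsynth_cmd(sf2, midi, wav, sample_rate):
    # -ni: non-interactive, no MIDI input; -F: render to a file
    # instead of the audio device; -r: output sample rate.
    return ["fluidsynth", "-ni", "-F", wav, "-r", str(sample_rate), sf2, midi]

def render_midi_dir(midi_dir, sf2_dir, out_dir, sample_rate=44100):
    """Render every MIDI file in midi_dir to wav, using a randomly
    chosen SF2 soundfont from sf2_dir for each file."""
    soundfonts = [os.path.join(sf2_dir, f) for f in os.listdir(sf2_dir)
                  if f.lower().endswith(".sf2")]
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(midi_dir):
        if not name.lower().endswith((".mid", ".midi")):
            continue
        midi_path = os.path.join(midi_dir, name)
        wav_path = os.path.join(out_dir, os.path.splitext(name)[0] + ".wav")
        cmd = build_fluidsynth_cmd(random.choice(soundfonts), midi_path,
                                   wav_path, sample_rate)
        subprocess.run(cmd, check=True)
```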
Before we start, we need a data representation for complex numbers and a pure trait to test different FFT functions. Note that the FFT trait specifies a Numeric type class so it can work with any sequence of numbers.
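The original listing isn't reproduced here; a minimal sketch of such a representation and trait might look like this (names and operator set are illustrative):

```scala
// A minimal complex number representation with the arithmetic
// the FFT needs: addition, subtraction, and multiplication.
case class Complex(re: Double, im: Double) {
  def +(that: Complex) = Complex(re + that.re, im + that.im)
  def -(that: Complex) = Complex(re - that.re, im - that.im)
  def *(that: Complex) =
    Complex(re * that.re - im * that.im, re * that.im + im * that.re)
}

// A pure trait for testing different FFT implementations.
// The Numeric type class lets fft accept any sequence of numbers
// (Int, Double, ...), converting them to Complex internally.
trait FFT {
  def fft[T](data: Seq[T])(implicit num: Numeric[T]): Seq[Complex]
}
```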
As mentioned in the previous post, the Cooley-Tukey algorithm requires that the data length be a power of 2. All of our equations in the previous post were in terms of the complex exponential function (e^{ix}). Using Euler's formula, we can instead rely on sine and cosine functions in our implementations.
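Concretely, Euler's formula gives:

```latex
e^{ix} = \cos x + i \sin x
\quad\Rightarrow\quad
e^{-2\pi i k n / N} = \cos\!\left(\frac{2\pi k n}{N}\right) - i \sin\!\left(\frac{2\pi k n}{N}\right)
```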
The recursive nature of the standard Cooley-Tukey algorithm lends itself nicely to a pure functional implementation. Since we should always prefer pure functional code, we'll start there.
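The original ~30-line listing isn't preserved here; a recursive radix-2 implementation along these lines might look like the following sketch (it takes `Seq[Complex]` directly, skipping the Numeric conversion for brevity):

```scala
object RecursiveFFT {
  case class Complex(re: Double, im: Double) {
    def +(that: Complex) = Complex(re + that.re, im + that.im)
    def -(that: Complex) = Complex(re - that.re, im - that.im)
    def *(that: Complex) =
      Complex(re * that.re - im * that.im, re * that.im + im * that.re)
  }

  // Radix-2 decimation-in-time FFT; data length must be a power of 2.
  def fft(data: Seq[Complex]): Seq[Complex] = data.length match {
    case 1 => data
    case n =>
      // Recursively break the data into smaller DFTs by even/odd index.
      val evens = fft(data.zipWithIndex.collect { case (x, i) if i % 2 == 0 => x })
      val odds  = fft(data.zipWithIndex.collect { case (x, i) if i % 2 == 1 => x })
      // Phase (twiddle) factors W_N^k = e^{-2*Pi*i*k/N}, via Euler's formula.
      val twiddled = odds.zipWithIndex.map { case (o, k) =>
        val phi = -2 * math.Pi * k / n
        Complex(math.cos(phi), math.sin(phi)) * o
      }
      // Recombine using symmetry:
      //   X(k)       = E(k) + W^k O(k)   for 0 <= k < N/2
      //   X(k + N/2) = E(k) - W^k O(k)
      evens.zip(twiddled).map { case (e, o) => e + o } ++
        evens.zip(twiddled).map { case (e, o) => e - o }
  }
}
```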
This actually follows the mathematical definition pretty closely and is concise and readable. Note that we are recursively breaking the data into smaller DFTs by even and odd indexes. We're calculating the phase (twiddle) factors separately and relying on the symmetry properties of the DFT to recombine the values for 0 ≤ k < N/2 and N/2 ≤ k < N.
So how well does this algorithm perform? A first try with 1024 random Double values on my machine takes ~100 ms. OK, let's see how it does once the machine warms up. If we run 10 random sequences (size 1024) in a row, we get:
We can see that the timing is starting to settle. After running 1000 iterations, we get an average of ~3 ms per fft call.
It's no secret that optimizing Scala code can sometimes be ugly (see Erik Osheim's Premature Optimization). So let's see what happens if we move toward an imperative version of the FFT.
The following is basically a translation of the algorithm from Apache Commons Math into Scala. It is still based on the Cooley-Tukey algorithm, but the implementation is much more verbose and harder to follow than the recursive version.
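The original ~120-line translation isn't reproduced here; a heavily condensed sketch of the same approach, showing the two points discussed below (the up-front bit-reverse shuffle and the precomputed twiddle factors) while operating in place on primitive arrays, might look like:

```scala
object IterativeFFT {
  // In-place radix-2 FFT on parallel real/imaginary arrays.
  // Length must be a power of 2. A condensed sketch, not the
  // full Commons Math translation.
  def fft(re: Array[Double], im: Array[Double]): Unit = {
    val n = re.length
    // 1. Bit-reverse shuffle, so a simple linear traversal visits the
    //    data in the order the recursive even/odd splitting would.
    var j = 0
    for (i <- 0 until n - 1) {
      if (i < j) {
        val tr = re(i); re(i) = re(j); re(j) = tr
        val ti = im(i); im(i) = im(j); im(j) = ti
      }
      var bit = n >> 1
      while (j >= bit) { j -= bit; bit >>= 1 }
      j += bit
    }
    // 2. Butterfly passes with precomputed W_N^k real/imaginary factors.
    var len = 2
    while (len <= n) {
      val half = len / 2
      val wRe = Array.tabulate(half)(k => math.cos(-2 * math.Pi * k / len))
      val wIm = Array.tabulate(half)(k => math.sin(-2 * math.Pi * k / len))
      var start = 0
      while (start < n) {
        for (k <- 0 until half) {
          val a = start + k; val b = a + half
          val tRe = wRe(k) * re(b) - wIm(k) * im(b)
          val tIm = wRe(k) * im(b) + wIm(k) * re(b)
          re(b) = re(a) - tRe; im(b) = im(a) - tIm
          re(a) += tRe;        im(a) += tIm
        }
        start += len
      }
      len <<= 1
    }
  }
}
```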
Yikes, we went from roughly 30 lines of code to over 120. Let's look at a few points in this algorithm, though. Since we are no longer recursively selecting even/odd indexes, we perform a bit-reverse shuffle of the data up front. This allows us to traverse the data in essentially the same order. Also, note that the W_{N}^{k} real and imaginary factors are precomputed. This, combined with the fact that we are operating on raw arrays and avoiding boxing and unboxing, should make this algorithm noticeably faster. Let's see how much.
Using the same test as before, our first try with 1024 random samples takes ~20 ms. Next up, let's test with the warmup using 10 iterations:
After 1000 iterations, it takes an average of ~0.19 ms per fft call.
The following table shows a side-by-side comparison of both algorithms. The times were averaged over 1000 iterations at increasing frame sizes.
Frame Size    Recursive Time (ms)    Imperative Time (ms)
512           1.56                   0.14
1024          2.98                   0.19
2048          6.04                   0.27
4096          13.18                  0.47
The imperative algorithm is clearly faster, but much more verbose and harder to understand.
Scala sometimes gets knocked for allowing both OO/imperative and functional styles of coding. In my opinion, this is actually a huge benefit for the language. You can favor the functional style and resort to imperative code in the cases where performance is critical. Those cases can be isolated and their details hidden. Looking at our imperative algorithm above, the fft function can still be referentially transparent, as long as the mutation is confined to data that never escapes the call.
The Fourier transform is a mapping function that takes a series of samples (or a function) in the time domain and maps them into the frequency domain. The transform is based on the Fourier series, which is an expansion of a periodic function or signal into a sum of simpler sine and cosine functions.
Looking at the example above, the periodic time data can be described as the sum of 4 sinusoidal functions with frequencies at 110, 220, 330, and 440 Hz. So how does this mapping work? Unfortunately, most descriptions of the Fourier transform (and its inverse) jump right into the following math with little explanation:
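For reference, the math in question is the standard continuous Fourier transform pair:

```latex
F(k) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i k x}\, dx
\qquad\qquad
f(x) = \int_{-\infty}^{\infty} F(k)\, e^{2\pi i k x}\, dk
```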
It is often difficult to grasp how we're actually mapping from the time domain f(x) to the frequency domain F(k) (and vice versa) with these equations. To understand how this works, we first need to look at a few important properties of spectral analysis. Let's start by taking two periodic signals A and B, where A is our input signal and B is a signal we are generating:
If we multiply these signals together and sum up the areas underneath the curves, we have:
As you can see, about half of the area is positive and half is negative, so the two halves cancel each other out. However, if we multiply two signals together that share a frequency (say A × A), we'll get:
This tells us that our input signal has significant energy at the frequency of our test (generated) signal. If we extend this idea, sweeping the frequency from −∞ to ∞, we will end up with spikes (or Dirac δ functions) where our signals share frequencies and zero energy elsewhere. This is the basic idea behind Fourier transforms.
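This cancellation is easy to check numerically. In the sketch below (the sample rate and frame length are arbitrary choices), multiplying two sinusoids point-wise and summing the products gives approximately zero for different frequencies and a large value for a shared frequency:

```scala
object CorrelationDemo {
  val sr = 1000                    // sample rate in Hz (arbitrary)
  val n  = 1000                    // one second of samples

  def sine(freq: Double): Seq[Double] =
    (0 until n).map(i => math.sin(2 * math.Pi * freq * i / sr))

  // Multiply two signals point-wise and sum the areas under the curve.
  def correlate(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  val a = sine(110)                // input signal A
  val b = sine(220)                // generated test signal B
  val different = correlate(a, b)  // ~0: positive and negative areas cancel
  val same      = correlate(a, a)  // large: energy at a shared frequency
}
```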
Since we are often working with small frames of sample data, we can't actually test all frequencies from −∞ to ∞. The discrete Fourier transform (DFT) is a modification of the Fourier transform that works with discrete sampled data. Our equations from above become:
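In their standard form, the DFT and its inverse over N samples are:

```latex
F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-2\pi i k n / N}
\qquad\qquad
f(n) = \frac{1}{N} \sum_{k=0}^{N-1} F(k)\, e^{2\pi i k n / N}
```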
The DFT will test evenly spaced frequencies from 0 Hz up to the sampling frequency (S_{r}). For example, if we have a signal sampled at 44.1 kHz and a frame of N = 1024 samples, the transform tests only N evenly spaced frequencies, with a bin spacing of S_{r}/N ≈ 43 Hz. The time complexity of the standard DFT is O(n^{2}).
In 1965, J.W. Cooley and John Tukey came up with a divide-and-conquer algorithm for calculating a Fast Fourier Transform (FFT) in O(n log n) time. To this day, this is the most widely used algorithm for Fourier transforms.
The Cooley-Tukey algorithm is often referred to as a radix-2 decimation-in-time (DIT) algorithm. The algorithm works by recursively splitting the data into smaller frames of size N/2 until you are calculating the FFT of a single value, which is the value itself. For this reason, the length of the input frames must be a power of 2. On each iteration, the data is split by even and odd indexes. This interleaving split is where the term "radix-2" comes from. The term "decimation-in-time" comes from the fact that we are splitting indexes that correspond to time.
Another important property of the DFT is that the even- and odd-indexed sub-transforms are periodic with period N/2, so the outputs for N/2 ≤ k < N can be derived from the same values computed for 0 ≤ k < N/2. Taking this into account, the DFT above can be split into the following:
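Separating even- and odd-indexed samples gives:

```latex
F(k) = \sum_{m=0}^{N/2-1} f(2m)\, e^{-2\pi i k (2m) / N}
     + e^{-2\pi i k / N} \sum_{m=0}^{N/2-1} f(2m+1)\, e^{-2\pi i k (2m) / N}
```

Writing E(k) and O(k) for the two half-size DFTs and W_{N}^{k} = e^{-2πik/N}, the periodicity property yields:

```latex
F(k) = E(k) + W_N^k\, O(k)
\qquad\qquad
F\!\left(k + \tfrac{N}{2}\right) = E(k) - W_N^k\, O(k)
```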
Note that for the odd terms, we were able to pull the phase factor e^{-2πik/N} out of the sum. This term is often referred to as the twiddle factor.
There are many implementations of the FFT in different languages. The fastest and most widely used is FFTW, which is based on highly optimized C code. One interesting thing about FFTW is that the C code is actually generated by an OCaml program called 'genfft'. In the next post, I'll explore some implementations of the FFT in Scala.