One of my groups of undergraduate students here is creating a very interesting application involving vocal training for not-necessarily good singers. Pitch is of course of direct importance for singing (and for enjoying listening to said singing), but so is the quality of the tone. I suggested to them that, with the tools they already have at hand (Max/MSP), they look at making their system discriminate between good and bad tones. To make the problem more concrete, I suggested they first recognize which vowel is being sung. I was thinking that the easiest way to do this (without invoking any signal processing that they do not yet know) would be to use Max to find the first \(n\) peak frequencies of windowed segments, together with their powers relative to the fundamental, to try to capture the shape of the formant, and then use nearest neighbor classification or codebook vector quantization. This would not be very gender neutral of course, but might give adequate results.
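To make the feature idea concrete, here is a rough Python/NumPy sketch of that extraction step (not my MATLAB code, and not what Max would do internally): window a segment, take a zero-padded DFT, locate the fundamental naively as the strongest low-frequency peak, and record each partial's frequency and power relative to it. The band-search peak picking and the 1 kHz fundamental cutoff are illustrative assumptions.

```python
import numpy as np

def partial_features(segment, fs, n_partials=7, fft_len=8192):
    """Sketch: frequencies and relative powers of the first few partials.

    Assumes a roughly harmonic signal; the fundamental is taken naively
    as the strongest spectral peak below 1 kHz (an illustrative choice).
    """
    windowed = segment * np.hamming(len(segment))      # Hamming window
    spectrum = np.abs(np.fft.rfft(windowed, fft_len))  # zero-padded DFT
    freqs = np.fft.rfftfreq(fft_len, 1.0 / fs)

    # Fundamental: strongest peak below ~1 kHz.
    f0_bin = np.argmax(spectrum * (freqs < 1000))
    f0 = freqs[f0_bin]
    f0_power = spectrum[f0_bin] ** 2

    feats = []
    for k in range(1, n_partials + 1):
        # Search a half-harmonic-wide band around the k-th harmonic.
        lo = np.searchsorted(freqs, (k - 0.5) * f0)
        hi = np.searchsorted(freqs, (k + 0.5) * f0)
        peak = lo + np.argmax(spectrum[lo:hi])
        feats.append(freqs[peak] / f0)                # frequency re fundamental
        feats.append(spectrum[peak] ** 2 / f0_power)  # power re fundamental
    return np.array(feats)  # 2 * n_partials elements (14 for 7 partials)
```

With seven partials this yields the fourteen-element feature vector used below.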
I used MATLAB to see what kind of performance I could achieve on their small database (three females, three males, five different vowels each). Using signal segments 100 ms in duration (with a Hamming window and some zero padding in the discrete Fourier transform), and taking the parameters of the first seven partials, I achieved the following mean classification performance for their five vowel types:
I generated this “confusion table” from nearly 4,000 independent random trials, using a single nearest neighbor classifier with respect to the Euclidean distance between each pair of fourteen-element feature vectors. I kept the training data and testing data separate. What we see here is that when the classifier is presented with a labeled ‘a’ sound, 41% of the time it is classified correctly, 23% of the time it is classified as ‘o’, and 27% of the time it is classified as a ‘re’ sound. The best performance is seen with ‘i’ sounds, which are labeled correctly 60% of the time. In other words, this classifier is not very good! Clearly these features are not capturing the formant shapes.
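For reference, the classification and scoring procedure amounts to something like the following Python/NumPy sketch (again, not my MATLAB code): for each test vector, find the training vector at minimum Euclidean distance, take its label, and tally the results into a row-normalized confusion table.

```python
import numpy as np

def nn_confusion(train_X, train_y, test_X, test_y, n_classes):
    """Sketch: single-nearest-neighbor classification (Euclidean distance)
    and the resulting confusion table.  Rows are true labels, columns are
    predicted labels, and entries are fractions of trials per true label."""
    conf = np.zeros((n_classes, n_classes))
    for x, y in zip(test_X, test_y):
        d = np.linalg.norm(train_X - x, axis=1)  # distance to every training vector
        pred = train_y[np.argmin(d)]             # label of the nearest neighbor
        conf[y, pred] += 1
    row_sums = conf.sum(axis=1, keepdims=True)
    return conf / np.where(row_sums == 0, 1, row_sums)
```

Keeping `train_X` and `test_X` disjoint, as in my trials, is what makes the table an honest estimate of generalization rather than memorization.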
Just to compare to an industry standard, I looked at using the first 13 Mel-frequency cepstral coefficients with the same classifier and testing approach. I obtained the following performance:
We can see that this classifier has the hardest time classifying ‘re’, with a low correct rate of 63%, but correctly labels ‘i’ 95% of the time! The other vowels have accuracies greater than 73%. Considering that the data are extremely noisy (people talking in the background, blowing into the microphone, and other strange paranormal disturbances), not to mention that to my own ears it is sometimes very difficult to determine which vowel is being sung, I would consider this performance quite acceptable and satisfying. I am just at a loss for how to effectively explain MFCCs (and formants and autoregressive processes and linear prediction, and so on) without the attendant math; and I do not think Max/MSP has an object that will spit out MFCCs.
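For readers who do want the math in executable form, here is a rough Python/NumPy sketch of the standard MFCC pipeline: window, power spectrum, triangular mel filterbank, log, then a DCT of the log filterbank energies. The filter count and FFT length are illustrative defaults, not the settings from my MATLAB experiment.

```python
import numpy as np

def mfcc(segment, fs, n_coeffs=13, n_filters=26, fft_len=1024):
    """Sketch of the standard MFCC pipeline:
    window -> power spectrum -> mel filterbank -> log -> DCT.
    Filter count and FFT length are illustrative choices."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    windowed = segment * np.hamming(len(segment))
    power = np.abs(np.fft.rfft(windowed, fft_len)) ** 2

    # Triangular filters centered at points evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((fft_len + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    log_energy = np.log(fbank @ power + 1e-10)

    # DCT-II of the log filterbank energies; keep the first n_coeffs.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energy
```

The intuition for the students: the mel filterbank smooths the spectrum on a perceptual frequency scale, so the low-order DCT coefficients summarize the spectral envelope (the formants) while discarding the fine harmonic structure tied to pitch.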
My MATLAB code is available upon request.