# Digging into MusiCNN, pt. 9

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the rest of CNN_1 does to these 561 transformations of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. Part 8 looks at the final portion of the first stage, which reduces all the time-series created so far into one 561-dimensional time series – which is the input into the second stage. We are here:

Before we proceed deeper into the network, let’s make absolutely clear how to interpret the output of the first stage, i.e., the 561 time series, each of length 187, stacked into the 561×187 matrix $\mathbf{U}$. Here is what $\mathbf{U}$ looks like for one 3-s segment of GTZAN Disco Duck and MusiCNN trained on MSD:

Each colored band represents the set of units in CNN_1 using a kernel of a particular size, from the bottom to the top: 38×7, 67×7, 1×128, 1×64, 1×32. Each row (or time series) comes from a unit/neuron in CNN_1. Of these rows, 499 are unrelated to the input to MusiCNN, and effectively zero. The other 62 rows are related to the input, i.e., the normalized dB Mel spectrogram $\hat{\mathbf{M}}_{\textrm{dB}}$. Starting from the bottom, the first non-trivial row we encounter is produced from the output of unit 12. Let’s notate that row $[\mathbf{U}]_{12,*}$. Here’s what its values look like:

I make this a stem plot to emphasize that this is a discrete sequence with a time resolution of 16 ms. This length-187 sequence is a summary of the output of unit 12, notated $\mathbf{Y}_{12}$. This summary is created in the following way. First $\mathbf{Y}_{12}$ is normalized:

$[\hat{\mathbf{Y}}_{12}]_{m,n} = 7.43 [\mathbf{Y}_{12}]_{m,n} - 3.28$

In other words, we take the element of $\mathbf{Y}_{12}$ in row $m$ and column $n$, multiply it by 7.43 and then subtract 3.28. It is important to note that this normalization does not flip the polarity of the output of the 12th unit. It just amplifies the output and then shifts the result in the negative direction. (Big thanks to Carl Thomé for helping me get at the parameters in the TensorFlow v1 model.) Second, $[\mathbf{U}]_{12,*}$ comes from marginalizing $\hat{\mathbf{Y}}_{12}$ over its rows/bands by selecting the maximum at each time step $n$:

$[\mathbf{U}]_{12,n} = \max_m [\hat{\mathbf{Y}}_{12}]_{m,n}$
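The two steps above can be sketched in NumPy. This is only an illustration, not the MusiCNN code: the function name `summarize_unit` and the random stand-in for $\mathbf{Y}_{12}$ are assumptions, and only the gain 7.43, the offset 3.28 and the 59×187 shape come from the text.

```python
import numpy as np

def summarize_unit(Y, gain=7.43, offset=3.28):
    """Affine normalization, then the maximum over rows (bands) per time step."""
    Y_hat = gain * Y - offset   # element-wise: 7.43 * Y - 3.28
    return Y_hat.max(axis=0)    # one value for each column n

rng = np.random.default_rng(0)
# Hypothetical stand-in for Y_12: non-negative, as after a rectifier.
Y12 = np.maximum(rng.normal(size=(59, 187)), 0.0)
u12 = summarize_unit(Y12)
print(u12.shape)
```

Because the gain is positive, normalizing and then taking the maximum gives the same result as taking the maximum first and normalizing afterwards. And since the unit's output is never negative, no value in this row can fall below -3.28.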

Let’s now look at what is happening inside unit 12. It is convolving the normalized dB Mel spectrogram $\hat{\mathbf{M}}_{\textrm{dB}}$ with a kernel of size 38×7. Equivalently, it is computing the correlation between the below 38×7 sequence and each possible patch of size 38×7 extracted from $\hat{\mathbf{M}}_{\textrm{dB}}$:

We saw this in part 5. It is the (flipped and reversed) impulse response of the filter with the 7th largest Frobenius norm of the 22 non-trivial filters having kernels of size 38×7. Dark is high (0.38) and light is low (-0.2). Unit 12 is thus moving this around $\hat{\mathbf{M}}_{\textrm{dB}}$ with strides of 1 in both the time and band directions, multiplying the coincident values, and adding them together to create the new output, which has dimension 59×187. The result of this is shown below overlaid on the dB Mel spectrogram. Finally, $\mathbf{Y}_{12}$ is created from this by adding a bias (-0.415) to each element, and then setting all negative values to zero. The resulting values, i.e., $[\mathbf{U}]_{12,*}$, are shown overlaid.
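Here is a rough NumPy sketch of what unit 12 computes, under stated assumptions. The input and kernel are random stand-ins, not the real spectrogram or learned weights. The 96 bands are inferred from 59 + 38 - 1. Zero padding along time is assumed because the output keeps all 187 columns. The name `unit_output` is made up for illustration.

```python
import numpy as np

def unit_output(M_hat, kernel, bias):
    """Cross-correlate M_hat with kernel at stride 1, add bias, rectify.

    Assumes no padding along bands and zero padding along time, so the
    number of columns is preserved.
    """
    kb, kt = kernel.shape
    left = kt // 2
    M_pad = np.pad(M_hat, ((0, 0), (left, kt - 1 - left)))
    # windows[i, j] is the kb x kt patch whose top-left corner is (i, j)
    windows = np.lib.stride_tricks.sliding_window_view(M_pad, (kb, kt))
    corr = np.einsum("ijkl,kl->ij", windows, kernel)  # multiply and sum
    return np.maximum(corr + bias, 0.0)               # bias, then ReLU

rng = np.random.default_rng(0)
M_hat = rng.normal(size=(96, 187))          # stand-in normalized dB Mel spectrogram
kernel = 0.1 * rng.normal(size=(38, 7))     # stand-in for the learned 38x7 kernel
Y12 = unit_output(M_hat, kernel, bias=-0.415)
print(Y12.shape)
```

Each entry of the result is the rectified, biased correlation between the kernel and one 38×7 patch of the input, which is why the band dimension shrinks from 96 to 59.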

So that leads us to unlock the meaning of our sequence above (and the rest of the non-trivial rows of $\mathbf{U}$) for this 3-s segment of GTZAN Disco Duck. The value of $[\mathbf{U}]_{12,*}$ at each time indicates the maximum correlation of the kernel above over all sets of 38 contiguous bands of the dB Mel spectrogram of the input at that time. If that value is large, then the correlation is large: the dB Mel spectrogram in that region looks a lot like the kernel. If it is small, then the correlation is small, or possibly negative. This ambiguity arises from the bias applied by the unit, followed by the rectified linear activation of the unit. If the amplitude coefficient in the normalization step had been negative, then this would be reversed: the largest values would show how anti-correlated the kernel is with patches of the dB Mel spectrogram.
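The ambiguity for small values can be seen with a toy example. The correlation values below are invented for illustration. Only the bias (-0.415), the gain (7.43) and the offset (3.28) come from the text, and `u_value` is a hypothetical helper.

```python
import numpy as np

def u_value(corr_column, bias=-0.415, gain=7.43, offset=3.28):
    """Bias and rectify one column of correlations, take the max, normalize."""
    y = np.maximum(corr_column + bias, 0.0)
    return gain * y.max() - offset

weak = np.array([0.10, 0.30, 0.40])     # small positive correlations
anti = np.array([-5.0, -2.0, -0.5])     # negative correlations
strong = np.array([0.10, 0.30, 2.00])   # one large correlation

print(u_value(weak), u_value(anti), u_value(strong))
```

In this sketch, the weakly correlated and the anti-correlated columns both land on the floor of -3.28, because every correlation below 0.415 is rectified to zero. Only the column with a large correlation rises above it.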

So 62 rows of the 561 in the matrix $\mathbf{U}$ show the maximum (anti-)correlation of the dB Mel spectrogram $\mathbf{M}_{\textrm{dB}}$ to a specific pattern in time (1×128, 1×64, 1×32) or time-frequency (38×7, 67×7). MusiCNN trained on MSD has learned to detect these specific 62 patterns by minimizing the difference between its output computed from an input 3-s audio signal in MSD and the “ground truth” attached to that audio. Every possible output of MusiCNN trained on MSD comes from these 62 sequences in $\mathbf{U}$ in some way.
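One way to pick out the non-trivial rows of a matrix like $\mathbf{U}$ is sketched below, using a synthetic stand-in. The tolerance `tol`, the helper `nontrivial_rows`, and the test of "effectively zero" by maximum magnitude are all assumptions, not the procedure used in this series.

```python
import numpy as np

def nontrivial_rows(U, tol=1e-6):
    """Indices of rows whose largest magnitude exceeds tol."""
    return np.flatnonzero(np.abs(U).max(axis=1) > tol)

rng = np.random.default_rng(0)
# Synthetic stand-in for U: 561 rows of length 187, only 62 of them active.
U = np.zeros((561, 187))
active = rng.choice(561, size=62, replace=False)
U[active] = rng.normal(size=(62, 187))

rows = nontrivial_rows(U)
print(len(rows))
```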

Let’s dig into CNN_2 to start to see how.