Digging into MusiCNN, pt. 12

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the bias and activations of CNN_1 do to these 561 filtered versions of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. Part 8 looks at the final portion of the first stage, which reduces all the time-series created so far into one 561-dimensional time series – which is the input into the second stage. Part 9 then puts this all together: for MusiCNN trained in MSD, the maximum (anti-)correlations of the dB Mel spectrogram with only 62 specific time and time-frequency structures are what lead to the weights the system assigns to 50 specific tags. Part 10 looks at the processing of the output of CNN_1 by a layer of convolutions, CNN_2, followed by a normalization procedure. And part 11 looks at the processing of the output of CNN_2 by a layer of convolutions, CNN_3, followed by a normalization procedure and a skip connection. Now we move on to the next part of the system. We are here:

CNN_4 is the last convolutional layer of MusiCNN. As with the previous two convolutional layers, it consists of 64 units, each having a kernel of size 64×7, a bias parameter, and the rectified linear activation function. Here are the Frobenius norms of all CNN_4 kernels for MusiCNN trained in MSD and MTT:

It appears all 64 kernels are non-trivial for MTT CNN_4, and all but one of 64 for MSD CNN_4. Here’s what the 63 MSD CNN_4 kernels look like, ordered left to right by decreasing Frobenius norm:
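For readers who want to reproduce this kind of check, here is a minimal sketch of computing the Frobenius norm of each kernel, flagging the trivial ones, and ordering the rest by decreasing norm. The `kernels` array is a random stand-in with the 64×64×7 shape described above, one kernel is zeroed by hand for illustration, and the tolerance `tol` is my assumption. None of it comes from the trained MusiCNN weights.

```python
import numpy as np

# Hypothetical stand-in for the CNN_4 weights: 64 kernels, each 64x7.
# Real weights would be loaded from a trained model instead.
rng = np.random.default_rng(0)
kernels = rng.normal(size=(64, 64, 7))
kernels[10] = 0.0  # zero one kernel by hand to mimic a trivial unit

# Frobenius norm of each kernel: square root of the sum of squared entries.
fro = np.sqrt((kernels ** 2).sum(axis=(1, 2)))

# Kernels with (near-)zero norm contribute nothing beyond their bias.
tol = 1e-6  # assumed threshold for calling a kernel "trivial"
nontrivial = np.flatnonzero(fro > tol)

# Indices of the non-trivial kernels, ordered by decreasing Frobenius norm.
order = nontrivial[np.argsort(fro[nontrivial])[::-1]]

print(len(nontrivial), "non-trivial kernels")
```

The indices in `order` are zero-based array positions, which may not match the unit numbering used in the figures.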

We see that the outputs of all MSD CNN_2 and MSD CNN_3 units are weighted by these kernels. None is being ignored at this stage. Let’s have a look at specific outputs of MSD CNN_4 units compared with the dB Mel spectrogram of the input, our excerpt of GTZAN Disco Duck. Units 44 and 48 look quite active.

Units 1 and 60 appear sparse:

Concatenating these 64 time-series creates the output matrix \mathbf{V}_3, which for our four 3-s segments of GTZAN Disco Duck in MSD MusiCNN appears:

Units 5 and 33 appear to take on consistently high values. Here’s what their time series look like, showing high means:

After CNN_4, each of the output time series is normalized, as done for the output of CNN_3. Here’s the distribution of parameters for this normalization layer of MSD MusiCNN:

Just as a reminder, these parameters appear as so:

[\hat{\mathbf{V}}_3]_{m,n} = \beta_m + \gamma_m \frac{[\mathbf{V}_3]_{m,n} - \mu_m}{\sqrt{\sigma_m^2+\epsilon}}

All \beta_m are pretty much zero.

Here’s what the normalized matrix \hat{\mathbf{V}}_3 looks like:

Curiously, the smallest value we see in \hat{\mathbf{V}}_3 for the segments of GTZAN Disco Duck is -17 and the largest is about 111, which is much larger than the maximum value of \mathbf{V}_3 (10). I would have expected the normalization to shrink the values.
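One possible reading, which I have not verified against the actual parameters: if \mu_m and \sigma_m^2 are fixed values stored in the layer rather than statistics of the current input, then each row is simply shifted and multiplied by the gain \gamma_m / \sqrt{\sigma_m^2+\epsilon}, and nothing forces that gain to be below one. Here is a small sketch of the formula above with made-up numbers (not the MSD MusiCNN parameters) in which a row with maximum 10 ends up with maximum 90.

```python
import numpy as np

def normalize_rows(V, beta, gamma, mu, var, eps):
    """Apply beta_m + gamma_m * (V[m, n] - mu_m) / sqrt(var_m + eps) to each row m."""
    gain = gamma / np.sqrt(var + eps)  # one gain per row
    return beta[:, None] + gain[:, None] * (V - mu[:, None])

# Made-up inputs and parameters, chosen only for illustration.
V = np.array([[0.0, 5.0, 10.0],
              [1.0, 1.0, 1.0]])
beta = np.zeros(2)             # all beta_m zero, as observed above
gamma = np.array([2.0, 1.0])
mu = np.array([1.0, 1.0])
var = np.array([0.04, 1.0])    # a small stored variance gives a large gain

# eps is set to 0 here to keep the arithmetic easy to follow.
V_hat = normalize_rows(V, beta, gamma, mu, var, eps=0.0)
print(V_hat)
```

The first row has gain 2 / 0.2 = 10, so the input value 10 maps to 10 × (10 − 1) = 90, and the value 0 maps to −10. Whether this is what happens in MSD MusiCNN depends on the actual \gamma_m and \sigma_m^2 of each row.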

Overlaying \hat{\mathbf{V}}_3 on the dB Mel spectrogram shows:

What does it mean? I don’t know. But at this point, MusiCNN will add \hat{\mathbf{V}}_1, \hat{\mathbf{V}}_2 and \hat{\mathbf{V}}_3. The three matrices are shown side by side below:

The color scale is set to [0,10]. We see \hat{\mathbf{V}}_3 (on the right) is much more active than the other two, and has higher values in general. Some units on both are producing sparse time series, such as units 4, 5, 52 and 53. Let’s look at how the energies in each row of these matrices compare:

The energies of the time series (l2-norm squared of each row) are computed with reference to the highest energy of a time series from CNN_2, i.e., of \hat{\mathbf{V}}_1. Row 34 of \hat{\mathbf{V}}_2 appears to be active while inactive in the other two matrices.
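The energy comparison just described can be sketched as follows. The three matrices here are random 64×187 stand-ins, not MusiCNN outputs, and the names `row_energies`, `reference` and `relative` are mine.

```python
import numpy as np

def row_energies(M):
    """Squared l2-norm of each row, i.e. one energy per time series."""
    return np.sum(M ** 2, axis=1)

# Random stand-ins with the shapes used in this post (64 units, 187 frames).
rng = np.random.default_rng(1)
V1_hat = rng.normal(size=(64, 187))
V2_hat = rng.normal(size=(64, 187))
V3_hat = 3.0 * rng.normal(size=(64, 187))

# Reference: the highest row energy in V1_hat (the normalized CNN_2 output).
reference = row_energies(V1_hat).max()

# Energies of every row of every matrix, relative to that single reference.
relative = {
    name: row_energies(M) / reference
    for name, M in [("V1_hat", V1_hat), ("V2_hat", V2_hat), ("V3_hat", V3_hat)]
}

for name, e in relative.items():
    print(name, "largest relative energy:", round(float(e.max()), 2))
```

Using one common reference keeps the three matrices on the same scale, so a row that is active in one matrix but not the others stands out.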

The addition of these three matrices produces \hat{\mathbf{V}}_1+\hat{\mathbf{V}}_2+\hat{\mathbf{V}}_3, which is shown below:

where the color map is -10 to 10. The maximum value we see is almost 125 and the minimum is -17. So this is not a sparse matrix. If we sort all the values in each row, and then subtract the minimum value, and then divide by the maximum value of the result, we get the following image:

We see that, of all rows in this sum, only rows 5, 33 and 59 show a lot of activity, while rows 3, 29 and 53 show very little. I don’t know what this means yet.
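The sort-shift-scale procedure used for that image can be sketched as below. The description leaves open whether "the maximum value of the result" is taken per row or over the whole matrix, so the `per_row` flag is my assumption and both readings are available. The function name and the toy matrix are also mine.

```python
import numpy as np

def sorted_and_scaled(M, per_row=True):
    """Sort each row, subtract its minimum, then divide by a maximum."""
    S = np.sort(M, axis=1)        # ascending values within each row
    S = S - S[:, :1]              # row minimum is the first sorted value
    # Maximum of the shifted values: one per row, or one for the whole matrix.
    peak = S[:, -1:] if per_row else S.max()
    # Constant rows have peak 0; divide by 1 so they stay at zero.
    peak = np.where(peak > 0, peak, 1.0)
    return S / peak

# Toy matrix: one varied row, one narrow row, one constant row.
M = np.array([[3.0, -1.0, 1.0],
              [0.0, 1.0, 2.0],
              [2.0, 2.0, 2.0]])

print(sorted_and_scaled(M, per_row=True))
print(sorted_and_scaled(M, per_row=False))
```

With per-row scaling every non-constant row spans 0 to 1, so only the shape of the sorted curve differs between rows. With global scaling, rows with a small range stay close to zero.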

We now reach the output of the second stage of MusiCNN: the concatenation of \hat{\mathbf{V}}_1, \hat{\mathbf{V}}_1+\hat{\mathbf{V}}_2, \hat{\mathbf{V}}_1+\hat{\mathbf{V}}_2+\hat{\mathbf{V}}_3 and finally \mathbf{U}. This produces the matrix \mathbf{W}, which is 753×187. That is, \mathbf{W} consists of 753 time series (the 561 rows of \mathbf{U} plus 3×64 rows from CNN_2, CNN_3 and CNN_4) computed from the dB Mel Spectrogram of a 3-s segment of audio.
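As a quick check of the dimensions, here is a sketch of the concatenation using random stand-ins of the right shapes. I assume \mathbf{U} is the 561-row output of the first stage, and I stack the blocks in the order they are listed above. The row order in the actual implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames = 187  # frames in one 3-s segment, per the size of W given above

# Random stand-ins: U has 561 rows, each normalized CNN output has 64.
U = rng.normal(size=(561, n_frames))
V1_hat = rng.normal(size=(64, n_frames))
V2_hat = rng.normal(size=(64, n_frames))
V3_hat = rng.normal(size=(64, n_frames))

# Stack the three cumulative sums and U along the row (unit) axis.
W = np.concatenate(
    [V1_hat, V1_hat + V2_hat, V1_hat + V2_hat + V3_hat, U],
    axis=0,
)

print(W.shape)  # (753, 187)
```

The row count works out as 64 + 64 + 64 + 561 = 753.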

It is time to analyze the final stage of MusiCNN and hopefully unlock the significance of each of these time series!
