Digging into MusiCNN, pt. 12

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the bias and activations of CNN_1 do to these 561 filtered versions of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. Part 8 looks at the final portion of the first stage, which reduces all the time-series created so far into one 561-dimensional time series – which is the input into the second stage. Part 9 then puts this all together: for MusiCNN trained in MSD, the maximum (anti-)correlations of the dB Mel spectrogram with only 62 specific time and time-frequency structures is what leads to the weights the system assigns to 50 specific tags. Part 10 looks at the processing of the output of CNN_1 by a layer of convolutions, CNN_2, followed by a normalization procedure. And part 11 looks at the processing of the output of CNN_2 by a layer of convolutions, CNN_3, followed by a normalization procedure and a skip connection. Now we move on to the next part of the system. We are here:

CNN_4 is the last convolutional layer of MusiCNN. As for the previous two convolutional layers, it consists of 64 units, each having a kernel of size 64×7, a bias parameter, and the rectified linear activation function. Here are the Frobenius norms of all CNN_4 kernels for MusiCNN trained in MSD and MTT:

It appears all 64 kernels are non-trivial for MTT CNN_4, and all but one of 64 for MSD CNN_4. Here’s what the 63 MSD CNN_4 kernels look like, ordered left to right by decreasing Frobenius norm:

We see that the outputs of all MSD CNN_2 and MSD CNN_3 units are weighted by these kernels. None are being ignored at this stage. Let’s have a look at specific outputs of MSD CNN_4 units compared with the dB Mel spectrogram of the input, our excerpt of GTZAN Disco Duck. Units 44 and 48 look quite active.

Units 1 and 60 appear sparse:

Concatenating these 64 time-series creates the output matrix \mathbf{V}_3, which for our four 3-s segments of GTZAN Disco Duck in MSD MusiCNN appears as follows:

Units 5 and 33 appear to take on consistently high values. Here’s what these time-series look like, showing high means:

After CNN_4, each of the output time series is normalized, as done for the output of CNN_3. Here’s the distribution of parameters for this normalization layer of MSD MusiCNN:

Just as a reminder, these parameters appear as so:

[\hat{\mathbf{V}}_3]_{m,n} = \beta_m + \gamma_m \frac{[\mathbf{V}_3]_{m,n} - \mu_m}{\sqrt{\sigma_m^2+\epsilon}}

All \beta_m are pretty much zero.

Here’s what the normalized matrix \hat{\mathbf{V}}_3 looks like:

Curiously, the smallest value we see in \hat{\mathbf{V}}_3 for the segments of GTZAN Disco Duck is -17 and the largest is about 111, which is surprisingly large compared with the maximum value of \mathbf{V}_3 (10). I would have thought the normalization would shrink the values.

Overlaying \hat{\mathbf{V}}_3 on the dB Mel spectrogram shows:

What does it mean? I don’t know. But at this point, MusiCNN will add \hat{\mathbf{V}}_1, \hat{\mathbf{V}}_2 and \hat{\mathbf{V}}_3. The three matrices are shown side by side below:

The color scale is set to [0,10]. We see \hat{\mathbf{V}}_3 (on the right) is much more active than the other two, and has higher values in general. Some units in both are producing sparse time series, such as units 4, 5, 52 and 53. Let’s look at how the energies in each row of these matrices compare:

The energies of the time series (l2-norm squared of each row) are computed with reference to the highest energy of a time series from CNN_2, i.e., of \hat{\mathbf{V}}_1. Row 34 of \hat{\mathbf{V}}_2 appears to be active while inactive in the other two matrices.
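
For reference, here is roughly how those per-row energies can be computed; a minimal numpy sketch, with random matrices standing in for \hat{\mathbf{V}}_1, \hat{\mathbf{V}}_2 and \hat{\mathbf{V}}_3:

import numpy as np

# Stand-ins for the three 64x187 matrices (random values for illustration).
V1h, V2h, V3h = (np.random.randn(64, 187) for _ in range(3))

def row_energies_db(M, ref):
    """l2-norm squared of each row, in dB relative to a reference energy."""
    energy = np.sum(M**2, axis=1)
    return 10 * np.log10(energy / ref + 1e-12)

ref = np.max(np.sum(V1h**2, axis=1))   # highest row energy in V1-hat
for M in (V1h, V2h, V3h):
    print(row_energies_db(M, ref))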

The addition of these three matrices produces \hat{\mathbf{V}}_1+\hat{\mathbf{V}}_2+\hat{\mathbf{V}}_3, which is shown below:

where the color map is -10 to 10. The maximum value we see is almost 125 and the minimum is -17. So this is not a sparse matrix. If we sort the values in each row, subtract the row minimum, and then divide by the maximum of the result, we get the following image:

We see that of all the rows in this sum, only 5, 33 and 59 show a lot of activity, while 3, 29 and 53 show very little. I don’t know what this means yet.
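
For reference, the row-wise rescaling used to produce that image can be sketched like this (numpy, with a random matrix standing in for the sum):

import numpy as np

# Stand-in for V1-hat + V2-hat + V3-hat (random values for illustration).
S = np.random.randn(64, 187)

rows = np.sort(S, axis=1)                                 # sort the values in each row
rows = rows - rows.min(axis=1, keepdims=True)             # subtract each row's minimum
rows = rows / (rows.max(axis=1, keepdims=True) + 1e-12)   # divide by the maximum of the result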

We now reach the output of the second stage of MusiCNN: the concatenation of \hat{\mathbf{V}}_1, \hat{\mathbf{V}}_1+\hat{\mathbf{V}}_2, \hat{\mathbf{V}}_1+\hat{\mathbf{V}}_2+\hat{\mathbf{V}}_3, and finally \mathbf{U}. This produces the matrix \mathbf{W}, which is 753×187. That is, \mathbf{W} consists of 753 time series computed from the dB Mel spectrogram of a 3-s segment of audio by the convolutional layers of MusiCNN.
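
In code, building \mathbf{W} might look like the following minimal sketch (random matrices stand in for the real ones, and the ordering follows the description above):

import numpy as np

# Stand-ins: U is the 561x187 output of the first stage; V1h, V2h, V3h are
# the 64x187 normalized outputs following CNN_2, CNN_3 and CNN_4.
U = np.random.randn(561, 187)
V1h, V2h, V3h = (np.random.randn(64, 187) for _ in range(3))

W = np.vstack([V1h, V1h + V2h, V1h + V2h + V3h, U])
assert W.shape == (753, 187)   # 64 + 64 + 64 + 561 rows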

It is time to analyze the final stage of MusiCNN and hopefully unlock the significance of each of these time series!

Digging into MusiCNN, pt. 11

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the bias and activations of CNN_1 do to these 561 filtered versions of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. Part 8 looks at the final portion of the first stage, which reduces all the time-series created so far into one 561-dimensional time series – which is the input into the second stage. Part 9 then puts this all together: for MusiCNN trained in MSD, the maximum (anti-)correlations of the dB Mel spectrogram with only 62 specific time and time-frequency structures is what leads to the weights the system assigns to 50 specific tags. Part 10 looks at the processing of the output of CNN_1 by a layer of convolutions, CNN_2, followed by a normalization procedure. We see for MusiCNN trained in MSD only 20 of the 64 units in CNN_2 have anything to do with the system input. Now we move on to the next part of the system. We are here:

CNN_3 is the third layer of convolutions of MusiCNN, which processes the normalized output of CNN_2, i.e., \hat{\mathbf{V}}_1, using 64 units. Each unit performs a convolution with a kernel of size 64×7. Let’s have a look at their Frobenius norms:

We see that for CNN_3 trained in either dataset, nearly all of the kernels are non-trivial. Only three units in MSD CNN_3 are trivial, and only one unit in MTT CNN_3 is trivial. All others are doing things with the 20 non-trivial rows of \hat{\mathbf{V}}_1. Here is what the 61 non-trivial kernels look like for MSD CNN_3 (the left-most one has the highest Frobenius norm):

It appears these kernels are also sparse, like those in MSD CNN_2. However, there are many small values in these rows. Here’s what the distributions of dB energies in each CNN_2 band look like across these 61 kernels (reference energy is 0.01):

The x-axis is dB energy. Each row is related to an output of MSD CNN_2, or equivalently a row of \hat{\mathbf{V}}_1. So row “0” is the distribution of energies of row 1 of the 61 non-trivial kernels of MSD CNN_3. We see a mode around -25 dB. For comparison, the trivial kernels of MSD CNN_2 had energies less than -200 dB. Now, for the 20 rows with energies around 0 dB, are they the same MSD CNN_2 units that have non-trivial kernels? These are MSD CNN_2 units 51, 22, 33, 53, 10, 12, 40, 42, 25, 50, 6, 47, 63, 13, 20, 18, 60, 17, 4, 31: Yes, they match exactly. The other 44 units of MSD CNN_2 are outputting values that are unrelated to the input of MusiCNN.

But we see here that there are some trivial outputs of MSD CNN_2 that are multiplied by values in MSD CNN_3 kernels that are not trivial, such as 15, 19, 21 and 57. What role do those play? If we force them to zero, will MusiCNN trained in MSD behave the same? After I have finished this autopsy I might try such interventions.

Let’s remind ourselves what has happened up to the point of CNN_3 for MusiCNN trained in MSD processing our excerpt of GTZAN Disco Duck. Here are the four contiguous 3-s input dB Mel spectrograms, each notated 0.1\mathbf{M}_\textrm{dB}:

After processing by the first stage of MusiCNN, we arrive at the input of CNN_2, which is a matrix denoted \mathbf{U}. Below we see the concatenation of the four matrices resulting from the four dB Mel spectrograms seen above:

Of the 561 rows, 20 have non-trivial elements. Now we arrive at the input of CNN_3, which is a matrix denoted \hat{\mathbf{V}}_1. Below we see the concatenation of the four matrices resulting from the four dB Mel spectrograms:

Of the 64 rows, 61 have non-trivial elements. Each of these matrices is an input to MSD CNN_3, which will output a new matrix \mathbf{V}_2 of size 64×187. Let’s look at some of the outputs of MSD CNN_3 units for our signal overlaid on the dB Mel spectrogram. This is after the convolution, addition of bias, and the rectified linear activation. Here’s the time series output by unit 51:

This looks like a pretty good beat tracker!

Some units output very sparse time series, such as unit 7, 28 and 39:

What do these units specialize in?

Unit 58 seems to be sensitive to the inflection point of dynamic harmonic partials. Could this be tracking speech-like characteristics?

Stacking these 64 time-series produces the matrix output of MSD CNN_3, denoted \mathbf{V}_2:

This appears to be a significantly different representation than what is produced by MSD CNN_2:

I might even believe that the representation output by MSD CNN_3 is related to beats and rhythm. However, there seems to be a lot of redundancy. A look at the distribution of angles between pairs of kernels shows many are orthogonal, but there is a bit of a tail showing kernels having smaller angles between them.
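
Here is roughly how that distribution of angles can be computed; a sketch with a random array standing in for the 64 CNN_3 kernels:

import numpy as np

# Stand-in for the 64 CNN_3 kernels, each of size 64x7 (random for illustration).
kernels = np.random.randn(64, 64, 7)

flat = kernels.reshape(64, -1)                       # one row per kernel
unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
cosines = unit @ unit.T                              # pairwise cosine similarities
angles = np.degrees(np.arccos(np.clip(cosines, -1, 1)))
pairs = angles[np.triu_indices(64, k=1)]             # angles between distinct pairs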

Why would MSD CNN_3 learn to produce such redundancy?

The normalization of each row of \mathbf{V}_2 produces the matrix \hat{\mathbf{V}}_2, which pretty much appears as above, except values are magnified. After normalization, the maximum value is about 9, and the minimum is -1.3.

Here it is overlaid on the dB Mel spectrogram:

Now, we reach the next important step: \hat{\mathbf{V}}_1 and \hat{\mathbf{V}}_2 are added to form the input to the final convolutional layer of MusiCNN, CNN_4. Potentially we have a collision of information here. The 20 rows of \hat{\mathbf{V}}_1 that are not trivial will be merged with the 61 non-trivial rows of \hat{\mathbf{V}}_2. Here are the two matrices side by side with the same color scale [0,2]:

We can see several rows in \hat{\mathbf{V}}_2 will occupy the trivial rows of \hat{\mathbf{V}}_1, such as 56+1 and 61+1. But there are many collisions, such as rows 33+1 and 51+1 (indexing above starts at 0). To get a better idea of these collisions, here’s a graph comparing the energy in each row of each matrix:

The reference energy is the maximum energy seen in a row of \hat{\mathbf{V}}_1, occurring in row 3. The non-trivial rows of \hat{\mathbf{V}}_1 are those having energy above -12 dB. We can see non-trivial rows of \hat{\mathbf{V}}_2 “taking over” trivial rows of \hat{\mathbf{V}}_1, e.g., 10, 28–31 and 36–40; and non-trivial rows of \hat{\mathbf{V}}_1 taking over trivial rows of \hat{\mathbf{V}}_2, e.g., 19, 21, 41, 43 and 61.

The final matrices input to MSD CNN_4 for our excerpt of GTZAN Disco Duck appear as so:

Now it is time to analyze this final convolutional layer.

Digging into MusiCNN, pt. 10

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the bias and activations of CNN_1 do to these 561 filtered versions of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. Part 8 looks at the final portion of the first stage, which reduces all the time-series created so far into one 561-dimensional time series – which is the input into the second stage. Part 9 then puts this all together: for MusiCNN trained in MSD, the maximum (anti-)correlations of the dB Mel spectrogram with only 62 specific time and time-frequency structures is what leads to the weights the system assigns to 50 specific tags. We are here:

CNN_2 consists of 64 neurons, each performing a convolution of the 561×187 input \mathbf{U} with a kernel of size 561×7. That is, each kernel has a support over all units of CNN_1, but only 7 time steps. Each convolution just results in a one-dimensional time-series of length 187 (zero padding ensures this length). As in CNN_1, each unit of CNN_2 adds a specific bias, and then replaces all negative values with zeros. The concatenation of all the time series produced by the units of CNN_2 produces a matrix of size 64×187, denoted \mathbf{V}_1. In order to uncover the significance of each row of \mathbf{V}_1, we must inspect the kernels in the CNN_2 units to see how exactly they are combining the rows of \mathbf{U}.
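
As a minimal sketch of what one CNN_2 unit computes (illustrative numpy, not the actual TensorFlow code; the kernel and bias values here are random):

import numpy as np

U = np.random.randn(561, 187)       # stand-in for the first-stage output
kernel = np.random.randn(561, 7)    # one CNN_2 kernel: all 561 rows, 7 time steps
bias = -0.1                         # hypothetical bias value

# Zero-pad 3 frames on each side so the output keeps a length of 187.
U_pad = np.pad(U, ((0, 0), (3, 3)))

out = np.empty(187)
for n in range(187):
    out[n] = np.sum(U_pad[:, n:n + 7] * kernel)   # sliding inner product

v = np.maximum(out + bias, 0.0)     # add the bias, then the rectified linear activation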

Let’s first look at the Frobenius norms of the 64 kernels of CNN_2 trained on either MSD or MTT:

Like we saw in CNN_1, most of the units in CNN_2 for both MSD and MTT are “dead”. Only 20 (MSD) or 21 (MTT) of the 64 contribute anything downstream. Below we see what the kernels in those 20 units of CNN_2 for MSD look like, with color bands relating the values to the different sized kernels in CNN_1 (from the bottom to the top: 38×7, 67×7, 1×128, 1×64, 1×32). The units in CNN_2 are labeled on the x-axis. For comparison, I put below this the matrix \mathbf{U} from one 3-s segment of GTZAN Disco Duck output by the first stage of MusiCNN trained in MSD. (The coloring between the two do not correspond.)

We see that each of the 20 non-trivial kernels in CNN_2 of MusiCNN trained on MSD is very sparse (unsurprising given the sparseness of \mathbf{U}), and appears to combine information from most of the 62 non-trivial rows of \mathbf{U}. There appear to be some omissions, however. Of the 20 kernels, seven have weight in all 62 non-trivial rows of \mathbf{U}, and 13 exclude at least one row. Removing all rows in the MSD CNN_2 kernels multiplying the output of trivial units of CNN_1 produces the figure below (the left-most kernel has the highest Frobenius norm, and the right-most the least; unit numbers are labeled):

Color coding shows the MSD CNN_1 units using kernels of specific sizes, from the bottom to top: 38×7, 67×7, 1×128, 1×64, 1×32. Each kernel of MSD CNN_2 involves a complex combination of information from CNN_1, which makes its interpretability difficult. At least in CNN_1 the y-axis of the kernels is related to Mel bands. Here we only know the x-axis is time. The maximum magnitude Pearson correlation between these 20 kernels is 0.19, which means they are somewhat orthogonal (each separated by at least 79 degrees).

Let’s look more closely at the range of weights in these kernels. Unit 63 in MSD CNN_2 has a kernel that weights all 62 non-trivial rows of \mathbf{U}:

The 62 blue dots signify the non-trivial rows of \mathbf{U}, with color bands relating them to the kernel sizes of MSD CNN_1 units. Each vertical line segment represents the amplitude interval of the corresponding row of this CNN_2 kernel over its 7 columns.

The kernel in CNN_2 unit 53 has very small weights on four rows of \mathbf{U}:

Information in these four rows of \mathbf{U} has less impact on the output of CNN_2 unit 53 than all the other rows.

Let us now look at some of the output sequences from the convolutions in CNN_2 trained in MSD. Here’s what the convolution in CNN_2 unit 63 looks like for our four 3-s segments of GTZAN Disco Duck:

This is before bias is added and the rectified linear function is applied. It’s hard to tell what this is all about, but possibly we see dips when the underlying signal has upward sweeping frequency components?

Here’s the result of the convolution in unit 13:

This one seems to have a much more regular pulse, coinciding with onsets in the signal. Perhaps these will become more interpretable at the output of the CNN_2 units, i.e., after applying the bias and the rectified linear nonlinearity.

Here are the outputs of CNN_2 unit 63 and unit 13, sonified as an amplitude modulation on the underlying 12-second excerpt:

The output of CNN_2 unit 4 is very sparse for some reason:

The output of CNN_2 is a matrix created by concatenating the 64 length-187 time-series, which we denote by \mathbf{V}_1. Here’s what the MSD CNN_2 output \mathbf{V}_1 looks like for the four 3-s segments of our excerpt of GTZAN Disco Duck:

Forty-four of the rows of \mathbf{V}_1 are zeros. And when I say zero, I mean they are zero. The biases in these units are such that they pull down the output of their convolutions to below zero, which are then demolished by the rectified linear function. Here are the biases for the 64 units in MSD CNN_2. The orange dots are the 20 units that produce non-zero output, and the black dots are the 44 others.

When we concatenate just the 20 rows of \mathbf{V}_1 that are non-zero, this is the result:

A few of the outputs are quite sparse, and a few are very dense.

The next step after CNN_2 is a normalization step, where each row is modified as described in Part 7. Specifically, the scaling of element m,n of matrix \mathbf{V}_1 looks like the following:

[\hat{\mathbf{V}}_1]_{m,n} = \beta_m + \gamma_m \frac{[\mathbf{V}_1]_{m,n} - \mu_m}{\sqrt{\sigma_m^2+\epsilon}}

where \epsilon = 0.001 (which is the case for all normalization layers in MusiCNN) and \gamma_m, \beta_m, \mu_m and \sigma_m^2 are learned. Let’s have a look at their distributions for the normalization unit following MSD CNN_2:

The blue lines show the normalization parameter densities of the 20 non-trivial rows of \mathbf{V}_1. The orange lines are the normalization parameter densities of the 44 other rows. In the case of \mu_m, these parameters are effectively 0 (around 1e-36). And the variances \sigma_m^2 of these rows are effectively zero as well (about 240 dB below \epsilon). So that’s why there are no orange lines on those two plots. The trivial rows of \mathbf{V}_1 are all zeros, and so after normalization row m will take on the value of \beta_m. We see these values center on 0, but curiously are not all exactly zero. Nonetheless, the values in these rows are independent of whatever is input to MusiCNN trained in MSD.

At the output of the normalization layer then sits \hat{\mathbf{V}_1}. Let’s have a look:

The minimum value we see is -1.3 and the maximum is 3.9. The white color is zeros. Let’s plot the 20 active rows of \hat{\mathbf{V}_1} against the dB Mel spectrogram:

I have stretched the 20 rows to fit into the 96 rows of the dB Mel spectrogram. Keep in mind that the axis for \hat{\mathbf{V}_1} is not frequency, as it is for the dB Mel spectrogram, so this overlay can be misleading. But anyhow, I can’t tell what is what here. Disentangling the information after CNN_2 is going to take more thought.

In the next part, we will look at CNN_3 and the normalization that follows.

Digging into MusiCNN, pt. 9

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the rest of CNN_1 does to these 561 transformations of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. Part 8 looks at the final portion of the first stage, which reduces all the time-series created so far into one 561-dimensional time series – which is the input into the second stage. We are here:

Before we proceed deeper into the network, let’s make absolutely clear how to interpret the output of the first stage, i.e., the 561 time series, each of length 187, stacked into the 561×187 matrix \mathbf{U}. Here is what \mathbf{U} looks like for one 3-s segment of GTZAN Disco Duck and MusiCNN trained on MSD:

Each colored band represents the set of units in CNN_1 using a kernel of a particular size, from the bottom to the top: 38×7, 67×7, 1×128, 1×64, 1×32. Each row (or time series) comes from a unit/neuron in CNN_1. Of these rows, 499 are unrelated to the input to MusiCNN, and effectively zero. The other 62 rows are related to the input, i.e., the normalized dB Mel spectrogram \hat{\mathbf{M}}_{\textrm{dB}}. Starting from the bottom, the first non-trivial row we encounter is produced from the output of unit 12. Let’s notate that row [\mathbf{U}]_{12,*}. Here’s what its values look like:

I make this a stem plot to emphasize that this is a discrete sequence with a time resolution of 16 ms. This length-187 sequence is a summary of the output of unit 12, notated \mathbf{Y}_{12}. This summary is created in the following way. First \mathbf{Y}_{12} is normalized:

[\hat{\mathbf{Y}}_{12}]_{m,n} = 7.43 [\mathbf{Y}_{12}]_{m,n} - 3.28

In other words, we take the element of \mathbf{Y}_{12} in row m and column n, multiply it by 7.43 and then subtract 3.28. It is important to note that this normalization does not flip the polarity of the output of the 12th unit. It is just amplifying the output and then shifting the result in the negative direction. (Big thanks to Carl Thomé for helping me get at the parameters in the tensorflow v1 model.) Second, [\mathbf{U}]_{12,*} comes from marginalizing \hat{\mathbf{Y}}_{12} over its rows/bands by selecting the maximum at each time step n:

[\mathbf{U}]_{12,n} = \max_m [\hat{\mathbf{Y}}_{12}]_{m,n}

Let’s now look at what is happening inside unit 12. It is convolving the normalized dB Mel spectrogram \hat{\mathbf{M}}_{\textrm{dB}} with a kernel of size 38×7. Equivalently, it is computing the correlation between the below 38×7 sequence and each possible patch of size 38×7 extracted from \hat{\mathbf{M}}_{\textrm{dB}}:

We saw this in part 5. It is the (flipped and reversed) impulse response of the filter with the 7th largest Frobenius norm of the 22 non-trivial filters having kernels of size 38×7. Dark is high (0.38) and light is low (-0.2). Unit 12 is thus moving this around \hat{\mathbf{M}}_{\textrm{dB}} with strides of 1 in both the time and band direction, multiplying the coincident values, and adding them together to create the new output, which has the dimension 59×187. The result of this is shown below overlaid on the dB Mel spectrogram. Finally, from this \mathbf{Y}_{12} is created by adding a bias to each element (-0.415), and then setting all negative values to zero. The resulting values, i.e., [\mathbf{U}]_{12,*}, are shown overlaid.

So that leads us to unlock the meaning of our sequence above (and the rest of the non-trivial rows of \mathbf{U}) for this 3-s segment of GTZAN Disco Duck. The value of [\mathbf{U}]_{12,*} at each time indicates the maximum correlation of the kernel above with all contiguous 38 bands of the dB Mel spectrogram of the input at that time. If that value is large, then there is a large correlation between them. The dB Mel spectrogram in that region looks a lot like the kernel. If it is small then the correlation is small, or possibly negative. This ambiguity arises from the bias applied by the unit, followed by the rectified linear activation of the unit. If the amplitude coefficient in the normalization step had been negative, then this would be reversed: the largest values would show how anti-correlated the kernel is with patches of the dB Mel spectrogram.

So 62 rows of the 561 in the matrix \mathbf{U} show the maximum (anti-)correlation of the dB Mel spectrogram \mathbf{M}_{\textrm{dB}} to a specific pattern in time (1×128, 1×64, 1×32) or time-frequency (38×7, 67×7). MusiCNN trained in MSD has learned to detect these specific 62 patterns by minimizing the difference between its output computed from an input 3-s audio signal in MSD and the “ground truth” attached to that audio. Every possible output of MusiCNN trained in MSD comes from these 62 sequences in \mathbf{U} in some way.

Let’s dig into CNN_2 to start to see how.

Digging into MusiCNN, pt. 8

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the rest of CNN_1 does to these 561 transformations of the input normalized dB Mel spectrogram. Part 7 looks at the normalization layer just after CNN_1. In this part we look at the max pooling layer, and thus the reduction of each of the 561 matrices to a single row; and finally the concatenation to create the input to the second stage. We are here (at the Max block):

The normalized matrix from unit j in CNN_1, denoted \hat{\mathbf{Y}}_j, is now brutally reduced to a single row of length 187. Each column is summarized by the scalar having the maximum value. These rows are then concatenated to form the new matrix \mathbf{U}, which has a size of 561×187. Let’s see how these rows look.
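
Here is a minimal numpy sketch of this reduction, with random matrices standing in for the 561 normalized CNN_1 outputs (an illustration of the operation, not the model code):

import numpy as np

# Shapes of the normalized CNN_1 outputs, in the concatenation order used later:
# 204 of 59x187 (from the 38x7 kernels), 204 of 30x187 (67x7), 153 of 96x187 (1-D).
shapes = [(59, 187)] * 204 + [(30, 187)] * 204 + [(96, 187)] * 153
Y_hat = [np.random.randn(*s) for s in shapes]

rows = [Y.max(axis=0) for Y in Y_hat]   # keep only the maximum of each column
U = np.vstack(rows)                     # one length-187 row per CNN_1 unit
assert U.shape == (561, 187)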

For the CNN_1 units with length-128 impulse responses (IR) trained on MSD, below are juxtapositions of the input dB Mel spectrogram for the GTZAN Disco Duck (four segments of 3 seconds each), the normalized output matrix, and the resulting values of this max pooling to create a single row. Here’s unit 12 (the one with the largest-energy IR):

All of these values appear to be positive. The resulting row from unit 40 has a smaller range, but also includes negative values:

Here’s the output contributed by the remaining active CNN_1 unit having a length-128 IR, unit 38:

The range of this row is smaller still. Guess what the max-pooled results of the other 48 units look like? They are in the range [-3e-6, 3e-6] with very small variances (1e-25). Here’s what the final matrix \mathbf{U} looks like for this portion of CNN_1 trained on MSD:

It is going to be interesting to see what CNN_2 does with these very sparse matrices.

Let’s have a look at one of the CNN_1 units with length-64 IRs trained on MSD: unit 50, the one whose IR has the least non-negligible energy:

That seems to just be tracking the beginning and ending of the 3-s segments. It is surprisingly regular!

Here’s the resulting matrix from the length-64 units:

As before, it is very sparse: six of the 51 units contribute anything; 45 are just a bunch of values in [-3e-6,3e-6]. The matrix from the length-32 IR units is less sparse, but 24 units are contributing nothing.

For the CNN_1 units having a two-dimensional impulse response trained on MSD, here’s the result for 67×7 unit 194:

This unit seems like it is acting as a beat detector. Here are all the results from max pooling the outputs of all these units, showing only 14 of 204 are contributing anything:

And here is the resulting matrix for the CNN_1 units with size 38×7 IRs, showing 22 units of 204 contributing anything:

The concatenation of these five matrices follows the order 38×7, 67×7, 1×128, 1×64, 1×32. Here is the densified representation of GTZAN Disco Duck as viewed by the first stage of MusiCNN trained on MSD:

The outputs of this stage of MusiCNN trained on MTT for the same input is much more active, as expected.

One thing I wonder about is how highly correlated these 62 rows/sequences are going into the next stage. If many are then that would represent redundancy… which could come about from dropout. But these MusiCNN models only used dropout in the third stage.

For the four 3-s segments of the GTZAN Disco Duck input into MusiCNN trained on MSD, and the 62 non-zero sequences of the first stage, below is the distribution of Pearson correlations of each first-order differenced sequence observed between pairs of different units for the four segments. I look at the first-order differenced sequences in order to avoid spurious correlations arising from two series that may not be correlated but follow a long-term trend.

This is the cosine of the angle between the detrended sequences. We see that more than 90% of the first-order differenced sequences from these 62 units have magnitude correlations less than 0.3. The minimum and maximum correlations we see are -0.643 and 0.797. Looking at the correlations having magnitudes greater than 0.6 over these four segments shows the following pairs of units:

Red is positive correlation, blue is negative correlation. The sizes of the filters are demarcated, and the unit IDs in these blocks are printed. The sparsity of this is quite nice to see, actually; it shows that among the 62 units of the 561 contributing something to the next stage, only the outputs of a few units are somewhat correlated, but not exactly so.
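
Here is a minimal sketch of how each of those pairwise correlations is computed (random sequences stand in for two unit outputs):

import numpy as np

# Stand-ins for the length-187 sequences output by two first-stage units.
a = np.random.randn(187)
b = np.random.randn(187)

da, db = np.diff(a), np.diff(b)    # first-order differences, to remove long-term trends
r = np.corrcoef(da, db)[0, 1]      # Pearson correlation: the cosine of the angle
                                   # between the mean-removed differenced sequences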

Here are some of the first-order differenced outputs of units 23 and 28, both employing 1×32 kernels:

The first-order differenced outputs of units 84 and 163, both employing 67×7 kernels, are highly negatively correlated:

The first-order differenced outputs of units 121 (1×32) and 45 (38×7) are positively correlated:

We have now finished analyzing the first stage of MusiCNN! In this stage,

  1. MusiCNN transforms a 3-s audio segment into a dB Mel spectrogram, which is a multidimensional time series (96×187)
  2. MusiCNN normalizes the dB Mel spectrogram with parameters learned from training
  3. MusiCNN filters this normalized dB Mel spectrogram with 62 different filters (each tuned to some particular time and time-frequency characteristic learned from training), adds a bias to each output (learned from training), and then zeros out all negative elements. This creates 62 new multidimensional time series, each having length 187.
  4. MusiCNN normalizes these 62 multi-dimensional time series with parameters learned from training
  5. MusiCNN reduces each multi-dimensional time series to a one-dimensional time series by selecting the maximum value across dimensions at each time step
  6. MusiCNN intersperses these one-dimensional time series with 499 one-dimensional time series of effectively zeros, to create a new multidimensional time series (561×187).

It is time to start investigating the second stage of MusiCNN, which processes the output of the first stage.

Digging into MusiCNN, pt. 7

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from an audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Part 6 looks at what the rest of CNN_1 does to these 561 transformations of the input dB Mel spectrogram. Now it is time to understand what MusiCNN does with these transformations. We are here, at the end of the first stage of MusiCNN:

It is time to reduce 561 matrices to one that has dimension 561×187. Or more practically, to reduce (96 \times 187) \times 51 \times 3 + (59\times187 + 30\times187)\times 204 = 6,141,828 numbers to 561 \times 187 = 104,907 numbers.

MusiCNN takes each of the 561 matrices and normalizes them (using parameters that are learned). Then it collapses each matrix to a single row of length 187 by taking the maximum value of each column. Finally, MusiCNN concatenates these 561 rows into a matrix, thus creating the 561×187 matrix, let’s denote it by \mathbf{U}, that is input to the second stage. This will be the subject of the next part in this series, but first we need to understand the normalization layer.

In the normalization step, the elements of a matrix output by a CNN_1 unit are scaled using parameters that are learned. The scaling of element m,n of matrix \mathbf{Y} looks like the following:

[\hat{\mathbf{Y}}]_{m,n} = \beta + \gamma \frac{[\mathbf{Y}]_{m,n} - \mu}{\sqrt{\sigma^2+\epsilon}}

where \epsilon = 0.001 and \gamma, \beta, \mu and \sigma^2 are learned. In this layer, \mu and \sigma^2 are empirical estimates of the mean and variance of previously computed matrices from the training data by this unit. The update of these two parameters for a batch (the size of which for MusiCNN is 1) looks like the following

\mu \leftarrow \rho \mu + (1-\rho) \bar{\mu}

\sigma^2 \leftarrow \rho \sigma^2 + (1-\rho) \bar{\sigma^2}

where \rho = 0.99 is called the momentum, \bar{\mu} is the empirical mean of the batch and \bar{\sigma^2} is its empirical variance.
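
Here is a minimal sketch of both operations (this mirrors standard batch normalization at inference plus its running-statistics update during training; the variable names are mine):

import numpy as np

def normalize(Y, gamma, beta, mu, var, eps=0.001):
    """Inference-time normalization of a CNN_1 output matrix."""
    return beta + gamma * (Y - mu) / np.sqrt(var + eps)

def update_running_stats(Y, mu, var, rho=0.99):
    """Training-time update of the running mean and variance for one batch."""
    mu = rho * mu + (1 - rho) * Y.mean()
    var = rho * var + (1 - rho) * Y.var()
    return mu, var

Y = np.random.randn(96, 187)        # stand-in for one CNN_1 output matrix
Y_hat = normalize(Y, gamma=1.0, beta=0.0, mu=0.0, var=1.0)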

For the CNN_1 units trained in MSD having IRs with energy no larger than -36 dB, what are the distributions of the parameters of the normalization after CNN_1? I find \beta \sim \mathcal{N}(-1.53e-06,8.12e-10)… so very small, and pretty much zero. The other three parameters are shown below for the units with different sizes. \gamma is pretty much centered on zero. The variances are very, very small, being more than 100 dB below the factor \epsilon = 0.001… effectively zero.

All of this means that the normalization in this stage for the 499 units of CNN_1 trained in MSD operating independently of the input is something like

[\hat{\mathbf{Y}}]_{mn} \approx 31.6 \gamma ([\mathbf{Y}]_{mn} - \mu)

How does the parameter \mu relate to the bias of the CNN_1 unit?

Ha ha ha ha ha! Effectively, it seems for the 499 units that operate independently of the input, this normalization layer removes the positive biases added by the CNN_1 unit, or just keeps the zero output as zero. I don’t know why, but I find this funny.

The \gamma and \sigma^2 parameters are quite different for CNN_1 trained in MTT:

I find \beta \sim \mathcal{N}(-0.17,0.74). The scatter plot of the bias and \mu are also extremely different. In this case the normalization is not working against the bias in CNN_1.

Now we reach the next part: max pooling. In this case, each normalized matrix \hat{\mathbf{Y}} is reduced to a single row by selecting the maximum value in each column.

Digging into MusiCNN, pt. 6

In part 1, I review the MusiCNN system, and in part 2 I review the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from the audio signal. In part 4 I begin to look at the first layer (CNN_1) of MusiCNN – in particular the 153 one-dimensional convolutions and their impact on an input dB Mel spectrogram. In part 5 I inspect the 408 two-dimensional convolutions in CNN_1. Now it is time to look at what the rest of CNN_1 does to these 561 transformations of the input normalized dB Mel spectrogram. That is to say, what each unit of CNN_1 is doing to the normalized version of 0.1 \mathbf{M}_\textrm{dB}.

We are still here:

Unit j \in [1, \ldots, 561] is going to create a new matrix \mathbf{Y}_j from the normalized dB Mel spectrogram \hat{\mathbf{M}}_\textrm{dB} by first convolving it with impulse response \mathbf{h}_j, then adding a bias b_j to each element, and finally setting all negative values to zero. Element m, n of \mathbf{Y}_j is given by

[\mathbf{Y}_j]_{mn} = \textrm{ReLU} \left ( [\hat{\mathbf{M}}_\textrm{dB} \ast \mathbf{h}_j]_{mn} + b_j \right)

\mathbf{Y}_j is a matrix having 187 columns (each column corresponds to an analysis window), but different numbers of rows. If \mathbf{h}_j is a vector (as it is for the temporal filters in MusiCNN), then \mathbf{Y}_j has size 96×187. If \mathbf{h}_j is a matrix (as it is for the timbral filters in MusiCNN), then \mathbf{Y}_j has a size of either 30×187 or 59×187, for either the tall kernel (67×7) or the shorter one (38×7).
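
As a concrete sketch of one timbral unit (illustrative scipy/numpy, not the model code; the kernel and bias here are random), padding the time axis so the 187 columns are preserved but keeping the band axis “valid”:

import numpy as np
from scipy.signal import convolve2d

M_hat = np.random.randn(96, 187)    # stand-in for the normalized dB Mel spectrogram
h = np.random.randn(38, 7)          # one timbral kernel (38 bands x 7 frames)
b = -0.4                            # hypothetical bias

# Pad 3 frames on each side of the time axis, but not along the Mel bands.
M_pad = np.pad(M_hat, ((0, 0), (3, 3)))

Y = np.maximum(convolve2d(M_pad, h, mode="valid") + b, 0.0)
assert Y.shape == (59, 187)         # 96 - 38 + 1 bands, 187 analysis windows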

Getting the bias from each of these units is a simple matter of sending in a bunch of zeros and making the activation linear. Let us look at how b_j and \|\mathbf{h}_j\|_\textrm{F} (Frobenius norm) are related in CNN_1. Here are the results for the temporal filters of CNN_1 trained on MSD:

All of the filters with energies less than -50 dB are pretty much obliterating their input; but we see a range of biases, so these particular units will be outputting non-zero but constant matrices of size 96×187. The 26 units with filters having non-negligible energies have biases between -1 and almost 1.5.

Here’s the results for the timbral filters of CNN_1 trained on MSD:

We again see here that the units with filters obliterating the input also have a range of biases, and so these units will be outputting non-zero but constant matrices of size 30×187 or 59×187. Compared to the temporal units, the biases of the 36 units with filters having non-negligible energies have a wider spread, from -2 to nearly 6.

The biases for CNN_1 trained on MTT appear quite different:

The evidence is building that MusiCNN trained on MTT or MSD are quite different at the input. The latter involved at least an order of magnitude more data than the former, so why does this lead to fewer active input units?

Anyhow, we will now move towards looking at and listening to sonifications of some of the outputs of CNN_1 trained on MSD, but there’s an important detail here we must consider: MusiCNN takes the audio segment \mathbf{x}, computes the Mel spectrogram, computes the dB Mel spectrogram 0.1 \mathbf{M}_\textrm{dB} from that, then normalizes it to create \hat{\mathbf{M}}_\textrm{dB}, and finally passes that to CNN_1. This means the input to CNN_1 is sensitive to the attenuation of \mathbf{x}. In particular, if we attenuate the audio signal by a factor \alpha>0, then the Mel spectrogram values are attenuated by \alpha^2. This will significantly change the dB Mel spectrogram 0.1 \mathbf{M}_\textrm{dB}, and thus the normalized dB Mel spectrogram.

To develop an appreciation for this, here’s what the four inputs to CNN_1 look like for Disco Duck (demarcated by vertical red lines):

The range of values we see is [-1.7, 4], and the color scale is [-1, 5]. If we attenuate the audio signal by 0.1, here’s what the input looks like (using the same color scale):

Now the range of values is [-0.76, 6.65]. If we attenuate the audio signal by 0.01, here’s what the input looks like (using the same color scale):

The range of values now is [-0.3, 17.1].

This lack of invariance to the attenuation of the input audio signal is not surprising: each instance of MusiCNN is trained only on audio data taken directly from the mastered commercial music recording. This puts into question the reliability of MusiCNN for audio recording data collected from a microphone in some environment – where the distance to the sound source is attenuating the audio source – or even audio data that is not mastered in the same way as the commercial music it is trained on. This makes MusiCNN quite a strange type of music listener: imagine a human music listener whose description of the music playing on a stereo depends on how loud the stereo is. Let us see how MusiCNN tags Disco Duck at five different attenuations. Here are the top five tags across all four segments:

When \alpha = 1 MusiCNN is processing Disco Duck as it appears in the GTZAN dataset. When \alpha = 1.79, I am normalizing the audio data to have a maximum magnitude of 1. Apparently MusiCNN is more confident in “funk” for GTZAN Disco Duck, and “rock” for the normalized audio. As we attenuate the input audio data more, the variability in the top labels decreases. For \alpha = 0.01, MusiCNN applies “quiet” and “slow”. The first one may be appropriate, but not the second! Does quiet music tend to be slow? Also, “quiet” as in low volume or Morton Feldman? Finally, when the input audio is perfect silence, we see the top tags include “classical” and “piano”… which is somewhat appropriate for Cage’s 4’33”.

I’m not sure what should be done here, or what tags should be invariant to such attenuation. We might not want to normalize a 3-s segment of audio if in relation to the recording from which it is excerpted it should be quiet. Perhaps it is a matter of data augmentation for training MusiCNN with segments that are attenuated an appropriate amount. I don’t know, but this should be given more careful thought.

Anyhow, let’s plow forward and see some of the outputs of the units of CNN_1 trained on MSD. Here is the output of the unit with the length-128 IR having the largest energy, and its sonification, for the GTZAN Disco Duck (a.k.a. MSD CNN_1(1×128) unit 12):

The normalized dB Mel spectrogram is shown first, with a color scale in [0,10]; the unit output is then shown, where all elements that are zero are colored green, and the color scale is [0,1]. This is a very sparse activation, where the output values are in the range [0.0, 0.56].

Here’s the output of the unit with the length-128 IR having the second-highest energy (a.k.a. MSD CNN_1(1×128) unit 40):

Its values are in the range [0.0, 1.05].

Looking at the units with length-64 IRs, here’s the output of the one with the IR having the highest energy (a.k.a. MSD CNN_1(1×64) unit 31):

MSD CNN_1(1×64) unit 11 seems to be picking out the voices:

And here’s the output of the unit whose length-64 IR has the least non-negligible energy, a.k.a. MSD CNN_1(1×64) unit 50:

It seems to be looking at the portions of the normalized dB Mel spectrogram that are the least energetic.

The output of the length-32 IR unit with the most energy, a.k.a. MSD CNN_1(1×32) unit 17:

I don’t know what it is doing.

Also not knowing what it is doing: the length-32 IR unit with the 5th-most energetic IR, a.k.a. MSD CNN_1(1×32) unit 31:

Let’s look at the units with 38×7 IRs. Here’s the output of the 2nd-most energetic one, a.k.a. MSD CNN_1(38×7) unit 70:

This one is preserving increasing partials. The 3rd-most energetic one, a.k.a. MSD CNN_1(38×7) unit 59, is preserving decreasing partials:

Check out MSD CNN_1(38×7) unit 121:

MSD CNN_1(38×7) unit 125 sounds like a jaw harp rendition:

We can of course go on with examples, but it’s time to summarize and head towards the next step. For CNN_1 trained in MSD, 499 of the 561 units just seem to be outputting constant positive values independent of the input. The outputs of the other 62 units are dependent upon the input. That seems like a total waste of resources, but we will come back to that question soon.

In the next part we will move here:

Those 561 matrices, consisting of (96 \times 187) \times 51 \times 3 + (59\times187 + 30\times187)\times 204 = 6,141,828 elements, are decimated to one matrix of 561 \times 187 = 104,907 elements. How? What is the interpretation of this?

Digging into MusiCNN, pt. 5

In part 1, I review the MusiCNN system and present some curious results: while its choice of tags is not entirely odd for the given excerpt of music, its output seems very sensitive to irrelevant characteristics of the audio data. In part 2, I look at the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from the audio signal, and how we might sonify that information. In part 4, I begin to look at the first layer (CNN_1) of MusiCNN, in particular the 153 one-dimensional convolutions and their impact on an input Mel spectrogram. We see a big difference in these filters across training datasets MSD and MTT. In this part, I inspect the 408 two-dimensional convolutions in CNN_1.

Why am I doing this multipart series? First, I want to understand how MusiCNN is moving probability mass in its output in relation to characteristics of its input. Second, I want to see how that understanding might help one fine-tune its architecture, training, and evaluation – not to mention interpret the output of MusiCNN. Third, through this process I want to critically engage with “autotagging” research – what I see as an underspecified problem and very noisy datasets. Yet MusiCNN and other deep autotagging models somehow appear to find sensible information in all this noise! Also, this is in preparation for my presentation about MusiCNN later this week at the 2022 Joint meeting of the American Musicological Society, Society for Ethnomusicology, and the Society for Music Theory.

We are still here:

We are working our way to the 561 matrices having 187 columns. Of these, 408 arise from convolutions of the dB Mel spectrogram 0.1 \mathbf{M}_{\textrm{dB}} with 2-D kernels: a) 204 of size [38,7] and b) 204 of size [67,7]. In both cases, the time support of the kernels is 7 time steps, or 112 ms. The first kind of kernel has a frequency support of 38 Mel bands, and the second kind has nearly twice that at 67 Mel bands. What do these kernels look like? To what time-frequency domain structures are they sensitive? What does a dB Mel spectrogram filtered with each of these kernels look like? What do their sonifications sound like?

Obtaining the impulse responses of these filters entails passing a single impulse through them and measuring the output before the bias is added and activation performed in the unit. Let’s have a look at the energies (Frobenius norm) of the 2-D kernels for CNN_1 trained on MTT and MSD (all with reference to the same energy, 1).

As we saw for the 1-D kernels, the majority of the 2-D kernels are “dead”. Of the 204 67×7 kernels learned from MSD, only 14 have non-negligible energy. Only 22 of the 204 38×7 kernels learned from MSD have non-negligible energy. The significant 2-D kernels learned from MTT appear to be far greater in number than for MSD – which is the same case for the 1-D kernels. However, we see that their percentage is less than 42% of the total number of 2-D kernels in CNN_1.
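
For reference, the dB energies plotted here (and throughout this series) can be computed as squared Frobenius norms relative to a reference energy, as in this minimal sketch:

import numpy as np

def energy_db(h, ref=1.0):
    """Kernel energy in dB: squared Frobenius norm relative to a reference energy."""
    return 10 * np.log10(np.sum(h**2) / ref + 1e-30)

h = 0.01 * np.random.randn(67, 7)   # stand-in for one 67x7 kernel
print(energy_db(h))                 # a small kernel like this lands far below 0 dB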

Let’s look at the 14 significant 67×7 kernels of CNN_1 learned in MSD:

The most energetic is on the left, and the least (but still significant) is on the right. Dark is large positive value (0.2) and white is large negative value (-0.1). I show the flipped left-to-right and up-to-down impulse responses because they show the kinds of time-frequency structures they will amplify through the convolution operation. We can see the two most energetic 67×7 kernels have a single transient-like structure in their centers. The third impulse response shows a frequency component sweeping upward. Kernels 7 and 8 have horizontal bars, showing a sensitivity to stable frequency components across their time supports.

Let’s have a look at the 22 significant 38×7 kernels of CNN_1 learned in MSD:

These show similar structures. Kernels 1, 11, and 12 are sensitive to stable frequencies. Kernels 2, 3, 6, 9, and 11 are sensitive to single frequency sweeps. Kernels 8 and 13 seem to be sensitive to transient/vertical structures.

Time to pass our Disco Duck excerpt through these and see and hear the results. Here’s what happens to the dB Mel spectrogram sent through the 67×7 filter of MSD with the most energy:

These particular filters in CNN_1 are taking the “valid” portion of the convolution. That means each time and frequency lag of the kernel that involves points outside the input is disregarded. CNN_1 actually pads only the time axis (3 samples to the left and 3 to the right, which preserves the length of the input); the Mel bands are truncated. Hence, the output of each of these 204 filters is 30×187 for each 3-second input segment. I shift up the segment to reflect the center band of the kernel with a zero band lag and 30 band lag. As expected, this is creating vertical lines where there are onsets. Here’s a sonification of the result:

The third kernel has a frequency sweep in its highest 15 bands. So if there are such sweeps between Mel bands 50-96 then this should produce large values in the result. Here is the resulting filtered dB Mel spectrogram using that kernel:

We can see large values around 9.5-10.5 seconds where such structures appear in the high Mel bands. Here’s a sonification of the result, but moved down to the lowest 30 Mel bands:

Let’s look at some of the 38×7 filters now. The filters with the second and third most energetic impulse responses have opposite diagonal structures in low Mel bands. So I expect the second filter will pick out upward sweeping frequency components, and the other filter will pick out downward sweeping components. Here is the resulting convolution with the second filter:

And here’s the result with the third filter:

The 2-D filters learned from MTT have several of the same kinds of structures, but seem much more varied in terms of harmonic structures. Here are the 60 67×7 impulse responses sorted by decreasing energy:

And here is the same but for 38×7 filters:

Now it is time to proceed down the processing chain. What does MusiCNN do to each of the 561 matrices resulting from these convolutions?

Digging into MusiCNN, pt. 4

In part 1, I review the MusiCNN system and present some curious results: while its choice of tags is not entirely odd for the given excerpt of music, its output seems very sensitive to irrelevant characteristics of the audio data. In part 2, I look at the architecture of the “spectrogram”-based model trained on 3-second audio segments. In part 3, I look at how the input to MusiCNN is computed from the audio signal, and how we might sonify that information. In this part, I begin to look at the first layer of MusiCNN, which is a convolutional layer, i.e., CNN_1 in the below.

MusiCNN first normalizes the dB Mel spectrogram 0.1 \mathbf{M}_\textrm{dB} (using parameters that are learned). Then the fun begins. “CNN_1” applies 561 different (finite impulse response) filters to this matrix to create 561 new matrices. The supports of the filters are specified by the architecture: a) 204 filters of size [38,7], b) 204 filters of size [67,7], c) 51 filters of size [1,128], d) 51 filters of size [1,64], and e) 51 filters of size [1,32]. The first number describes the support of the filter in contiguous Mel bands, and the second in analysis windows. So a filter of size [1,128] spans only one Mel band, but 128 contiguous analysis windows (2.048 seconds of audio). And the filter of size [67,7] has a support of 67 Mel bands and 7 analysis windows (112 ms of audio).

In this post I will look only at the filters having unit Mel band support. These filters are applied to each row of the normalized 0.1 \mathbf{M}_\textrm{dB} with zero padding on either side to produce a row of the same length. This operation results in a matrix of the same size, which can still be interpreted as a dB Mel spectrogram. What do these filters look like? To what time-domain structures are they sensitive? What do the output dB Mel spectrograms look like? What do they sound like?

To answer these questions, we have to tease out the parameters of CNN_1. I don’t yet know how to find these in the tensorflow API, but I can just send an impulse into each filter to get its impulse response (IR) – which are the parameters. (This also entails setting each unit activation to be linear, and removing the bias.) Let’s first have a look at the energies of these IRs.

These are the energies of the filter IRs having unit Mel band support from CNN_1 trained on MSD and CNN_1 trained on MTT (all with reference to the same energy, 1), and each sorted according to decreasing energy. Basically, if an IR has an energy less than -60 dB, it is pretty much nullifying whatever it is being convolved with. We see that the majority of IRs from MSD have very small energies: only three of 51 filters of length 128 have an IR with a significant amount of energy. This increases to 6 filters of length 64, and 17 of length 32 having a significant amount of energy. This raises the possibility that the majority (83%) of these 153 neurons in CNN_1 for MSD are “dead”, or are transmitting useless information (a hypothesis we will test later). For the IRs of the MTT model things are quite different. Why does this happen? What is so different between MSD and MTT? Does this come from a difference in the training strategy to create these models?
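
As an aside, the impulse trick mentioned above is easy to sanity check outside of tensorflow; a sketch with a random kernel standing in for one temporal filter:

import numpy as np

# Hypothetical length-32 kernel standing in for one temporal CNN_1 filter.
h = np.random.randn(32)

# One row of a "spectrogram" that is all zeros except for a single impulse.
x = np.zeros(187)
x[93] = 1.0

# Convolving an impulse with a filter returns a shifted copy of the filter:
# its impulse response. Probing a unit the same way (bias removed, linear
# activation) therefore recovers its kernel.
y = np.convolve(x, h, mode="full")
assert np.allclose(y[93:93 + 32], h)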

Let us look at the three IRs of the length-128 filters in MSD having significant energy.

I scale the alpha value according to energy, so the least transparent line belongs to the IR with highest energy. The blue line shows a somewhat periodic structure, with tall peaks separated by about 0.48 s, which is a frequency of about 2.08 Hz, or about 127 beats per minute. Could this filter be acting as a 127 BPM tempo detector? Let’s have a look at the magnitude response of each of these IRs (analyzed with a boxcar window to minimize main lobe width):

The colors and alpha values correspond to those in the time-domain IR plot. We see that the 128-length IR with the greatest energy has a peak at 2 Hz, as we predicted, but the peaks more than 15 dB above that one are at 8.4 Hz and 17 Hz. The second most energetic IR has peaks near 2 and 4 Hz, as well as around 8 and 12.5 Hz. The 128-length filter with the least non-negligible energy is acting as a lowpass filter, attenuating frequencies above 5 Hz by at least 7 dB.
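
For reference, here is roughly how such a magnitude response can be computed, assuming the analysis-window hop of 16 ms, i.e., a frame rate of 62.5 Hz:

import numpy as np

frame_rate = 62.5                   # analysis windows are hopped every 16 ms
h = np.random.randn(128)            # stand-in for an extracted length-128 IR

nfft = 4096                         # zero-pad for a finer frequency grid
H = np.fft.rfft(h, n=nfft)          # boxcar window, i.e., no windowing at all
freqs = np.fft.rfftfreq(nfft, d=1.0 / frame_rate)   # in Hz, up to 31.25 Hz
mag_db = 20 * np.log10(np.abs(H) + 1e-12)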

Let’s pass the rows of 0.1 \mathbf{M}_\textrm{dB} through each of these filters and look at the results. Here’s what happens with the IR having the most energy. I interleave the unfiltered and filtered matrices so I can better see the results.

It seems that the horizontal structures of the original are pretty much obliterated, and some vertical structures remain. Here’s the sonification:

Pretty reverby! Not sure what useful information is there. Here’s the result of applying the second filter:

I clearly hear in this result 4 onsets per second, or 240 bpm, but not aligned with the underlying music (the FIR filter introduces a delay, which is why features are shifted to the left). This is interesting because the tempo of the original excerpt is 120 bpm. Is this filter acting like a matched filter? Here’s a reminder of what the sonification of the unfiltered Mel spectrogram sounds like:

Now here’s the result with the third 128-length filter having non-negligible energy:

This seems to have inverted the polarity of the original, where large values appear in Mel bands and at times where there is not much energy. Here’s the sonification:

Convolving the Mel spectrogram with the fourth most significant IR produces a result having a maximum magnitude of 3e-6 (down from a maximum magnitude of 5.7). So this IR and those of the 47 other 128-length filters in CNN_1 trained on MSD obliterate the information (and as we will see later, will never see the light of day as a result of the max pooling down the processing line).

Of the length-64 IRs of CNN_1 trained on MSD, there are six with non-negligible energies. Here are the IRs:

and here are their magnitude responses (analyzed with a boxcar window):

There’s a variety of things going on with these. Let’s have a look at and a listen to their effects on Disco Duck.

Here’s the effect of the filter with the largest IR energy:

Here’s the second largest energy 64-length IR:

This one produces a clear regular pulse which is out of sync with the underlying music.

Here are the effects of the fifth most energetic IR:

And the final significant filter of length 64:

This one is inverting the dB Mel spectrogram, just like the third most energetic length-128 filter.

Finally, we reach the 17 length-32 filters of the MSD-trained CNN_1. Here are their spectra:

The ones with the most energy (blue, red, green and orange) are highpass filters. Here’s what the most energetic one does to Disco Duck:

Being a highpass filter, this filter is taking something like the derivative along time.

I think the 9th most energetic IR is interesting:

Clearly visible in the speech is how this filter seems to make increasing frequency have a large value (darker), and a decreasing frequency have a small value (lighter). Partials, e.g. those around 6 s, seem to be attenuated according to the frequency increasing or decreasing. This would be very strange because these filters do not care about what is happening in any other band. They are applied on each row of the normalized dB Mel spectrogram. This is actually an “embossing” effect of a highpass filter. Consider the following signal created from chirps going up and down, and processed by this filter.

The left edge is highlighted in each case. If the filter was only picking out components with decreasing frequency, then we wouldn’t see the component with increasing frequency.

We have yet to look at the unit Mel band support filters of MTT CNN_1, but let’s forgo that for now. In the next part I will investigate the other 408 filters of the MSD CNN_1, i.e., the ones having a support of seven Mel bands.

Digging into MusiCNN, pt. 3

In part 1, I review the MusiCNN system and present some curious results: while its choice of tags is not entirely odd for the given excerpt of music, its output seems very sensitive to irrelevant characteristics of the audio data. In part 2, I look at the architecture of the “spectrogram”-based model trained on 3-second audio segments. In this part I inspect how the input to MusiCNN is computed:

The system first computes the short-time Fourier transform of the 3-s audio signal using a Hann window of size 32 ms and 50% overlap. This produces a matrix of size 257×187. Column j corresponds to the jth analysis window, which begins at time 16j ms. Row k corresponds to frequency 16000 k/512 Hz. Then the system takes the magnitude of each element of the matrix and raises it to a power of 2. This creates the power spectrogram of the audio signal. Let’s denote this matrix as \mathbf{S}.
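In librosa terms, this first step looks roughly like the following (the file name is just illustrative):

import numpy as np
import librosa

# load a 3-second segment at 16 kHz
y, sr = librosa.load('segment.wav', sr=16000, duration=3.0)

# 32 ms Hann window (512 samples) with 50% overlap (hop of 256 samples)
D = librosa.stft(y, n_fft=512, hop_length=256, window='hann')
S = np.abs(D) ** 2     # power spectrogram

print(S.shape)         # 257 frequency rows, one column per analysis window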

The system then makes \mathbf{S} shorter – to a size of 96×187 – by a weighted sum of contiguous rows in each column. This is done using 96 “Mel-spaced frequency bands” from about 30 Hz to 8000 Hz. Here’s what the first 34 weighted bands look like. Each color represents a different band, and each dot represents the weight applied to the corresponding row of \mathbf{S}.

The first row of \mathbf{S} contributes nothing. This row represents the power of the mean (zero frequency) of the audio signal in each analysis window – which is going to be very small unless there is some low-frequency drift in the audio recording equipment. The first row of the new matrix is created by multiplying the second row of \mathbf{S} by 0.032. The second row is created by multiplying the third row of \mathbf{S} by 0.032. Gradually the bands become wider. The 34th row of the new matrix is created by summing 0.003 times the 35th row of \mathbf{S}, 0.028 times the 36th row and 0.001 times the 37th row.
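We can inspect these weights directly from the librosa Mel filterbank; the exact values depend on the librosa version and its defaults, but the sketch below shows where the numbers quoted above come from:

import numpy as np
from librosa import filters

# the Mel filterbank used by the front end: 96 bands for a 512-point DFT at 16 kHz
mel_basis = filters.mel(sr=16000, n_fft=512, n_mels=96)
print(mel_basis.shape)   # (96, 257)

# weights of the first band (which rows of S contribute, and by how much)
idx = np.nonzero(mel_basis[0])[0]
print(idx, mel_basis[0][idx])

# weights of the 34th band: a few contiguous rows of S are summed
idx = np.nonzero(mel_basis[33])[0]
print(idx, mel_basis[33][idx])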

Here are the weighted bands covering 1–8 kHz, but on a semi-log plot:

As the frequency increases, we see an increase in the number of rows of \mathbf{S} being weighted and summed to form each row of the new matrix.

Here is the source code where the above is computed:

S, n_fft = _spectrogram(
    y=y,                     # audio data
    S=S,                     # default is None
    n_fft=n_fft,             # MusiCNN specifies 512
    hop_length=hop_length,   # MusiCNN specifies 256
    power=power,             # default is 2
    win_length=win_length,   # MusiCNN specifies 512
    window=window,           # default is Hann window
    center=center,           # default is True, just centers the windowed data
    pad_mode=pad_mode,       # default is to pad with zeros
)

# Build a Mel filter, which are the weights we see in the plots above
mel_basis = filters.mel(sr=sr, n_fft=n_fft, **kwargs) # MusiCNN specifies sr = 16000

return np.einsum("...ft,mf->...mt", S, mel_basis, optimize=True) # weighted sum of rows

The last line multiplies the weights of the mel_basis against each column of \mathbf{S} and then sums them to form the rows of the shorter matrix, which I will denote by \mathbf{M}.
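For our two-dimensional \mathbf{S}, that einsum is nothing more than a matrix product of the Mel basis with \mathbf{S}, as this little check (with random stand-ins) shows:

import numpy as np

S = np.random.rand(257, 187)          # stand-in power spectrogram
mel_basis = np.random.rand(96, 257)   # stand-in Mel filterbank

M_einsum = np.einsum("...ft,mf->...mt", S, mel_basis, optimize=True)
M_matmul = mel_basis @ S              # the same thing for 2-D inputs

print(np.allclose(M_einsum, M_matmul))   # True
print(M_matmul.shape)                    # (96, 187)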

Why this unusual scaling? The spacing of these bands is motivated by the fact that we (humans) don’t hear frequency linearly in Hz. The Mel frequency scaling thus better reflects human hearing. The triangular weighting gives each band the same width (in Mels), emphasizes the center frequency of each band, and takes into account the power in adjacent bands. Each row in our new shorter matrix is thus linearly spaced in perceptual frequency over the bandwidth possible at this sampling rate.

The last step in creating the input processed by MusiCNN is the following:

audio_rep = np.log10(10000 * audio_rep + 1)

This scales our new matrix \mathbf{M} by 10000, adds 1 to each element, and takes the base-10 logarithm of each element. Another way to see this is that the ijth element of \mathbf{M}_\textrm{dB} is computed

0.1 [\mathbf{M}_\textrm{dB}]_{ij} = \log_{10}\frac{[\mathbf{M}]_{ij}+0.0001}{0.0001}

The right-hand side adds a small value to an element of \mathbf{M} (to avoid taking the logarithm of zero), and then divides by 0.0001. This produces a matrix with values in terms of decibels with the reference power being 0.0001. If the ijth element of \mathbf{M}_\textrm{dB} has a value of 6, then that means the ith Mel band of the jth analysis window has a power of about three times 0.0001. If it has a value of 0, then the ith Mel band of the jth analysis window has a power of zero. If it has a value of 12, then the ith Mel band of the jth analysis window has a power of about 14.8 times 0.0001. And so on.
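Inverting the expression above gives the Mel-band power as a function of the dB value, which is where those factors of 3 and 14.8 come from:

import numpy as np

# invert 0.1*M_dB = log10((M + 1e-4) / 1e-4) to recover the Mel-band power M
def mel_power(m_db, ref=1e-4):
    return ref * (10 ** (0.1 * m_db) - 1)

for v in (0, 6, 12):
    print(v, mel_power(v) / 1e-4)   # power as a multiple of the 1e-4 reference
# 0 -> 0.0, 6 -> ~2.98, 12 -> ~14.85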

As in scaling the frequency to account for human perception, this conversion of power to decibels is motivated by human perception of loudness: for a given frequency, perceived loudness grows more nearly with the logarithm of its power than with its amplitude. The input to MusiCNN is thus 0.1 \mathbf{M}_\textrm{dB}.

There’s one other detail to consider in the above: the impact of the Hann window over 32 ms. How does this window shape and size contribute to the leakage of spectral information into adjacent Mel-frequency bands? We can get an idea of this by computing the dB spectrum of the Hann window:
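A quick way to compute this spectrum is to zero-pad the 512-sample Hann window and take the magnitude of its discrete Fourier transform:

import numpy as np
from scipy.signal import get_window

win = get_window('hann', 512)      # 32 ms at 16 kHz
nfft = 8192                        # zero-pad for a smooth spectrum
W = np.fft.rfft(win, n=nfft)
freqs = np.fft.rfftfreq(nfft, d=1 / 16000)
mag_db = 20 * np.log10(np.abs(W) / np.abs(W).max() + 1e-12)
# the main-lobe shape and 3-dB bandwidth can be read off from freqs and mag_db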

We see the 3-dB bandwidth of this window is about 60 Hz. Modulating the window by sinusoids having frequencies in the bins of the discrete Fourier transform (length 512 at 16000 Hz), and overlaying the resulting frequency responses on the Mel bands nicely shows how the Mel bands are spaced according to the 3dB cutoff frequencies.

This shows that the first row of \mathbf{M}_\textrm{dB} doesn’t just contain information about the audio data at about 32 Hz. It contains information from the lowest frequency (0 Hz) as well as the next higher frequency at about 62 Hz (each reduced in power by at least half). The second row of \mathbf{M}_\textrm{dB} contains information at about 62 Hz, but also information from the next lowest frequency at 32 Hz and the next highest frequency at about 94 Hz (each reduced in power by at least half). The same holds for all rows of \mathbf{M}_\textrm{dB}.

Now let’s have a look at the above process with real audio data, in particular the GTZAN version of Disco Duck. Below is the music excerpt in the time domain, with each 3-second segment marked.

Excerpt of the GTZAN Disco Duck signal

Now we compute the short time Fourier transform of each segment and then convert to the power spectrogram. Each of the below segments is a matrix \mathbf{S}.

The minimum power we see is 1e-21, and the largest is 1534.

Now we compute weighted sums of each column over 96 Mel-frequency bands and produce several shorter matrices \mathbf{M}:

The minimum power we see is 1e-10, and the largest is 49. There are many more non-zero values there than appear. This will become more apparent with the next operation: logarithmic scaling. We multiply each value in this matrix by 10000, add 1, and then compute the base-10 logarithm:

The minimum value of 0.1 \mathbf{M}_\textrm{dB} we see is 8e-7, and its largest is less than 6. Its mean value is 1.58. These numbers give an idea of the magnitude of inputs to the first CNN layer of MusiCNN.

Now, what does the above sound like? We can think of the amplitude time-series in each Mel band as corresponding to an amplitude control signal for a sinusoid at the center frequency of the band. The first row of this matrix looks like this:

Now we just apply that as an amplitude envelope to a sinusoid with a frequency of 31.25 Hz. Then we do the same for the next row, but with a sinusoid with frequency 62.5 Hz. We continue in this way for all 96 bands, and then add them together. To make it sound better, we apply a bit of random deviation to the frequency of each oscillator. We also evaluate np.log10(20 * audio_rep + 1) rather than np.log10(10000 * audio_rep + 1). Here’s the code:

import numpy as np
import librosa
import mir_eval.sonify
import soundfile as sf
from musicnn import configuration as config # assumed source of SR, FFT_HOP, FFT_SIZE, N_MELS

file_name = '/Users/bobs/research/202210/musicnn/musicnn/discoduck_GTZAN.wav'
tmin = 15 # start time of excerpt
tmax = 27 # end time of excerpt
# compute Mel spectrogram
audio, sr = librosa.load(file_name, sr=config.SR)
audio_rep = librosa.feature.melspectrogram(y=audio, 
                                           sr=sr,
                                           hop_length=config.FFT_HOP,
                                           n_fft=config.FFT_SIZE,
                                           n_mels=config.N_MELS)
audio_rep = np.log10(20 * audio_rep + 1) # scale logarithmically

# create control signal for mir_eval.sonify
times = np.arange(0,audio_rep.shape[1])*config.FFT_HOP/config.SR
fftfreqs = config.SR*np.arange(0,config.FFT_SIZE/2+1)/config.FFT_SIZE
# Mel filterbank weights (assumed; not shown in the original snippet),
# used below to find the center frequency of each band
melfb = librosa.filters.mel(sr=config.SR, n_fft=config.FFT_SIZE, n_mels=config.N_MELS)

y_f0 = np.zeros((int(np.max(times)*config.SR),1)) # place to store audio samples
for bb in range(0,config.N_MELS): # for each band
    # get center frequency of band
    bandweights = melfb[bb,:]
    f0 = fftfreqs[np.argmax(bandweights)]
    # synthesize and add
    y_f0 += mir_eval.sonify.pitch_contour(times, 
                        f0*(np.ones((1,len(times)))+0.001*np.random.randn(1,len(times))), 
                        config.SR, audio_rep[bb,:]).reshape(-1,1)

y_f0 = y_f0[tmin*config.SR:tmax*config.SR] # take corresponding excerpt
# normalize and write soundfile
sf.write('BeforeCNN.wav', y_f0/np.max(np.abs(y_f0)), config.SR, subtype='PCM_24')

Here’s the output sound:

This sonification gives us an idea of what is heading into MusiCNN. It’s lo-fi, but I can hear and understand the voice; there are drums, maybe a bass. There is a plucky rhythm, and dynamic contrast. Next time we will examine the first layer of MusiCNN and see all the ways it processes this dB Mel spectrogram of Disco Duck.