A popular dataset to use in music autotagging is CAL500. It is described as coming from “500 songs.”
When we download the soundfiles, we find 500 songs; but when we look at the song names, we find 502 titles.
Even more puzzling, if we look at the Echo Nest IDs, we find 503 titles!
After some digging around, I found the problems, and a few others.
I could find no information about these discrepancies online, so below I present the problems and my solutions.
- The extra title in the Echo Nest IDs is “pink_floyd-echoes_15 TRQVSAX12527819B19”. Because the same file already lists “pink_floyd-echoes TRBCHNJ1251AE42CB9”, I remove the extra entry from it.
- The two songs missing from the soundfiles are: “crosby_stills_and_nash-guinnevere TRALRAI1251FD19699” and “radiohead-karma_police TRZEWZE1251AE72E42”. I have obtained them through other means, and upload them here.
- The mp3 file “jade_leary-going_in.mp3” is only 313 bytes. The actual track is just 21 seconds long, and I was able to download the entire “song” from Amazon. I upload it here.
- The mp3 file “frank_zappa-whats_the_ugliest_part_of_your_body.mp3” is 2 seconds long. I downloaded the entire song from YouTube, and upload it here.
- The mp3 file “guided_by_voices-kicker_of_elves.mp3” is 4 seconds long. I downloaded the entire song from YouTube, and upload it here.
- The mp3 file “jackalopes-rotgut.mp3” is 3 seconds long. I downloaded the entire song from YouTube, and upload it here.
- In “songnames.txt”, the first few files are listed in an order that differs from what a “dir” or “ls” command gives, and also from the order in the Echo Nest IDs. Since the orders disagree, I assume that the 502 sets of annotations in “hardAnnotations.txt” and “softAnnotations.txt” follow the order given in “songnames.txt”.
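
For anyone who wants to repeat these checks on their own copy, here is a minimal Python sketch of how the comparisons above could be automated. It is not the procedure I followed by hand, and it rests on several assumptions: that the soundfiles sit in one directory as “.mp3” files, that “songnames.txt” has one name per line, and that the Echo Nest ID file has one “name ID” pair per line as in the entries quoted above. The function names, the ID file path, and the 50,000-byte threshold are all hypothetical choices for illustration.

```python
from pathlib import Path


def read_names(path):
    """Return the first whitespace-separated token of each non-empty line."""
    names = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            names.append(line.split()[0])
    return names


def compare_names(audio_dir, songnames_path, ids_path):
    """Report names that appear in one source but not in another."""
    # Assumption: each soundfile is named "<song name>.mp3".
    audio = {p.stem for p in Path(audio_dir).glob("*.mp3")}
    titles = set(read_names(songnames_path))
    ids = set(read_names(ids_path))
    return {
        "titles_without_audio": sorted(titles - audio),
        "ids_without_title": sorted(ids - titles),
        "audio_without_title": sorted(audio - titles),
    }


def find_small_files(audio_dir, min_bytes=50_000):
    """List (file name, size in bytes) for mp3 files smaller than min_bytes.

    The threshold is an arbitrary guess; a small file is only a candidate
    for a truncated excerpt and still has to be checked by listening.
    """
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(audio_dir).glob("*.mp3")
        if p.stat().st_size < min_bytes
    )
```

Calling `compare_names("audio", "songnames.txt", "ids.txt")` (with whatever paths apply to your copy) returns the names to investigate, and `find_small_files("audio")` lists files that may be truncated. File size is only a rough proxy for duration, so a genuinely short track such as the 21-second one above may be flagged as well.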