Interesting that they sent a C&D considering the algorithm is likely patent pending at best. Aren't there other services, like Midomi, that likely use the same 'obvious' algorithm of storing a database of frequency points per song and comparing it to the sample?
It shows how sick the patent system has become. The whole reason for patents is to provide public disclosure of the methods in the patent for the public benefit, so that they can be discussed, improved upon and not lost to humanity when a particular invention is no longer made by its inventor. It is explicitly part of their purpose that they become public knowledge and can be openly discussed.
Perhaps that's why the post is back up - I don't believe the bits and pieces described in the article are patentable. (Fourier analysis has been used in signal processing for decades. Adding a nearest-neighbor lookup - also known for decades - is a trivial extension.)
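To make the "Fourier plus nearest-neighbor lookup" idea concrete, here is a minimal sketch in plain Python. It is not the article's code and not Shazam's; the function names, the toy "songs" made of pure tones, and the frame size of 64 are all assumptions for illustration. It stores one dominant-frequency point per frame for each song and picks the song whose points are closest to the sample's.

```python
import cmath
import math

def dominant_bin(frame):
    """Return the index of the strongest non-DC bin of a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin

def frequency_points(samples, frame_size=64):
    """One dominant-frequency point per non-overlapping frame."""
    return [dominant_bin(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def tone(bins, frame_size=64):
    """Toy 'song' (hypothetical): one pure tone per frame, given as DFT bin numbers."""
    return [math.sin(2 * math.pi * b * i / frame_size)
            for b in bins for i in range(frame_size)]

def nearest_song(database, query_points):
    """Nearest-neighbor lookup: smallest summed distance over all alignments.

    Assumes every stored song is at least as long as the query.
    """
    m = len(query_points)

    def distance(song_points):
        return min(sum(abs(a - b)
                       for a, b in zip(song_points[s:s + m], query_points))
                   for s in range(len(song_points) - m + 1))

    return min(database, key=lambda name: distance(database[name]))

database = {
    "song_a": frequency_points(tone([3, 5, 8, 5, 3, 10])),
    "song_b": frequency_points(tone([4, 4, 9, 12, 7, 2])),
}
query = frequency_points(tone([8, 5, 3]))  # excerpt from the middle of song_a
print(nearest_song(database, query))
```

A brute-force scan like this only shows the principle; it says nothing about how a real service indexes millions of tracks.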
I did a master's degree on pattern matching on text in 1999 (same year Shazam started), and it was obvious then that many of the same concepts (a lot of them from textbooks stemming from the 1970s) could be used for pattern matching on music. (We even discussed implementing music matching, but couldn't see a market for it.)
There may very well be other parts of music fingerprint algorithms that are patentable, but I have a hard time believing that the parts described in the article could be.
One nontrivial part is transforming the spectrogram into some representation that is robust to the things that can affect the query audio, like background noise. Another nontrivial part is figuring out, given this representation, how to quickly match the query with a song or database of songs.
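As a rough illustration of the first part, one simple way to get a noise-tolerant representation is to throw away everything except the strongest bin in each frequency band of each frame. The sketch below assumes a spectrogram is already available as a list of frames of magnitudes; the band boundaries, the toy data, and the noise model are made up for the example and are not taken from any real system.

```python
import random

def band_peaks(spectrogram, bands=((0, 4), (4, 8), (8, 16))):
    """For each frame, keep only the strongest bin in each frequency band."""
    points = set()
    for t, frame in enumerate(spectrogram):
        for lo, hi in bands:
            k = max(range(lo, hi), key=lambda i: frame[i])
            points.add((t, k))
    return points

def add_noise(spectrogram, level, seed=0):
    """Add uniform background noise in [0, level) to every bin (toy noise model)."""
    rng = random.Random(seed)
    return [[m + rng.uniform(0, level) for m in frame] for frame in spectrogram]

# Toy spectrogram: 3 frames x 16 bins, flat background with a few strong peaks.
peak_positions = [(0, 2), (0, 5), (0, 12),
                  (1, 3), (1, 6), (1, 9),
                  (2, 1), (2, 7), (2, 15)]
clean = [[1.0] * 16 for _ in range(3)]
for t, k in peak_positions:
    clean[t][k] = 10.0

noisy = add_noise(clean, level=3.0)

# The peak set survives as long as the noise stays well below the peaks.
print(band_peaks(clean) == band_peaks(noisy))
```

The point of the design is that the absolute magnitudes, which noise and EQ change a lot, are discarded, and only the positions of the peaks are kept.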
This post only touches on the actual way that Shazam recognises music. The meat of the algorithm is in the feature detection, which must be much more robust than what the author did. His way is very sensitive to noise and maybe other manipulations (e.g. dilation), though I didn't take too good a look at it.
Basically, to get good features you need to find a set of candidate features (points in the image that stand out) and then filter them according to some desirable properties (in standard image recognition, for example, you want them to be resistant to rotation, translation, stretching, etc). There exist very good algorithms for finding these features, which is what Shazam uses. Of course, after finding the features, you need to hash them in a way that is able to match any time in the song, resist noise, etc etc, so it's a very interesting application.
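For the hashing step, here is a hedged sketch of one commonly described idea: hash pairs of nearby peaks together with their time difference, then let every matching hash vote for a (song, time offset) pair. I am not claiming this is exactly what Shazam does; the peak lists, the fan-out of 3, and all names below are assumptions for illustration. Peaks are (time, frequency) tuples that some earlier feature detection step is assumed to have produced.

```python
from collections import Counter, defaultdict

def pair_hashes(peaks, fan_out=3):
    """Yield ((f1, f2, dt), t1) for each anchor peak paired with the next few peaks."""
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            yield (f1, f2, t2 - t1), t1

def build_index(songs):
    """Map each hash to the (song, anchor time) places where it occurs."""
    index = defaultdict(list)
    for name, peaks in songs.items():
        for h, t in pair_hashes(peaks):
            index[h].append((name, t))
    return index

def identify(index, query_peaks):
    """Vote for (song, time offset); a true match piles votes on one offset."""
    votes = Counter()
    for h, t_query in pair_hashes(query_peaks):
        for name, t_song in index.get(h, ()):
            votes[(name, t_song - t_query)] += 1
    return votes.most_common(1)[0] if votes else None

songs = {
    "song_a": [(0, 10), (1, 22), (2, 15), (3, 30), (4, 12), (5, 27), (6, 18), (7, 25)],
    "song_b": [(0, 11), (1, 20), (2, 14), (3, 31), (4, 13), (5, 26), (6, 19), (7, 24)],
}
index = build_index(songs)

# Excerpt of song_a starting at time 3, re-timed to start at 0,
# plus one spurious peak (1, 40) standing in for background noise.
query = [(0, 30), (1, 12), (1, 40), (2, 27), (3, 18)]
best = identify(index, query)
print(best)
```

Because the hash contains only frequencies and a time difference, the same hash is produced wherever the excerpt starts, which is what lets the query match "any time in the song". The spurious peak just produces hashes that find nothing in the index.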
I was very surprised to learn that the best way to recognise a song was to convert it into a picture and then try to recognise that picture. It seems like a roundabout way to do it, but it works very well.
You can find more information in this paper: Viola and Jones, "Rapid object detection using a boosted cascade of simple features"
While the paper cited by 'StavrosK is indeed a very important one in computer vision (it's the basis for almost all modern face detection algorithms), it's not the one most relevant for Shazam.
SIFT is another seminal work in computer vision, and it solves both problems that 'StavrosK mentioned:
1. Find candidate feature points that stand out, and are reliably and repeatedly detectable despite image variations.
2. Get a "hash" of each point that can be used to do searches fairly quickly.
Lots of work in detecting and recognizing objects now uses some variant of SIFT, and it's finding usage in lots of other areas of vision as well. I wouldn't be surprised if as many as 10% of papers at the top vision conferences use techniques based on some variant of SIFT.
Ah, yes, I forgot SIFT. last.fm does recognition through the algorithm I mentioned, which is why that came to mind, but I think SIFT would work better, thanks for the info!
I never realized how similar Shazam was to the first Matlab assignment we had to do for an EE class in discrete time linear processing. The only difference is we had to match whale calls to a type of whale. Spectrum analysis and everything. Never thought to identify music.
Sometimes, it's way too easy to overlook the obvious.
Would this algorithm work with music that is hummed or sung live? Shazam doesn't work for that, but SoundHound does. What do you think the difference between the algorithms is?
It's died down a bit, but for a while in the late-90s / early-2000s, there was a whole research subfield dedicated to retrieving audio via humming, which has some interesting stuff in it: http://scholar.google.com/scholar?hl=en&q=%22query+by+hu... It's got a nice mixture of signal processing, sequence alignment, and approximate pattern matching problems all rolled together.
I'm fairly sure that SoundHound does not match hummed or sung queries with album recordings, but with other hummed melodies submitted by users and labeled with the song ID.
I would imagine the algorithm is different, since the Shazam algorithm is designed to find exact matches corrupted by noise, EQ, etc., and two hummed versions of the same melody may vary in key, tempo, and timbre, and have small rhythmic differences.
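To illustrate why hummed queries need a different kind of matching, here is a small sketch that compares melodies by their pitch intervals (which do not change when the key changes), ignores note durations (so tempo does not matter), and uses edit distance to tolerate a few wrong notes. This is only a guess at the general flavor of query-by-humming, not SoundHound's method; it assumes pitch tracking has already turned the audio into MIDI note numbers, and the tiny database is invented for the example.

```python
def intervals(pitches):
    """Successive pitch differences in semitones; unchanged by transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def closest_melody(database, hummed):
    """Return the stored melody whose interval sequence is nearest to the query's."""
    q = intervals(hummed)
    return min(database,
               key=lambda name: edit_distance(intervals(database[name]), q))

# MIDI note numbers for two well-known openings (hypothetical database).
database = {
    "ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "twinkle": [60, 60, 67, 67, 69, 69, 67],
}

# Hummed five semitones higher, with one wrong note (68 instead of 69).
hummed = [69, 69, 70, 72, 72, 70, 68, 67]
print(closest_melody(database, hummed))
```

An exact-match fingerprint would fail here, since none of the absolute frequencies in the hummed version line up with the recording.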
I really enjoyed that article. It was very honest, and the guy obviously enjoys what he does, which really shines through in his writing.
It's also nice to see the thinking process he went through when developing his solution to this.
Yes, the algorithm might not be perfect and it isn't a Shazam clone, but it does demonstrate that within 48 hours he created something that could recognise music. Now that's a good effort in my eyes!