How hard would it be for the public to curate massive amounts of training data for an open features/descriptors project? Instead of an API, just download the nightlies and drop them into your project. I have a feeling such a tool would help audio/video applications significantly.
I really like the idea of an "open source"-style project for training data. Such a concept would be applicable to almost any machine-learning-based project. Although there are academic datasets for most domains, they have their own limitations.
Often the secret sauce is not the algorithms themselves, but the availability of (massive/broad) training data.
Ironically, we already have a great source of training data that people can contribute to with minimal effort: the Facebook Graph API!
All we need is for someone to authorize the app with the "view photos" and "view friends' photos" permissions, then crawl those photos for faces. The tag data gives us a pixel location for the center of each tagged face, which you can compare against what your detector returns. Keep iterating over the millions of photos Facebook grants you access to, and you should reach Face.com-level accuracy fairly quickly!
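To make the comparison step concrete, here's a rough Python sketch (not Face.com's method, just OpenCV's stock Haar cascade) that scores a detector against crawled tag positions. The crawled_photos.json layout and field names are assumptions on my part; the conversion below assumes the Graph API's convention of reporting tag positions as percentage offsets into the image.

    import json
    import cv2  # OpenCV's Python bindings

    # Hypothetical dump from the crawler: each record has a local file path
    # plus the tag centers pulled from the Graph API, stored as percentage
    # offsets into the image.
    with open("crawled_photos.json") as f:
        photos = json.load(f)

    # Stock frontal-face cascade that ships with OpenCV.
    detector = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

    hits, misses = 0, 0
    for photo in photos:
        img = cv2.imread(photo["path"])
        if img is None:
            continue
        h, w = img.shape[:2]
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

        for tag in photo["tags"]:
            # Convert the percentage-based tag position to pixel coordinates.
            tx, ty = tag["x"] / 100.0 * w, tag["y"] / 100.0 * h
            # Count a hit if any detected box contains the tagged face center.
            if any(x <= tx <= x + bw and y <= ty <= y + bh
                   for (x, y, bw, bh) in boxes):
                hits += 1
            else:
                misses += 1

    print("recall against Facebook tags: %.2f" % (hits / float(max(hits + misses, 1))))

Every photo the cascade misses (or false-alarms on) then becomes a labeled example you could feed back into training.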
Now that we've started discussing this, I think I would like to start such a website/service. It wouldn't require too much extra time, and I would contribute everything to the public domain or make it available under a free license.
Does anyone have any recommendations about which algorithms I should use? Would I be sampling Haar-like features? (I've only worked with feature training a few times.)
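For what it's worth, you don't sample Haar-like features by hand; OpenCV's cascade trainer enumerates them and boosting picks the useful ones. Here's a minimal sketch of that workflow, with placeholder file names and parameter values (positives.txt lists annotated face boxes, negatives.txt lists background images):

    import os
    import subprocess

    if not os.path.isdir("cascade_out"):
        os.mkdir("cascade_out")  # opencv_traincascade writes its stages here

    # Pack the annotated positive samples into the .vec format the trainer expects.
    subprocess.check_call([
        "opencv_createsamples",
        "-info", "positives.txt",   # "<image path> <count> <x> <y> <w> <h> ..." per line
        "-vec", "faces.vec",
        "-num", "1000",
        "-w", "24", "-h", "24",
    ])

    # Train a boosted cascade of Haar-like features (Viola-Jones style).
    subprocess.check_call([
        "opencv_traincascade",
        "-data", "cascade_out",     # ends up containing cascade.xml
        "-vec", "faces.vec",
        "-bg", "negatives.txt",     # one background image path per line
        "-numPos", "900",
        "-numNeg", "2000",
        "-numStages", "20",
        "-featureType", "HAAR",
        "-w", "24", "-h", "24",
    ])

The resulting cascade_out/cascade.xml can then be loaded with cv2.CascadeClassifier just like the stock face model.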
It would be cool to train for other objects and features as well. Cars, body parts, animals...
Ideally you'd offer raw images and audio with only a few coarse categories. This is because 1) fine-grained categorization/labeling like you're suggesting doesn't scale, and 2) feature building and detection is the most important part anyway; shipping pre-built features kind of defeats the point when you're testing your own algorithm, IMO.
You could stratify the offering and charge for categories, pre-built features, etc.
A dataset useful for training on video, images, and audio will easily run to tens of gigabytes or more.
This service would probably be best modeled on the Mechanical Turk approach.
It's worth noting that YouTube is already something of a source.
There are existing datasets such as Caltech-256, and several others. Many papers presented at CVPR 2012 (http://www.cvpr2012.org/) were indeed based on very large datasets.
Do note that some of these datasets are fairly large (though they may or may not be as broad).
Which algorithms you use will depend on the complexity you can handle. OpenCV, IPP, and CCV (which was on Hacker News a few days back) could provide good algorithm options to pair with a training dataset you choose or create.
Data size is indeed one of the key factors to consider when building vision systems, and in this respect the current champ is ImageNet: http://image-net.org/
They have images labeled according to the WordNet ontology, and the dataset is still growing. They also have more detailed annotations on subsets of the data for various things (objects, attributes, etc.).
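If it isn't obvious what "labeled according to the WordNet ontology" buys you: ImageNet synset IDs are just WordNet noun offsets prefixed with "n", so you can walk the category hierarchy programmatically. A small sketch using NLTK's WordNet interface (the particular synset chosen here is only for illustration):

    from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be downloaded

    # Pick a concept and see where it sits in the ontology.
    dog = wn.synsets("dog", pos=wn.NOUN)[0]

    # ImageNet's wnid for a synset is "n" + the zero-padded WordNet offset.
    wnid = "n%08d" % dog.offset()
    print(wnid)

    # The hierarchy gives you coarser and finer labels for free.
    print([s.name() for s in dog.hypernyms()])     # broader categories, e.g. canine
    print([s.name() for s in dog.hyponyms()][:5])  # finer-grained sub-kinds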
The Viola-Jones algorithm and Haar-like features are really only suitable for faces. Solving object detection/recognition for general classes of objects is one of the largest open problems in computer vision and thousands of researchers around the world are looking at how to solve it.
If you're interested, results on the PASCAL VOC challenge [1] and the ImageNet Large Scale Visual Recognition Challenge [2] are considered the state of the art in computer vision (and therefore the most promising). If you look at the actual performance of the best contenders, you'll find that reported accuracies are often in the teens or twenties, and even those numbers are likely over-optimistic. So for most practical purposes, general object recognition doesn't work. At all.
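For context, those "accuracies" are average-precision numbers where a detection only counts if its bounding box overlaps the ground truth enough; PASCAL VOC's criterion is intersection-over-union above 0.5. A minimal sketch of that matching rule (boxes as (x1, y1, x2, y2) tuples):

    def iou(a, b):
        """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / float(union) if union else 0.0

    # VOC-style: a detection matches a ground-truth box only if IoU > 0.5,
    # and each ground-truth box can be matched at most once.
    print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.33, so this would not count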
However, there are some domain-specific approaches that work to some degree. Faces are one example. Another is pedestrian detection, which is getting fairly mature at this point. A third is plant recognition, as in my previous project Leafsnap [3].
My personal feeling is that there isn't going to be "one algorithm to rule them all", but rather a collection of dozens of algorithms to deal with the most common classes of objects (faces, people, cars, buildings, animals, text, etc.), and then a set of other algorithms for recognizing all other classes, depending on each class' characteristics (shape, material, configurations, variability, etc.)
Finally, a word of caution. Many young PhD students in vision start out dreaming of building such a website/service, and they have all had their hopes crushed. Different classes often need to be dealt with quite differently, and it's not even clear what kind of API would cover all the different cases. So if you're actually serious about this, I would recommend limiting your focus to a few specific domains and considering some actual use-cases, so that you have something concrete to aim for (i.e., "build an API so that application X is possible").
AFAIK, Face.com provides face recognition, not just detection. They presented their approach at YaC'2011 here in Moscow, and it is quite sophisticated; it's not Viola-Jones at all.