Projects like this are inevitable and necessary; 'OpenAI' make such a mockery of their name that it's an invitation to others to try and build an alternative that is actually open.
Maybe that's their secret game - tease us to death, to make us follow along the path and contribute. The waiting period for Dall-E 2 is hard to bear, maybe we'll invent something better before they release it.
I mean, diffusion models became the best generative image models in this fashion after Dall-E 1, and now Dall-E 2 is already adopting the idea in place of auto-regressive image generation.
In the last few years OpenAI took the AI crown with GPT-2, GPT-3, CLIP and Dall-E. They rarely shared, but they kick-started everyone else into replication mode.
A nice recent result (DeepMind) is that you can either make the dataset 4x larger or the network 4x larger to get the same result. So a large dataset could create a more efficient/smaller model and in turn it could be easier to distribute and use.
Their marketing is so bad. Terrible website, they present themselves first by opposing OpenAI, they name their datasets the way established orgs name their models. Their only project is a non-curated filtering of already open source data using CLIP (they just looped over it and dropped the image-text pairs with cosine similarity below 0.3).
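For anyone who wants to see what that filtering step amounts to, here is a rough sketch in plain Python. It assumes the image and text embeddings have already been computed by CLIP; the toy vectors, the field names (`image_emb`, `text_emb`) and the helper names are made up for illustration and are not taken from LAION's actual pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as "no similarity" instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.3):
    """Keep pairs whose image/text similarity is at or above the threshold."""
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]


# Hypothetical toy data: real CLIP embeddings have hundreds of dimensions.
toy_pairs = [
    {"url": "a.jpg", "caption": "a cat", "image_emb": [1.0, 0.0], "text_emb": [0.9, 0.1]},
    {"url": "b.jpg", "caption": "buy now", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]

kept = filter_pairs(toy_pairs)
print([p["url"] for p in kept])
```

The second pair is dropped because its caption embedding is orthogonal to the image embedding, which is the kind of unrelated alt text the 0.3 cutoff is meant to remove.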
Let me start by saying that LAION is a non-profit, open to anyone who wants to contribute.
Agreed about the website CSS. Do you want to contribute?
What's the problem with the dataset name exactly? Seems to work pretty well.
Yes, the dataset is an extract of Common Crawl. This is a method for producing valuable datasets that is accessible to all, unlike supervised datasets, which are reserved for organizations with millions of dollars to spend on annotation and which do not scale.
Non-annotated datasets are the base of self-supervised learning, which is the future of machine learning. Image/text with no human label is a feature, not a bug. We provide safety tags for safety concerns and watermark tags to improve generations.
It also so happens that this dataset collection method has been validated: LAION-400M was used to reproduce the CLIP model (and to train a bunch of other models).