
GenAI novice here. What is the training data made of, and how is it collected? I guess no one will share details on it; otherwise it would make for a good technical blog post with lots of insights!

>At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.

>The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.




The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/


Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.


My question was specific to the Databricks model. If it followed Llama or OpenAI, they could add a line or two about it to make the blog post complete.


They have a technical report coming! Knowing the team, they will do a great job disclosing as much as possible.


The training data is pretty much anything you can read on the internet plus books.

This is then cleaned up to remove nonsense, certain technical file formats, and duplicated documents.
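
A minimal sketch of that cleaning step in Python (the length/character thresholds and exact-hash dedup are illustrative assumptions; real pipelines use far more heuristics plus fuzzy/MinHash dedup):

    import hashlib

    def clean(documents):
        """Drop junk and exact duplicates from an iterable of raw text documents."""
        seen_hashes = set()
        for doc in documents:
            text = doc.strip()
            # Skip very short documents and documents that are mostly non-text.
            if len(text) < 200:
                continue
            if sum(c.isalpha() or c.isspace() for c in text) / len(text) < 0.8:
                continue
            # Exact-duplicate removal via content hashing.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            yield text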

From this, they tend to weight some sources more heavily - e.g. Wikipedia gets a pretty high weighting in the data mix. Overall, these mixes run to multiple trillions of tokens.

GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, as it has a similar token count.
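
Roughly, the mixing works like this (toy numbers, not DBRX's or anyone's actual mixture; token counts and weights are made up for illustration):

    import random

    # Hypothetical sources: raw token counts vs. the sampling weight they get in the mix.
    sources = {
        "web_crawl": {"tokens": 1_800_000_000_000, "weight": 0.70},
        "code":      {"tokens":   300_000_000_000, "weight": 0.15},
        "books":     {"tokens":   150_000_000_000, "weight": 0.10},
        "wikipedia": {"tokens":    30_000_000_000, "weight": 0.05},  # upweighted vs. its raw size
    }

    def sample_source():
        names = list(sources)
        weights = [sources[n]["weight"] for n in names]
        return random.choices(names, weights=weights, k=1)[0]

    # Small, heavily weighted sources end up being seen for multiple epochs
    # before the overall token budget runs out.
    budget = 2_000_000_000_000  # assume a ~2T-token training run
    for name, s in sources.items():
        print(f"{name}: ~{budget * s['weight'] / s['tokens']:.1f} epochs")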


https://arxiv.org/abs/2305.10429 found that people have been overweighting Wikipedia, and that downweighting it improves things across the board, *including next-token prediction on Wikipedia itself*, which is frankly amazing.


Personally, I found looking at open source work to be much more instructive for learning about AI and how things like training data are put together from the ground up. I suspect that's because training data is one of the bigger moats an AI company can have, not to mention all the class action lawsuits surrounding it.

One of the best open source datasets that are freely available is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset would be the Falcon-RefinedWeb dataset [2].

[1]: https://arxiv.org/abs/2101.00027

[2]: https://arxiv.org/abs/2306.01116
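
If you want to poke at one of these without downloading the whole thing, something like this should work (assuming the Hugging Face datasets library and the tiiuae/falcon-refinedweb dataset id; the text column there is called "content", IIRC):

    from datasets import load_dataset

    # Stream instead of downloading the full dataset (it's very large).
    ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["content"][:200])
        if i >= 2:
            break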



