
GenAI novice here. What is the training data made of, and how is it collected? I guess no one will share details on it; otherwise it would make for a good technical blog post with lots of insights!

>At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.

>The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.




The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/


Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.


My question was specific to the Databricks model. If it followed Llama or OpenAI, they could add a line or two about it to make the blog post complete.


They have a technical report coming! Knowing the team, they will do a great job disclosing as much as possible.


The training data is pretty much anything you can read on the internet plus books.

This is then cleaned up to remove nonsense, certain technical file formats, and duplicated documents.
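
A minimal sketch of that cleaning step in Python (the length/character thresholds and exact-hash dedup are illustrative assumptions; real pipelines use far more heuristics plus fuzzy/MinHash dedup):

    import hashlib

    def clean(documents):
        """Drop junk and exact duplicates from an iterable of raw text documents."""
        seen_hashes = set()
        for doc in documents:
            text = doc.strip()
            # Skip very short documents and documents that are mostly non-text.
            if len(text) < 200:
                continue
            if sum(c.isalpha() or c.isspace() for c in text) / len(text) < 0.8:
                continue
            # Exact-duplicate removal via content hashing.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            yield text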

From this, they tend to weight some sources more heavily - e.g. Wikipedia gets a pretty high weighting in the data mix. Overall, these mixes run to multiple trillions of tokens.

GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, as it has a similar token count.
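
Roughly, the mixing works like this (toy numbers, not DBRX's or anyone's actual mixture; token counts and weights are made up for illustration):

    import random

    # Hypothetical sources: raw token counts vs. the sampling weight they get in the mix.
    sources = {
        "web_crawl": {"tokens": 1_800_000_000_000, "weight": 0.70},
        "code":      {"tokens":   300_000_000_000, "weight": 0.15},
        "books":     {"tokens":   150_000_000_000, "weight": 0.10},
        "wikipedia": {"tokens":    30_000_000_000, "weight": 0.05},  # upweighted vs. its raw size
    }

    def sample_source():
        names = list(sources)
        weights = [sources[n]["weight"] for n in names]
        return random.choices(names, weights=weights, k=1)[0]

    # Small, heavily weighted sources end up being seen for multiple epochs
    # before the overall token budget runs out.
    budget = 2_000_000_000_000  # assume a ~2T-token training run
    for name, s in sources.items():
        print(f"{name}: ~{budget * s['weight'] / s['tokens']:.1f} epochs")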


https://arxiv.org/abs/2305.10429 found that people have been overweighting Wikipedia, and that downweighting it improves things across the board, *including next-token prediction on Wikipedia itself*, which is frankly amazing.


Personally, I found looking at open source work to be much more instructive for learning about AI and how things like training data are put together from the ground up. I suspect that's because training data is one of the bigger moats an AI company can have, not to mention all the class action lawsuits surrounding it.

One of the best open source datasets that are freely available is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset would be the Falcon-RefinedWeb dataset [2].

[1]: https://arxiv.org/abs/2101.00027

[2]: https://arxiv.org/abs/2306.01116
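
If you want to poke at one of these without downloading the whole thing, something like this should work (assuming the Hugging Face datasets library and the tiiuae/falcon-refinedweb dataset id; the text column there is called "content", IIRC):

    from datasets import load_dataset

    # Stream instead of downloading the full dataset (it's very large).
    ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["content"][:200])
        if i >= 2:
            break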



