This is a great effort, and it's important to have datasets like this available to democratise learning about and working with ML.
One small comment: It would be great for this (and other) datasets to give a quick "sample data" file - preferably one that doesn't need to be downloaded to be viewed. Even a screenshot of some of the data would be useful for people browsing to get a quick understanding of the actual content, and how it is formatted. Downloading gigabytes of data just to have a look isn't practical.
The test and validation datasets are under half a gig; I agree it'd be nice if there were a few example items and perhaps a JSON Schema [0] for it, though.
I'll reproduce a line here:
{"text": "Roman Catholic Diocese of Tambacounda\n\nThe Roman Catholic Diocese of Tambacounda () is a diocese located in the city of Tambacounda in the Ecclesiastical province of Dakar in Senegal.\n\nHistory\n August 13, 1970: Established as Apostolic Prefecture of Tambacounda from the Diocese of Kaolack and Diocese of Saint-Louis du S\u00e9n\u00e9gal\n April 17, 1989: Promoted as Diocese of Tambacounda\n\nSpecial churches\n The cathedral is Cath\u00e9drale Marie Reine de l\u2019Univers in Tambacounda, which is located in the Medina Coura neighborhood of the town.\n\nLeadership\n Bishops of Tambacounda (Roman rite)\n Bishop Jean-No\u00ebl Diouf (since 1989.04.17)\n Prefects Apostolic of Tambacounda (Roman rite) \n Fr. Cl\u00e9ment Cailleau, C.S.Sp. (1970.08.13 \u2013 1986.04.24)\n\nSee also\nRoman Catholicism in Senegal\n\nReferences\n\nExternal links\n GCatholic.org\n Catholic Hierarchy \n\nCategory:Roman Catholic dioceses in Senegal\nCategory:Tambacounda\nCategory:Christian organizations established in 1970\nCategory:Roman Catholic dioceses and prelatures established in the 20th century", "meta": {"pile_set_name": "Wikipedia (en)"}}
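For anyone who does grab one of the smaller files, here is a rough sketch of how to peek at the first few records without reading the whole thing into memory. It assumes the file is plain JSON Lines (one object per line, shaped like the line above, with a "text" string and a "meta" object containing "pile_set_name"); if the download is compressed you'd need to decompress it first. The function name `peek_records` and its parameters are made up for illustration, and the sample record is a shortened stand-in for a real file handle.

```python
import io
import json
from itertools import islice


def peek_records(lines, n=3, preview_chars=80):
    """Parse the first n non-empty JSON lines and return short summaries."""
    non_empty = (line for line in lines if line.strip())
    summaries = []
    for line in islice(non_empty, n):
        record = json.loads(line)
        text = record.get("text", "")
        summaries.append({
            # Assumed layout: {"text": ..., "meta": {"pile_set_name": ...}}
            "pile_set_name": record.get("meta", {}).get("pile_set_name"),
            "chars": len(text),
            "preview": text[:preview_chars].replace("\n", " "),
        })
    return summaries


# Stand-in for an open file: one shortened record in the same shape as above.
sample = io.StringIO(
    '{"text": "Roman Catholic Diocese of Tambacounda\\n\\nThe Roman Catholic Diocese ...", '
    '"meta": {"pile_set_name": "Wikipedia (en)"}}\n'
)
for summary in peek_records(sample):
    print(summary)
```

With a real file you would pass the open handle instead, e.g. `with open(path, encoding="utf-8") as f: peek_records(f)`, where `path` is wherever you saved it. Because only the first `n` lines are consumed, this stays cheap even on a multi-gigabyte file.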