Good luck doing that in the United States. Pretty much all of that kind of data, while supposedly available under the FOIA, is paywalled by the government or third parties. But it would be interesting to see an AI model.
Harvard Law's Library Innovation Lab has the Caselaw Access Project, a complete collection of precedential US caselaw with structured metadata. It's at https://case.law. It's readable online (including the original PDFs for most), accessible via a REST API as fully structured documents, and available through bulk downloads. There are OCR mistakes here and there, but the accuracy is over 99%, even with weird things like the "long s" that looks like an f and was common before the 20th century. While updates are too slow to replace the commercial tools, it's perfect for uses like this... Rather, it will be in a little over 3 months. When it was released in Feb of 2018, an agreement with a funder limited access to 500 cases per day per user for 6 years, except for a few jurisdictions available now, so the entire corpus will be completely open then.
They scanned, OCR'd, and applied metadata to 40k volumes, and (digitally) redacted by hand all commercial material (e.g., headnotes, key citations) in all in-copyright volumes, so what's left is entirely in the public domain.
Disclosure: I worked on that project for several years.
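To make the 500-cases-per-day throttle concrete, here is a minimal sketch of a client-side budget counter. The names `DailyQuota` and `days_needed` are hypothetical helpers made up for illustration. They are not part of the case.law API, and the sketch makes no network calls.

```python
from datetime import date


class DailyQuota:
    """Client-side counter for a per-day request cap (hypothetical helper)."""

    def __init__(self, limit=500):
        self.limit = limit   # 500 matches the per-user cap described above
        self.day = None      # the day the current count belongs to
        self.used = 0

    def try_consume(self, today=None):
        """Count one fetch and return True if today's budget allows it."""
        today = today or date.today()
        if today != self.day:
            # A new day has started: reset the counter.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def days_needed(total_cases, limit=500):
    """Days required to fetch total_cases at `limit` per day (ceiling division)."""
    return -(-total_cases // limit)
```

For example, `days_needed(10_000)` is 20, so even a modest research sample takes weeks under the cap, which is why the throttled API is a poor fit for bulk use until the restriction lifts.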
> Please let me know how you find using the data and if you'd like to see any additions or changes!
I only just started yesterday but so far so good!
Having it in Parquet files on Git LFS makes a huge difference. It only took a few lines to add the entire dataset to our CI/CD cache, which is an improvement over the ingestion scripts we normally have to write, with change detection and all that. It took less than an hour to start running the cases through our pipeline. I wish all of the GovInfo bulk data were available this way!
To be clear, they are all in the public domain— Fastcase updates included. All of the proprietary info was redacted by hand and the opinions themselves are not copyrightable. The throttling is a contractual obligation to a project partner that limits Harvard's distribution of the cases until Feb of 2024, but that's it. There are also exceptions— cases where the publication is no longer in copyright, and jurisdictions that already publish their opinions online... There are 3 or 4. Those are accessible without throttling through the API and through bulk downloads right now.
This should have more up-to-date and accurate information than I do:
https://case.law/about
From what I've been told by people who know way more about it than I do, it is complete. One thing to consider is that official published precedential caselaw is from the appellate level up. From what I understand, lower court cases aren't published, though I believe they can be accessed through PACER... But I have no real legal research expertise. The first part of the project entailed a lengthy process in which several legal librarians, lawyers, and archivists with decades of experience mapped out exactly what reporters exist, which ones were "official" (considered authoritative by the courts), and when. Apparently nobody had done it before. Well, nobody that made their data available, anyway. HLSL is a library of last resort for law and all but a tiny fraction of the books we scanned are in their collection.
In the about page there's more detailed information about the scope and process.
This one isn't even paywalled. Won't give you access even if you have money.
> Applications for access to the Cambridge Law Corpus (CLC) can only be made by researchers who are employed full-time by a recognised university or other research institution. The applicant must hold a permanent position at the level of Assistant Professor (or higher) or equivalent.
So corporate access will be quickly sorted. Maybe a $1 million research grant by a GAFAM corporation to a promising and enterprising Computer Science Department...