Good luck doing that in the United States. Pretty much all that kind of data, wh...

chefandy · on Sept 24, 2023

Harvard Law's Library Innovation Lab has the Caselaw Access Project, a complete collection of precedential US caselaw with structured metadata. It's at https://case.law. It's readable online (including the original PDFs for most), accessible via a REST API in fully structured documents, and available through bulk downloads. There are OCR mistakes here and there, but the accuracy is over 99% even with weird things like the "long s" that looked like an f that was common before the 20th century. While updates are too slow to replace the commercial tools, it's perfect for uses like this... Rather, it will be in a little over 3 months. In Feb of 2018, it was released under an agreement with a funder to limit access to 500 cases per day per user for 6 years, except for a few jurisdictions available now, so the entire corpus will be completely open then.

They scanned, OCR'd, and applied metadata to 40k volumes, and (digitally) redacted by hand all commercial material (e.g. headnotes, key citations) in all in-copyright volumes, so what's left is entirely in the public domain.

Disclosure: worked on that project for several years.
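
For anyone scripting against the throttled jurisdictions before the restriction lifts, here is a minimal sketch of a client-side counter for the 500-cases-per-day budget mentioned above. The class name `DailyQuota` and the reset at the local calendar-day boundary are assumptions for illustration; how the server actually counts and resets the limit is not stated here, so treat this as a courtesy guard, not a mirror of the real policy.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DailyQuota:
    """Hypothetical client-side counter for a per-day case budget."""

    limit: int = 500  # cases per day per user, as described in the comment above
    used: int = 0
    day: date = field(default_factory=date.today)

    def try_consume(self, today: Optional[date] = None) -> bool:
        """Count one case against the budget; return False if it is exhausted."""
        today = today or date.today()
        if today != self.day:
            # Assumption: the budget resets when the calendar day changes.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

A fetch loop would call `try_consume()` before each case request and stop (or sleep until the next day) when it returns `False`.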

leppert · on Sept 24, 2023

Thanks Andy. We're also in the midst of releasing the Collaborative Open Legal Dataset ("COLD cases") specifically aimed at AI/ML work. More tk. https://huggingface.co/datasets/harvard-lil/cold-cases

civilitty · on Sept 24, 2023

Any idea how complete the COLD dataset is compared to the Caselaw Access Project?

Unfortunately Caselaw limits access to the full text bulk data of most jurisdictions without a research account and I’m trying to find an alternative.

mdellabitta · on Sept 25, 2023

COLD is bigger than Caselaw Access Project. It's over 8 million cases vs 6.9 million on case.law.

Please let me know how you find using the data and if you'd like to see any additions or changes!

civilitty · on Sept 25, 2023

> Please let me know how you find using the data and if you'd like to see any additions or changes!

I only just started yesterday but so far so good!

Having it in Parquet files on Git LFS makes a huge difference. It only took a few lines to add the entire dataset to our CI/CD cache, which is an improvement over the ingestion scripts we normally have to write with change detection and all that. It took less than an hour to start running the cases through our pipeline - I wish all of the GovInfo bulk data were available this way!
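
To illustrate the kind of change detection that bulk sources without versioned files tend to require, here is a minimal sketch that hashes local files and compares them to a saved manifest. The function names, the `*.parquet` glob, and the JSON manifest layout are all assumptions for illustration, not anything from the COLD repository or the pipeline described above.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(data_dir: Path, manifest_path: Path) -> list:
    """List files that are new or modified since the last run.

    The manifest (hypothetical layout: relative path -> digest) is
    rewritten on every call so the next run compares against this one.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {
        p.relative_to(data_dir).as_posix(): file_digest(p)
        for p in sorted(data_dir.rglob("*.parquet"))
    }
    manifest_path.write_text(json.dumps(new, indent=2))
    return [name for name, digest in new.items() if old.get(name) != digest]
```

With the data in a Git LFS repository, the commit hash already plays the role of this manifest, which is presumably why a cache keyed on it only takes a few lines.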

mdellabitta · on Sept 26, 2023

I'm really glad you appreciate that!

chefandy · on Sept 26, 2023

That's rad. I'll have to dig into it.

chefandy · on Sept 25, 2023

> Unfortunately Caselaw limits access to the full text bulk data of most jurisdictions without a research account

That's only true until Feb of 2024! It should be totally unrestricted in a few months.

civilitty · on Sept 25, 2023

Great to know, thank you!

Any idea why they're time limited? I assumed that was a license restriction from reporters or Fastcase et al., which would have been permanent.

chefandy · on Sept 26, 2023

To be clear, they are all in the public domain, Fastcase updates included. All of the proprietary info was redacted by hand and the opinions themselves are not copyrightable. The throttling is a contractual obligation to a project partner that limits Harvard's distribution of the cases until Feb of 2024, but that's it. There are also exceptions: cases where the publication is no longer in copyright, and jurisdictions that already publish their opinions online... There are 3 or 4. Those are accessible without throttling through the API and through bulk downloads right now.

This should have more up-to-date and accurate information than I do: https://case.law/about

riku_iki · on Sept 24, 2023

> a complete collection of precedential US caselaw with structured metadata

Is it complete? For example, it says there are 144k cases in California; I would expect more.

chefandy · on Sept 24, 2023

From what I've been told by people who know way more about it than I do, it is complete. One thing to consider is that official published precedential caselaw is from the appellate level up. From what I understand, lower court cases aren't published, though I believe they can be accessed through PACER... But I have no real legal research expertise. The first part of the project entailed a lengthy process involving several decades-experienced legal librarians, lawyers, and archivists mapping out exactly what reporters exist and which ones were "official" (considered authoritative by the courts) and when. Apparently nobody had done it before. Well, nobody that made their data available, anyway. HLSL is a library of last resort for law and all but a tiny fraction of the books we scanned are in their collection.

In the about page there's more detailed information about the scope and process.

riku_iki · on Sept 24, 2023

this makes sense, thank you!

hnfong · on Sept 24, 2023

This one isn't even paywalled. Won't give you access even if you have money.

> Applications for access to the Cambridge Law Corpus (CLC) can only be made by researchers who are employed full-time by a recognised university or other research institution. The applicant must hold a permanent position at the level of Assistant Professor (or higher) or equivalent.

slavboj · on Sept 24, 2023

Not even adjuncts and postdocs, brutal.

belter · on Sept 24, 2023

So corporations' access will be quickly sorted. Maybe a 1 million dollar research grant by a GAFAM corporation to a promising and enterprising Computer Science Department...

lettergram · on Sept 24, 2023

You can get all the patent data from the USPTO for free:

https://github.com/lettergram/parse-uspto-xml/tree/master