In memory of Aaron, bulk XML of every federal and state law and court ruling (webpolicy.org)
284 points by friendofaclu on Jan 8, 2014 | 55 comments


Which versions are these? I ask because, to lawyers, pagination matters -- a LOT. Lawyers refer to a case by the book and page number, and refer to parts of a case by the page numbers. It's probably not a great system, but the courts don't have anchors in text for better or worse. Also, what version is it as to corrections? Most courts issue typo changes after the fact. Most are minor, but a few do materially affect the meaning (such as changing the page numbers of the parts of a different case that have been overruled).

Don't get me wrong, it's better than nothing, but to get any buy-in from existing lawyers these issues need to be addressed. I'm concerned that they haven't been on the site.


> I ask because, to lawyers, pagination matters -- a LOT. Lawyers refer to a case by the book and page number, and refer to parts of a case by the page numbers. It's probably not a great system, but the courts don't have anchors in text for better or worse.

In effect, the volumes and page numbers from the official reporter are anchors in the text -- and are stored that way in other databases (and sometimes included as textual anchors in secondary printed references.)


They are anchors of a sort, they're just inconvenient because they change from version to version and often aren't included in online versions. If a court says (in slightly more words) "pages 205 to 207 are overruled" there are often paragraphs that span from pages 204-205 and from 207-208 that are ambiguous --- and it gets even worse when a different book version is paginated so that the range above covers part of page 621 to part of page 624. Anchors on paragraphs (if not lines) would be useful.


> They are anchors of a sort, they're just inconvenient because they change from version to version and often aren't included in online versions.

Most online versions I've seen do include both the citation and page numbers from the official reporters.

> If a court says (in slightly more words) "pages 205 to 207 are overruled" there are often paragraphs that span from pages 204-205 and from 207-208 that are ambiguous

US courts don't generally do that. They cite prior cases using page references, but they don't say "pages X-Y" are overruled. It's not a matter of more or fewer words, that's just not how they work at all. If they are reversing a lower court decision, they simply state that the decision is reversed (and if it is reversed in part, they describe which effects are reversed, which may not map to specific separable parts of the text). If they are stating a new legal rule overriding a prior precedent, they simply state the new legal rule.

> and it gets even worse when a different book version is paginated so that the range above covers part of page 621 to part of page 624.

Different books aren't the official reporter. Different books (or online sources) that are intended to be legal references will often include, as anchors in the text, the page numbers from the official reporter at the point in the text where the pages break in the official reporter.

As an example in an online source, consider the Findlaw entry for The Amistad [1]. The heading includes the reference to the official reporter (40 U.S. 518) -- 518 is the page number on which the case starts in the official reporter.

Throughout the text on Findlaw, you'll see blue notes like "[40 U.S. 518, 523]" -- these indicate points of page breaks in the official reporter (they follow the style of standard legal citation, so 40 U.S. 518, 523 marks the beginning point of page 523, in the case beginning at page 518, in volume 40 of the United States Reports.)

[1] http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vo...
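
A rough sketch of how a citation in this style breaks into its parts, assuming only the simple "volume reporter first-page, pin-page" form described above. The `Citation` type and `parse_citation` function are hypothetical names for illustration, and the pattern deliberately ignores years, parallel citations, and other citation forms.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    volume: int                # e.g. 40
    reporter: str              # e.g. "U.S."
    first_page: int            # page on which the case starts, e.g. 518
    pin_page: Optional[int]    # specific page referred to, e.g. 523


# volume, reporter abbreviation, first page, then an optional ", pin page"
_CITE_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)(?:\s*,\s*(\d+))?\s*$")


def parse_citation(text: str) -> Citation:
    """Split a 'volume reporter page[, pin]' citation into its fields."""
    match = _CITE_RE.match(text)
    if match is None:
        raise ValueError(f"not a volume/reporter/page citation: {text!r}")
    volume, reporter, first_page, pin_page = match.groups()
    return Citation(
        int(volume),
        reporter,
        int(first_page),
        int(pin_page) if pin_page else None,
    )


if __name__ == "__main__":
    print(parse_citation("40 U.S. 518, 523"))
```

The pin page is optional, so the same function handles a citation to the case as a whole ("40 U.S. 518") and a citation to a point inside it.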


In some texts, there are insane attempts to not change numbers between versions. If pages are removed, those page numbers are not used. If text is added, they add sub-page numbers, like sub-bullets in an outline. Some legal tomes are in binders, not hard-bound, to accommodate this.

This leads to page numbers like 247.1151a-iii. Which is then the canonical page number for a block of text.

It's enough to make the Library of Congress filing system seem simple and rational!!


I have been in charge of maintaining binders that work that way. It's a pain, especially if updates to the same section arrive out of chronological order, and you're using a shredder . . .


> Lawyers refer to a case by the book and page number, and refer to parts of a case by the page numbers. It's probably not a great system [...]

It's a fantastic system (and a wonderful demonstration of the advantages of immutable graph data structures). You can read cases from a century ago and still find every cited reference. Hyperlinking pales in comparison.


It seems to me, they solve different problems: one is about finding a document Foo, the other is about finding information in a document Foo, assuming you already have Foo.

The hyperlink equivalent would be "you can find this book at this library in this town, as of the time this was written".

I imagine that in real life lawyers don't need to worry about that sort of instruction because those books are printed and widely distributed across the country, while on the web documents tend to exist in only a single location, maintained by a single person or organization. USENET might be the closest on the internet we have gotten to that sort of distribution model.


Citations can refer to the document as a whole. E.g. 410 U.S. 113 (1973) refers to the case Roe v. Wade. It's equivalent to http://en.wikipedia.org/wiki/Roe_v._Wade. 410 U.S. 113, 153 is a pin-cite to the quote "This right of privacy, whether it be founded in the Fourteenth Amendment's concept of personal liberty and restrictions upon state action, as we feel it is, or, as the District Court determined, in the Ninth Amendment's reservation of rights to the people, is broad enough to encompass a woman's decision whether or not to terminate her pregnancy." It's similar to http://en.wikipedia.org/wiki/Roe_v._Wade#Right_to_privacy.

The difference is that a citation is a reference to an immutable object. No matter what happens to Roe v. Wade, say it is overruled, the document at 410 U.S. 113 won't change. A URL on the other hand is a reference to a location. Like with the memory address pointed-to by a C pointer, what is at that location can change.

One can imagine a system of hyperlinks that behaved differently. In this system, a URL would uniquely identify an immutable document. A new version of a document would get a new URL, and servers would be required by the protocol to preserve all previous versions of the document. This is essentially the premise of Git: every blob is stored forever and different revisions are stored as deltas such that older versions always remain accessible.
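
To make the idea concrete, here is a minimal sketch of a store in which the identifier of a document is derived from its content, so an identifier can never come to point at different text. `ContentStore` is a hypothetical name, and hashing the raw bytes with SHA-256 is an assumption made for illustration; it is not Git's actual object format.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the hash of the document."""

    def __init__(self):
        # Maps hex digest -> document bytes; entries are never overwritten.
        self._blobs = {}

    def put(self, document: bytes) -> str:
        """Store a document and return its permanent identifier."""
        digest = hashlib.sha256(document).hexdigest()
        # Same content always yields the same digest, so storing it twice
        # is a no-op; different content gets a different identifier.
        self._blobs.setdefault(digest, document)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the document for an identifier (KeyError if unknown)."""
        return self._blobs[digest]


if __name__ == "__main__":
    store = ContentStore()
    v1 = store.put(b"first version of the page")
    v2 = store.put(b"revised version of the page")
    # The old identifier still resolves to the old text after the revision.
    print(store.get(v1))
    print(v1 != v2)
```

A "new version" here is simply a new identifier; nothing that already cites the old identifier is affected.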


One can, and one should. We should build this system, keeping in mind the lessons of Project Xanadu.

http://mesh.is/draft


It's not really hyperlinking that fails, but the fact that web sites (the ones that don't disappear without a trace, that is) don't maintain their structure over time. Many don't even attempt to provide meaningful forwards, instead just dumping you back to the front page. Microsoft and Oracle are particularly egregious offenders.

What we need is a technical mechanism for embedding referenced source documents into a document, in a way that makes it as easy as hyperlinking to add the references and follow them. Probably a new fair-use provision in copyright law as well.


I think it's fair to say that hyperlinking fails. Sites not maintaining their structure over time would be the cause of the failure, but the link still doesn't work... Semantics maybe, but eh.

Agreed on having a way to embed documents. Especially in a way that supported some kind of signing. If I could have a reputable third party (web.archive.org or anybody else) sign an embedded snippet of a page and say "Yes, this was actually posted on X website at Y time" that would be fantastic.
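
A very rough sketch of what such an attestation record could look like. Everything here is hypothetical: the field names, the `sign_snippet` and `verify_snippet` functions, and above all the use of an HMAC, which is only a standard-library stand-in. With an HMAC the verifier must hold the same secret key as the signer, so a real third-party service would need a public-key signature instead.

```python
import hashlib
import hmac
import json


def _payload(record: dict) -> bytes:
    # Serialize deterministically so signer and verifier hash the same bytes.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_snippet(secret_key: bytes, url: str, captured_at: str, snippet: str) -> dict:
    """Return a record stating 'this text was seen at this URL at this time'."""
    record = {"url": url, "captured_at": captured_at, "snippet": snippet}
    record["signature"] = hmac.new(secret_key, _payload(record), hashlib.sha256).hexdigest()
    return record


def verify_snippet(secret_key: bytes, record: dict) -> bool:
    """Check that none of the signed fields were altered after signing."""
    claimed = str(record.get("signature", ""))
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    expected = hmac.new(secret_key, _payload(unsigned), hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)


if __name__ == "__main__":
    key = b"demo-key-not-a-real-secret"  # placeholder key for illustration only
    rec = sign_snippet(key, "http://example.com/post", "2014-01-08T23:36:42Z", "quoted text")
    print(verify_snippet(key, rec))
```

The point of the sketch is only that the URL, the timestamp, and the quoted text are signed together, so changing any one of them invalidates the record.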


Hyperlinks are just interactive references. Hyperlinks don't care about the nature of the reference. It is quite possible to have a hyperlink with its reference set as a book/page number.


Indeed, Google Books actually does this.


A few questions come to mind that aren't answered by perusing the site, though I haven't yet looked at the downloadable files.

1) How comprehensive are the court decisions? For example, which Federal Courts are covered and for what time periods? If there are variations in coverage of state courts, what are the high, low, and typical cases of coverage, both for dates and court levels?

2) How was the court decision data obtained? I was under the impression that there were significant obstacles to obtaining much of this data since, although statutes are available freely online for many jurisdictions, access to court decisions is typically very costly. I once paid several hundred dollars for a month's access to NYS court decisions and I believe that service no longer exists, having been replaced by much more expensive long-term plans that are out of reach of anyone except law firms or large corporations.

Getting court decisions online for free or at an affordable cost would be of great benefit to anyone needing/wanting to do legal research in the US and would help improve the increasingly dismal state of democracy in this country.


When I get a chance to finish the 6GB of the District Courts I'll try and remember to come back and update you all

However, I can say that I extracted my state's cases and it came out to 100k+ so it's certainly more comprehensive than any other free data source (for my state)

EDIT: Downloads went down (Dropbox) right as I posted so I reached out to author to see how to get my hands on the rest of the files

FINAL EDIT: See my note below about him being on vacation and taking a look when he gets back near better internet.


This is awesome, thank you for sharing these. I was able to get a couple states, but it looks like Dropbox has cut you off.

  Connecting to dl.dropboxusercontent.com
  (dl.dropboxusercontent.com)|23.23.88.93|:443... connected.
  HTTP request sent, awaiting response... 509 Bandwidth Error
  2014-01-08 23:36:42 ERROR 509: Bandwidth Error.

Dropbox limits personal accounts to 20GB per day of public sharing.

https://www.dropbox.com/help/45/en


He said he pays for a pro account, not sure how much they offer but we exceeded that apparently.


A sincere question: is the collation of this material and its publication /actually/ dedicated to Aaron Swartz (as I see https://www.google.com/#q=aaron+swartz+site:webpolicy.org&sa... shows zero results), or is that rather editorializing / opining by the submitter?


It was posted by a brand new account, and this is the only activity on the account. Probably doesn't know the rule about changing titles, and stuck in the Swartz reference to get more people to click.

I'm a little surprised the moderators haven't fixed it yet.


It looks like Dropbox just shut down the account's transfers. I take it a couple dozen people started mass-downloading each state's data. This would be the ideal use for a torrent network, right?


I've emailed Jonathan at his Stanford email to ask him to put it up as a torrent

Will update if I hear back


UPDATE

I just heard back and he's on vacation (go figure!).

He says he thought he had a pretty generous amount of traffic but he'll take a look when he can.


Torrents are definitely the way to deal with data this size.


See http://freelawproject.org/ for excellent free/open source legal data.


I don't think it provides state law, does it? Or state court decisions? Only federal court opinions, right?


Free Law Project does have some state courts.


I hope those working on this type of material consider Akoma Ntoso (http://www.akomantoso.org/), currently being standardized as OASIS LegalDocML (http://www.oasis-open.org/committees/legaldocml), and maybe OASIS LegalXML (https://www.oasis-open.org/committees/legalxml-courtfiling).

I think Akoma Ntoso would make bulk access, maybe even piecemeal API access as with other similar works like this, easier for consumption (think NLTK). The Italian Senate (the Senate in Rome) uses it, the Library of Congress has introduced some "data challenges" using it as well, and I think it is the future. Using a common data format / XML schema has its advantages.


Is this related to the XML data collected by public.resource.org? E.g. Supreme Court decisions are available in XML here: https://bulk.resource.org/courts.gov/c/US/


I don't know, but Public.Resource.Org only has legislation for a few states and territories.


Is there a free or paid service that has this type of data and an API for accessing it?


There's CourtListener and I'm sure several others.

http://freelawproject.org/

I'm building a product and API that would take advantage of the whole collection but it's not ready for primetime yet.


CourtListener already has an API: https://www.courtlistener.com/api/


Thanks... I ask because I wonder if law firms do any data analysis (e.g. sentiment analysis) across all court opinions, judges, attorneys and outcomes. Also whether states have data warehouses to review the application of local and federal laws.


The Administrative Office of the US Courts and the Federal Judicial Center collect and publish some statistics, but they're pretty basic. For example, they collect "case type" as a single field with a single value. Most cases have more than one kind of claim, so this is incredibly under-inclusive. Also there aren't codes for lots of kinds of cases.

http://www.uscourts.gov/FederalCourts/UnderstandingtheFedera... http://www.fjc.gov/

States are a grab bag but generally statistics poor.


Some jurisdictions have something called, by various names, the "Jury Verdict Reporter". If you see people in the Clerk's office (not federal) with laptops, they are copying information off of the case dockets. First instance state tribunals (however they are called) rarely publish their decisions. Some federal district court opinions are published.


A startup, Lex Machina (https://lexmachina.com/), does much of this stuff and sells it to law firms. My understanding is that they've been quite successful.


They don't, but Judicata does (http://www.judicata.com). I don't think their product has been released yet.


If they do, it's not common knowledge and they're being greedy!

But that's the kind of thing I'm working on and I'm sure others are as well.


Robot, Robot & Hwang:

http://www.robotandhwang.com/


http://www.plainsite.org

The API hasn't really been tested much.

http://www.plainsite.org/api

Feedback welcome.


Are there plans for state law for other states besides California and Virginia?


Only if someone decides to contribute--we're working on other kinds of data. You should check out The State Decoded, however, at http://www.statedecoded.com.

The State Decoded takes a more Jeffersonian approach (appropriately perhaps, given that it's based in VA), allowing citizens to code up their own state statutes. PlainSite in contrast is more Hamiltonian: centralized and standardized. There are advantages and disadvantages to each approach.


Not so long ago, this collection would have been priceless. 10-15 years ago there was an article in Wired about efforts to obtain access to case law, which was pretty much locked down by West Publishing and Lexis/Nexis. So a few comments:

1) To make this set usable from a practical point of view you have to know when it starts and finishes. "[E]very federal court ruling" is a bold statement. Federal Reporter Third? All 1000 volumes of F2d? What about the original Federal Reporter? F.Supp.? Not all federal district court decisions are published. Since our federal courts have become criminal courts (starting in the 1980's) most of the written decisions will be at the appellate level. What about "Do Not Publish" opinions? There are thousands of them and they are still useful. Usually only DoJ has copies.

2) Not having everything is not critical to the practical value of the set. In the 1990's a West salesman would tell you that there was no need to buy anything before 500 F2d if you were trying to put together a small federal library. For most states they would try to sell you everything, except perhaps New York, California and a few others. The issue is updating. Florida updates (or used to) its appellate decisions on a monthly basis. You could sign up and they would send you a zip file every month. I don't know if all states do this. The problem of recency is a major one. A case could have been decided yesterday but you won't find out about it for a month. You can fix the problem on appeal--theoretically, assuming a client who wants to pay--because judges will not, except in rare cases, revisit older decisions they have made because case law that was not available at the time was dispositive.

3. The issue of citing to a particular page of a decision in addition to the official citation is not a huge problem. In many states, appellate decisions are relatively short and court rules have provided for the use of just the official citation. Cites to new Westlaw and Lexis cases do not have page numbers. When page numbers are unavailable, you can cite them as ( U.S. )(2014) (my Blue Book syntax is probably a little off here). If you cite an unpublished opinion you normally have to provide the judge and your counterparty a copy of the decision.

4. FLITE was the U.S. Air Force's effort to computerize case law in the 1980's. Westlaw and Lexis fought ferociously to prevent this database from being released to the public. They were successful. The same is the case with JURIS, a DoJ caselaw database. Now there are several providers (such as Fastcase) which compete with Westlaw and Juris. Access to PACER, the U.S. courts database of cases, is limited. Efforts to mass download the database have been frustrated. The courts use PACER as a revenue tool. Also, criminal cases at the district court level are not on PACER (unless this has changed), supposedly to protect informants. So it would be interesting to know how this database was obtained.

5. Putting aside the practical value of this database, once the extent of the content is established, it could have real value for researchers. Could it be used to spot trends in the law? I wonder what might be shown if tools to measure things like historical market performance were used to analyze the database. You could see all sorts of data points for terms like "Dalkon Shield" or "asbestos" occurring within specific time ranges. There is definitely a "me too" aspect to the law. And while judges make law all the time, they have no control (usually) over the cases brought to them. Do cases involving "terrorists" match the pattern of cases involving "communists"? Or, say in the period 1910-1920, "Germans"? On a practical level, what is the statistical incidence of cases involving the Statute of Frauds? The "ancient document" exception to the hearsay rule? Are criminal conversation causes of action really coming back? If historically the incidence of data points A, then B, always led to C, can an analysis of such points today be of any use in predicting future decisions?
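
As a small illustration of the term-over-time idea, the sketch below counts how many opinions in each decade mention a given term. It assumes the opinions have already been extracted into (year, text) pairs; the function name `term_counts_by_decade` is hypothetical, and the sample rows are made-up placeholders, not real cases or real data from the collection.

```python
from collections import Counter


def term_counts_by_decade(opinions, term):
    """Count opinions per decade whose text mentions `term` (case-insensitive).

    `opinions` is any iterable of (year, text) pairs.
    """
    needle = term.lower()
    counts = Counter()
    for year, text in opinions:
        if needle in text.lower():
            counts[(year // 10) * 10] += 1  # 1978 -> 1970
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    sample = [
        (1912, "placeholder opinion mentioning asbestos"),
        (1975, "another placeholder about asbestos exposure"),
        (1978, "Asbestos appears here too"),
        (1979, "unrelated placeholder opinion"),
    ]
    print(term_counts_by_decade(sample, "asbestos"))  # {1910: 1, 1970: 2}
```

Raw counts like these would still need to be divided by the total number of opinions per decade before they say anything about trends, since the volume of published decisions is not constant over time.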

Just a few thoughts.


A major part of the work at hand:

> federal and state law

Your entire post is about:

> case law

Suffice to say, those in the biz tend to focus mostly on case law, probably because they know the basics of statutory and common law, case law is not codified, and (all) case law is not offered for free even on horribly designed, practically useless government websites.

But, at least for my purposes, state legislation and statutory codifications (and their regulatory counterparts) are also very important to have bulk access to. Even outdated materials are useful, as once I know a section is relevant (because I can run complicated queries against entire datasets), I can begin research using more arcane methods (such as government websites and printed materials.) My treks through the California Codes and the California Code of Regulations would not have been possible without bulk access (I had to do it myself of course, after the Legislative Counsel got forced by CFAC/FAC and MAPLight.org to release the DB), and there was little case law on the issues I was doing research on (that I knew or know of) to guide me.

Outdated material may not be so useful for practicing lawyers, but it's extremely useful for the 99%.


Unfortunately, in the U.S., statutory law is of limited utility without the case law which, in interpreting it, modifies it. I agree that there are all sorts of forgotten gems to be mined--such as when the U.S. Congress established a church.


Unfortunately, the case law is also of limited utility without the statutory/regulatory law underlying it.


Statutes play less of a role than you would think. I'm not saying that they are unnecessary.


Yep, spot on.

> it could have real value for researchers.

Yes, historian researchers but not legal researchers, I'm afraid.


I don't know about that. Legal research is about more than just answering a question like "does interspousal testimonial immunity apply in states which have not approved gay marriage when an out of state witness is called pursuant to a national federal criminal subpoena?" For that kind of a question you want to know if there is any recent law at all. So if this database is not kept up, it will be pretty much useless to answer this kind of question, except to the extent you want to get into the reasons for interspousal immunity in the first place.


> So if this database is not kept up, it will be pretty much useless to answer this kind of question.

Umm, exactly my point. Maybe I wasn't too clear.


I'd love for the title of this post to be accurate, but it's not. Many court rulings are available only in sliced tree format for a modest copying fee from a clerk behind a glass window. Sorry.


This is fantastic! Now you need a script to convert the daily slip opinions from the courts and the updated statutes from the legislatures to add them to the databases to keep the content current.


Has someone mirrored this?


Links are down...



