The intention is for it to be used for all kinds of research, including research into the efficacy of LLM-based OCR. It shouldn't be too difficult to check the original PDF to substantiate any "fact" someone wants to quote from the dataset.

I'm more interested in seeing whether people can find new insights, questions, or inconsistencies by automatically reviewing the documents at scale.