I am releasing JFK-TELL, a dataset I generated by extracting text from the scanned PDFs of the assassination records released up to April 2025. I used the Google Gemini LLM API with a very simple prompt to generate Markdown text. For the detailed methodology, check out the GitHub repo at https://github.com/farhanhubble/jfk-tell
Who is independently error-checking this? Given the depth of conspiracy theory around these records, surely purely machine-driven OCR with no independent validation is going to "feed the beast" more than it intends?
The intention is that it should be used for all kinds of research, including research into the efficacy of LLM-based OCR. It shouldn't be too difficult to look at the original PDF to substantiate any "fact" anyone wants to quote from the dataset.
I'm more interested in seeing if people can find new insights, questions, or inconsistencies by reviewing the documents at scale, automatically.
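For anyone who wants to put a number on the validation question raised above, one common approach is to hand-transcribe a small sample of pages and compute the character error rate (CER) of the extracted text against that sample. The following is only a rough sketch: the function names and the sample strings are made up for illustration, and nothing here is taken from the JFK-TELL repo or its methodology.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences are not counted as errors."""
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the hand-checked reference text."""
    reference = normalize(reference)
    hypothesis = normalize(hypothesis)
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings, not taken from the actual records.
    hand_checked = "The subject was seen leaving the building."
    llm_output = "The subject was seen leaving the bui1ding."
    print(f"CER: {character_error_rate(hand_checked, llm_output):.3f}")
```

A low CER on a random sample would not prove any single quoted "fact", but it would give a rough idea of how often the extracted text needs to be checked against the original PDF.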