Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Corpus size doesn't mean much in the context of a PDF, given how variable that can be per page.

I've found Google's Flash to cut my OCR costs by about 95+% compared to traditional commercial offerings that support structured data extraction, and I still get tables, headers, etc from each page. Still not perfect, but per page costs were less than one tenth of a cent per page, and 100 gb collections of PDFs ran to a few hundreds of dollars.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: