I always enjoy reading articles like this. I am a .NET developer and recently built an application that parses some of our B2B PDF documents (different vendors, different document structures) and loads the data into a backend database.

It was an interesting project. Text parsing becomes simple once you have worked out the logic for a given layout. There are still many things to watch out for: different versions of PDF are structured differently, some vendors sent HTML documents or flat text, and some documents had no consistent structure at all.

Thanks for sharing!




What did you use for parsing PDFs?


I used the iTextSharp library to convert the PDF files to text and went from there. Once you have the file in text format, you can determine how to parse it, since that is the structure you will see every time for that document. In my case, the documents differ by vendor, hence a different parser per vendor. That's the gist of it.
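
For anyone curious what "a different parser per vendor" can look like, here is a rough sketch in Python rather than .NET. It assumes the PDF has already been converted to plain text. The vendor names, field labels, and line layouts are all made up for illustration and are not the actual documents or code described above.

```python
import re
from typing import Callable, Dict

# Hypothetical vendors and layouts, invented for this sketch.

def parse_acme(text: str) -> dict:
    # Assumes labelled lines such as "Invoice No: A-17" and "Total: 99.50".
    invoice = re.search(r"Invoice No:\s*(\S+)", text)
    total = re.search(r"Total:\s*([\d.]+)", text)
    return {
        "invoice": invoice.group(1) if invoice else None,
        "total": float(total.group(1)) if total else None,
    }

def parse_globex(text: str) -> dict:
    # Assumes one pipe-delimited line of the form "INV|<number>|<total>".
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3 and parts[0] == "INV":
            return {"invoice": parts[1], "total": float(parts[2])}
    return {"invoice": None, "total": None}

# Registry mapping each vendor to the parser written for its layout.
PARSERS: Dict[str, Callable[[str], dict]] = {
    "acme": parse_acme,
    "globex": parse_globex,
}

def parse_document(vendor: str, text: str) -> dict:
    # Pick the parser for this vendor; fail loudly for unknown vendors.
    try:
        parser = PARSERS[vendor]
    except KeyError:
        raise ValueError(f"no parser registered for vendor {vendor!r}")
    return parser(text)
```

Adding a new vendor then means writing one more function and registering it, without touching the existing parsers.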



