Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

- Most of the titles have incorrectly split words, for example "P ART 2—R EPEAL OF EPA R ULE R ELATING TO M ULTI -P OLLUTANT E MISSION S TANDARDS". I know LLMs are resilient against typos and mistakes like this, but it still seems not ideal.

- The header is parsed in a way that I suspect would mislead an LLM: "BRETT GUTHRIE, KENTUCKY FRANK PALLONE, JR., NEW JERSEY CHAIRMAN RANKING MEMBER ONE HUNDRED NINETEENTH CONGRESS". Guthrie is the chairman and Pallone is the ranking member, but that isn't implied in the text. In this particular case an LLM might already know that from other sources, but in more obscure contexts it will just have to rely on the parsed text.

- It isn't converted into Markdown at all, the structure is completely lost. If you only care about text then I guess that's fine, and in this case an LLM might do an ok job at identifying some of the headers, but in the context of this discussion I think ai.toMarkdown() did a bad job of converting to Markdown and a just ok job of converting to text.

I would have considered this a fairly easy test case, so it would make me hesitant to trust that function for general use if I were trying to solve the challenges described in the submitted article (Identifying headings, Joining consecutive headings, Identifying Paragraphs).

I see that you are trying to minimize tokens for LLM input, so I realize your goals are probably not the same as what I'm talking about.

Edit: Another test case, it seems to crash on any Arxiv PDF. Example: https://pure.md/https://arxiv.org/pdf/2411.12104.




> it seems to crash on any Arxiv PDF

Fixed, thanks for reporting :-)




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: