Current OCR approaches typically rely on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.
Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.
Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
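As a rough sketch of that second approach: the question and the raw table image go to the VLM together, with no linearization step. The snippet below uses the OpenAI Python SDK purely for illustration; the model name, file path, and question are placeholders, and any vision-capable model client would work the same way.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x); any VLM client works similarly

def ask_table_image(image_path: str, question: str) -> str:
    """Send the question plus the original table image to a VLM, skipping 1D linearization."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (hypothetical file and question): reason over the table in its native 2D form.
# print(ask_table_image("revenue_table.png", "Which region had the highest Q3 revenue?"))
```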
We are trying to solve a similar problem: putting long documents in context. We built an MCP for Claude that lets you work with long PDFs that go beyond the context window limit: https://pageindex.ai/mcp.
This isn't about difficulty; it's about maintaining knowledge during a long execution.
What is IMO exciting is long-term memory in things like Claude Code, where the model could learn your preferences as you collaborate. (There is already a hard-disabled implementation of this in CC.)
Thanks for the great question! We actually use a reasoning-based, vectorless approach. In short, it follows this process:
1. Generate a table of contents (ToC) for the document.
2. Read the ToC to select a relevant section.
3. Extract relevant information from the selected section.
4. If enough information has been gathered, provide the answer; otherwise, return to step 2.
We believe this approach closely mimics how a human would navigate and read long PDFs.
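To make the loop concrete, here is a minimal sketch in Python. It assumes the ToC and section texts from step 1 are already available; the prompts, helper names, and model choice are illustrative, not PageIndex's actual implementation.

```python
import json
from openai import OpenAI  # any chat-completion client works; OpenAI SDK used for illustration

client = OpenAI()

def llm(prompt: str) -> str:
    """Single LLM call; the model name is a placeholder."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_with_toc(question: str, toc: list[dict], sections: dict[str, str],
                    max_steps: int = 5) -> str:
    """Reasoning loop over a ToC: select a section, read it, repeat until enough is known.

    `toc` is a list of {"id", "title"} entries and `sections` maps id -> section text;
    both are assumed to come from the ToC-generation step (step 1).
    """
    notes: list[str] = []
    for _ in range(max_steps):
        # Step 2: read the ToC and select the most relevant unread section.
        section_id = llm(
            f"Question: {question}\n"
            f"Table of contents: {json.dumps(toc)}\n"
            f"Notes so far: {notes}\n"
            "Reply with only the id of the most relevant section not yet read."
        ).strip()

        # Step 3: extract relevant information from the selected section.
        notes.append(llm(
            f"Question: {question}\n"
            f"Section text: {sections.get(section_id, '')}\n"
            "Extract only the facts relevant to the question."
        ))

        # Step 4: answer if enough information has been gathered, otherwise loop back to step 2.
        verdict = llm(
            f"Question: {question}\nNotes: {notes}\n"
            "If the notes are sufficient, reply 'ANSWER: <answer>'. Otherwise reply 'CONTINUE'."
        )
        if verdict.startswith("ANSWER:"):
            return verdict.removeprefix("ANSWER:").strip()
    return "Could not gather enough information."
```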
Many RAG systems handle in-document references (like “see appendix for details”) by building graphs or other preprocessing structures. The idea is to make sure cross-references are resolved before retrieval.
But with reasoning-based RAG, you don’t need that extra layer. The LLM itself can read the document, notice the reference, and then “jump” to the appendix (or wherever the reference points) to extract the answer. In other words, instead of pre-building structure, the model reasons its way through the content.
An example of reasoning-based RAG with PageIndex MCP is attached. In this example, the query asks for the total value. The main text only provides the increased value and refers to the appendix table for the total value. The LLM then looks up the appendix to find the total value and explains its reasoning process.
This raises an interesting question: how much preprocessing do we actually need for reasoning-augmented RAG, and when is it better to just let the model figure it out?
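To illustrate that "let the model figure it out" behavior, here is a toy sketch. The document contents, numbers, prompts, and model name are made up; the point is that no reference graph is prebuilt, and the jump to the appendix comes purely from the model's reasoning inside the loop.

```python
from openai import OpenAI  # any chat-completion client works; used here for illustration

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Toy document (made-up numbers): the main text defers to the appendix,
# mirroring the attached example.
sections = {
    "3. Results": "Revenue increased by $2.1M year over year. "
                  "See Appendix A for the total figures.",
    "Appendix A": "Total revenue: $14.6M (FY2024).",
}
toc = list(sections.keys())

def follow_references(question: str, max_hops: int = 5) -> str:
    """Let the model jump between sections until it can answer; no reference graph is built."""
    current, notes = toc[0], []
    for _ in range(max_hops):
        notes.append(sections.get(current, ""))
        step = llm(
            f"Question: {question}\nRead so far: {notes}\nToC: {toc}\n"
            "Reply 'GOTO <section title>' to follow a reference, "
            "or 'ANSWER <text>' once you can answer."
        )
        if step.startswith("ANSWER"):
            return step.removeprefix("ANSWER").strip()
        current = step.removeprefix("GOTO").strip()
    return "No answer found."

# follow_references("What is the total revenue?") reads "3. Results", notices the
# pointer to Appendix A, jumps there, and answers with the total.
```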
This paper introduces a method that lets you fine-tune black-box embedding models (e.g., vectors obtained from the ChatGPT API). It shows around a 10% improvement across various domains. Any feedback is welcome.
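For readers unfamiliar with the setting, the sketch below shows one generic way to adapt frozen, black-box embeddings: cache the API vectors and train a small adapter on top of them with an in-batch contrastive loss. This is only an illustration of the problem setup, not the paper's actual method; the dimensions, hyperparameters, and random tensors are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAdapter(nn.Module):
    """Small trainable layer on top of frozen, API-provided embeddings."""
    def __init__(self, dim: int = 1536):  # 1536 is a typical API embedding size (assumption)
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def info_nce_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each query's positive document shares its row index."""
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

adapter = EmbeddingAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Placeholder batch; in practice, load cached black-box embeddings for (query, relevant doc) pairs.
query_emb = torch.randn(32, 1536)
doc_emb = torch.randn(32, 1536)

for _ in range(100):
    loss = info_nce_loss(adapter(query_emb), adapter(doc_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```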
Check out this MCP: https://pageindex.ai/mcp, which allows you to chat with any long PDF (hundreds of pages) beyond the context limit of Claude or ChatGPT.
The word index originally came from how humans retrieve information: book indexes and tables of contents that guide us to the right place.
Computers later borrowed the term for data structures such as B-trees, hash tables, and more recently, vector indexes. They're highly efficient for machines, but also abstract and unnatural: not something a human, or an LLM, can directly use as a reasoning aid. This creates a gap between how indexes work for computers and how they should work for models that reason like humans.
PageIndex is a new step that looks back to move forward. It revives the original, human-oriented idea of an index and adapts it for LLMs. Now the index itself (PageIndex) lives inside the LLM's context window: the model sees a hierarchical table-of-contents tree and reasons its way down to the right span, much like a person would retrieve information using a book's index.
PageIndex MCP shows how this works in practice: it runs as an MCP server, exposing a document's structure directly to LLMs. This means platforms like Claude, Cursor, or any MCP-enabled agent can navigate the index themselves and reason their way through documents, not with vectors or chunking, but in a human-like, reasoning-based way.
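For a sense of what the model actually sees, here is an illustrative shape of such a ToC tree as it might sit in the context window; the field names and values are assumptions, not PageIndex's exact schema.

```python
# Illustrative hierarchical ToC tree placed in the model's context
# (field names and values are assumptions, not PageIndex's exact schema).
toc_tree = {
    "title": "Annual Report 2024",
    "nodes": [
        {"node_id": "1", "title": "1. Overview", "pages": [1, 4], "nodes": []},
        {"node_id": "2", "title": "2. Financial Statements", "pages": [5, 30], "nodes": [
            {"node_id": "2.1", "title": "2.1 Income Statement", "pages": [5, 12], "nodes": []},
            {"node_id": "2.2", "title": "2.2 Balance Sheet", "pages": [13, 20], "nodes": []},
        ]},
        {"node_id": "A", "title": "Appendix A: Detailed Tables", "pages": [31, 48], "nodes": []},
    ],
}
# The model reads this tree, picks a node_id (e.g. "2.2"), and only then is that node's
# page range fetched: retrieval by reasoning over structure, not by vector similarity.
```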
Hi, we generate a table of contents (ToC) for each document, so the LLM can use the ToC to navigate long documents, which bypasses the context limit. You can check out this notebook for a quick tutorial on our method: https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex