Current OCR approaches typically rely on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.
Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.
Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
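As a rough sketch of that second approach: the question and the raw table image go to the VLM together, with no linearization step. The snippet below uses the OpenAI Python SDK purely for illustration; the model name, file path, and question are placeholders, and any vision-capable model client would work the same way.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x); any VLM client works similarly

def ask_table_image(image_path: str, question: str) -> str:
    """Send the question plus the original table image to a VLM, skipping 1D linearization."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (hypothetical file and question): reason over the table in its native 2D form.
# print(ask_table_image("revenue_table.png", "Which region had the highest Q3 revenue?"))
```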
We are trying to solve a similar problem: putting long documents in context. We built an MCP for Claude that lets you work with long PDFs that go beyond the context window limit: https://pageindex.ai/mcp.
This isn't about difficulty; it's about maintaining knowledge during a long execution.
What is IMO exciting is long-term memory in things like Claude Code, where the model could learn your preferences as you collaborate. (There is already a hard-disabled implementation of this in CC.)
Thanks for the great question! We actually use a reasoning-based, vectorless approach. In short, it follows this process:
1. Generate a table of contents (ToC) for the document.
2. Read the ToC to select a relevant section.
3. Extract relevant information from the selected section.
4. If enough information has been gathered, provide the answer; otherwise, return to step 2.
We believe this approach closely mimics how a human would navigate and read long PDFs.
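To make the loop concrete, here is a minimal sketch in Python. It assumes the ToC and section texts from step 1 are already available; the prompts, helper names, and model choice are illustrative, not PageIndex's actual implementation.

```python
import json
from openai import OpenAI  # any chat-completion client works; OpenAI SDK used for illustration

client = OpenAI()

def llm(prompt: str) -> str:
    """Single LLM call; the model name is a placeholder."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_with_toc(question: str, toc: list[dict], sections: dict[str, str],
                    max_steps: int = 5) -> str:
    """Reasoning loop over a ToC: select a section, read it, repeat until enough is known.

    `toc` is a list of {"id", "title"} entries and `sections` maps id -> section text;
    both are assumed to come from the ToC-generation step (step 1).
    """
    notes: list[str] = []
    for _ in range(max_steps):
        # Step 2: read the ToC and select the most relevant unread section.
        section_id = llm(
            f"Question: {question}\n"
            f"Table of contents: {json.dumps(toc)}\n"
            f"Notes so far: {notes}\n"
            "Reply with only the id of the most relevant section not yet read."
        ).strip()

        # Step 3: extract relevant information from the selected section.
        notes.append(llm(
            f"Question: {question}\n"
            f"Section text: {sections.get(section_id, '')}\n"
            "Extract only the facts relevant to the question."
        ))

        # Step 4: answer if enough information has been gathered, otherwise loop back to step 2.
        verdict = llm(
            f"Question: {question}\nNotes: {notes}\n"
            "If the notes are sufficient, reply 'ANSWER: <answer>'. Otherwise reply 'CONTINUE'."
        )
        if verdict.startswith("ANSWER:"):
            return verdict.removeprefix("ANSWER:").strip()
    return "Could not gather enough information."
```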
Many RAG systems handle in-document references (like “see appendix for details”) by building graphs or other preprocessing structures. The idea is to make sure cross-references are resolved before retrieval.
But with reasoning-based RAG, you don’t need that extra layer. The LLM itself can read the document, notice the reference, and then “jump” to the appendix (or wherever the reference points) to extract the answer. In other words, instead of pre-building structure, the model reasons its way through the content.
An example of reasoning-based RAG with PageIndex MCP is attached. In this example, the query asks for the total value. The main text only provides the increased value and refers to the appendix table for the total value. The LLM then looks up the appendix to find the total value and explains its reasoning process.
This raises an interesting question: how much preprocessing do we actually need for reasoning-augmented RAG, and when is it better to just let the model figure it out?
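To illustrate that "let the model figure it out" behavior, here is a toy sketch. The document contents, numbers, prompts, and model name are made up; the point is that no reference graph is prebuilt, and the jump to the appendix comes purely from the model's reasoning inside the loop.

```python
from openai import OpenAI  # any chat-completion client works; used here for illustration

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Toy document (made-up numbers): the main text defers to the appendix,
# mirroring the attached example.
sections = {
    "3. Results": "Revenue increased by $2.1M year over year. "
                  "See Appendix A for the total figures.",
    "Appendix A": "Total revenue: $14.6M (FY2024).",
}
toc = list(sections.keys())

def follow_references(question: str, max_hops: int = 5) -> str:
    """Let the model jump between sections until it can answer; no reference graph is built."""
    current, notes = toc[0], []
    for _ in range(max_hops):
        notes.append(sections.get(current, ""))
        step = llm(
            f"Question: {question}\nRead so far: {notes}\nToC: {toc}\n"
            "Reply 'GOTO <section title>' to follow a reference, "
            "or 'ANSWER <text>' once you can answer."
        )
        if step.startswith("ANSWER"):
            return step.removeprefix("ANSWER").strip()
        current = step.removeprefix("GOTO").strip()
    return "No answer found."

# follow_references("What is the total revenue?") reads "3. Results", notices the
# pointer to Appendix A, jumps there, and answers with the total.
```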
This paper introduces a method that lets you fine-tune black-box embedding models (e.g., vectors obtained from the ChatGPT API). It shows around a 10% improvement across various domains. Any feedback is welcome.
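For readers unfamiliar with the setting, the sketch below shows one generic way to adapt frozen, black-box embeddings: cache the API vectors and train a small adapter on top of them with an in-batch contrastive loss. This is only an illustration of the problem setup, not the paper's actual method; the dimensions, hyperparameters, and random tensors are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAdapter(nn.Module):
    """Small trainable layer on top of frozen, API-provided embeddings."""
    def __init__(self, dim: int = 1536):  # 1536 is a typical API embedding size (assumption)
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def info_nce_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each query's positive document shares its row index."""
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

adapter = EmbeddingAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Placeholder batch; in practice, load cached black-box embeddings for (query, relevant doc) pairs.
query_emb = torch.randn(32, 1536)
doc_emb = torch.randn(32, 1536)

for _ in range(100):
    loss = info_nce_loss(adapter(query_emb), adapter(doc_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```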
Check out this MCP: https://pageindex.ai/mcp, which allows you to chat with any long PDF (hundreds of pages) beyond the context limit of Claude or ChatGPT.
The word index originally came from how humans retrieve information: book indexes and tables of contents that guide us to the right place.
Computers later borrowed the term for data structures such as B-trees, hash tables, and more recently, vector indexes. They're highly efficient for machines, but also abstract and unnatural: not something a human, or an LLM, can directly use as a reasoning aid. This creates a gap between how indexes work for computers and how they should work for models that reason like humans.
PageIndex is a new step that looks back to move forward. It revives the original, human-oriented idea of an index and adapts it for LLMs. Now the index itself (PageIndex) lives inside the LLM's context window: the model sees a hierarchical table-of-contents tree and reasons its way down to the right span, much like a person would retrieve information using a book's index.
PageIndex MCP shows how this works in practice: it runs as an MCP server, exposing a document's structure directly to LLMs. This means platforms like Claude, Cursor, or any MCP-enabled agent can navigate the index themselves and reason their way through documents, not with vectors or chunking, but in a human-like, reasoning-based way.
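For a sense of what the model actually sees, here is an illustrative shape of such a ToC tree as it might sit in the context window; the field names and values are assumptions, not PageIndex's exact schema.

```python
# Illustrative hierarchical ToC tree placed in the model's context
# (field names and values are assumptions, not PageIndex's exact schema).
toc_tree = {
    "title": "Annual Report 2024",
    "nodes": [
        {"node_id": "1", "title": "1. Overview", "pages": [1, 4], "nodes": []},
        {"node_id": "2", "title": "2. Financial Statements", "pages": [5, 30], "nodes": [
            {"node_id": "2.1", "title": "2.1 Income Statement", "pages": [5, 12], "nodes": []},
            {"node_id": "2.2", "title": "2.2 Balance Sheet", "pages": [13, 20], "nodes": []},
        ]},
        {"node_id": "A", "title": "Appendix A: Detailed Tables", "pages": [31, 48], "nodes": []},
    ],
}
# The model reads this tree, picks a node_id (e.g. "2.2"), and only then is that node's
# page range fetched: retrieval by reasoning over structure, not by vector similarity.
```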
Hi, we generate a table of contents (ToC) for each document, so the LLM can use the ToC to navigate long documents, which bypasses the context limit. You can check out this notebook for a quick tutorial on our method: https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex