What if the way we build AI document chatbots today is flawed? Most systems use RAG. They split documents into chunks, create embeddings, and retrieve answers using similarity search. It works in demos but often fails in real use. It misses obvious answers or picks the wrong context. Now there is a new approach called PageIndex. It does not use chunking, embeddings, or vector databases. Yet it reaches up to 98.7% accuracy on tough document Q&A tasks. In this article, we will break down how PageIndex works, why it performs better on structured documents, and how you can build your own chatbot using it.
Here’s the classic RAG pipeline you’ve probably seen a hundred times.
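In code, that pipeline boils down to three moves: chunk, embed, retrieve. Here's a deliberately toy sketch, with a word-count "embedding" standing in for a real model; nothing here is production-grade, it just makes the moving parts concrete:

```python
from collections import Counter
from math import sqrt

def chunk(text, size=40):
    # Naive fixed-size chunking: splits mid-sentence, ignoring document structure
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words frequency vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    # Rank every chunk by similarity to the query, return the top k
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("Section 14.3 covers dissolution of the agreement. "
       "Payment terms are net 30 days from the invoice date.")
top = retrieve("payment terms", chunk(doc))
```

Every real RAG stack is a more sophisticated version of exactly this loop.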
Simple. Elegant. And absolutely riddled with failure modes.
When you slice a document at 512 tokens, you’re not respecting the document’s actual structure. A single table might get split across three chunks. A footnote that’s critical to understanding the main text ends up in a completely different chunk. The answer you need might literally span two adjacent chunks that the retriever picks only one of.
This is the big one. Vector similarity finds text that sounds like your question. But documents often don’t repeat the question’s phrasing when they answer it. Ask “What is the termination clause?” and the contract might just say “Section 14.3 — Dissolution of Agreement.” Low cosine similarity. Missed entirely.
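You can see this failure in miniature with plain word overlap. This is a toy stand-in; real embeddings are softer than exact-word matching, but they fail the same way in kind when the document never echoes the question's vocabulary:

```python
# How the user phrases the question vs. how the contract actually answers it
question = "what is the termination clause".split()
answer_heading = "section 14.3 dissolution of agreement".split()

# Zero shared content words -> near-zero lexical similarity,
# even though this heading IS the answer
shared = set(question) & set(answer_heading)
print(shared)  # set()
```

An LLM reading a table of contents has no such problem: it knows "Dissolution of Agreement" and "termination" mean the same thing.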
You get three chunks back. Why those three? You have no idea. It’s pure math. There’s no reasoning, no explanation, no audit trail. For financial documents, legal contracts, and medical records? That opacity is a serious problem.
A 300-page technical manual with complex cross-references? The sheer number of chunks makes retrieval noisy. You end up getting chunks that are vaguely related instead of the exact section you need.
These aren’t edge cases. These are the everyday failures that RAG engineers spend most of their time fighting. And the reason they happen is actually pretty simple — the entire architecture is borrowed from search engines, not from how humans actually read and understand documents.
When a human expert needs to answer a question from a document, they don’t scan every sentence looking for the one that sounds most similar to the question. They open the table of contents, skim the chapter headings, navigate, and reason about where the answer should be before they even start reading.
That’s the insight behind PageIndex.
PageIndex was built by VectifyAI and open-sourced on GitHub. The core idea is deceptively simple:
Instead of searching a document, navigate it, the way a human expert would.
Here’s the key mental shift. Traditional RAG asks: “Which chunks look most similar to my question?”
PageIndex asks: “Where in this document would a smart human look for the answer to this question?”
Those are two very different questions. And the second one turns out to produce dramatically better results.
PageIndex does this by building what it calls a Reasoning Tree. It is essentially an intelligent, AI-generated table of contents for your document.
Here’s how to visualize it. At the top, you have a root node that represents the entire document. Below that, you have nodes for each major section or chapter. Each of those branches into subsections. Each subsection branches into specific topics or paragraphs. Every single node in this tree has two things: a title with a short AI-generated summary, and the full text of the section it covers.
This tree is built once, when you first submit the document. It’s your index.
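Concretely, a small Reasoning Tree might serialize to something like this. The field names here (node_id, title, summary, nodes) are illustrative, not the literal PageIndex schema:

```python
# A hypothetical Reasoning Tree for an annual report
tree = {
    "node_id": "0000",
    "title": "Annual Report 2024",
    "summary": "Full-year results, risk factors, and outlook.",
    "nodes": [
        {
            "node_id": "0001",
            "title": "Q3 Financial Results",
            "summary": "Revenue, margins, and segment performance for Q3.",
            "nodes": [
                {
                    "node_id": "0012",
                    "title": "Revenue Breakdown",
                    "summary": "Revenue by product line and region.",
                    "nodes": [],
                },
            ],
        },
    ],
}

def count_nodes(node):
    # Walk the hierarchy the way tree search will: parent first, then children
    return 1 + sum(count_nodes(child) for child in node.get("nodes", []))
```

Note that node 0012 "knows" it lives under Q3 results, which lives under the full report. That lineage is exactly what chunking throws away.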
Now here’s where it gets clever. When you ask a question, PageIndex does two things:
It sends the question to an LLM along with the tree, but just the titles and summaries, not the full text. The LLM reads through the tree like a human reads a table of contents, and it reasons: “Okay, given this question, which branches of the tree are most likely to contain the answer?”
The LLM returns a list of specific node IDs, and you can see its reasoning. It literally tells you why it chose those sections. Full transparency.
PageIndex fetches only the full text of those selected nodes, hands it to the LLM as context, and the LLM writes the final answer grounded entirely in the real document text.
Two LLM calls. No embeddings. No vector database. Just reasoning.
And because every answer is tied to specific nodes in the tree, you always know exactly which page, which section, which part of the document the answer came from. Complete audit trail. Complete explainability.
Let me go deeper into the mechanics, because this is the really interesting part.
When you call submit_document(), PageIndex reads your PDF or text file and does something remarkable. It doesn’t just extract text; it understands the structure. Using a combination of layout analysis and LLM reasoning, it identifies the document’s hierarchy: chapters, sections, subsections, and how they nest.
It then constructs the tree and generates a summary for every node. Not just a title. An actual condensed description of what’s in that section. This is what enables the smart navigation later.
The tree uses a numeric node ID system that mirrors real document structure: 0001 might be Chapter 1, 0002 Chapter 2, 0003 the first section inside Chapter 1, and so on. The hierarchy is preserved.
Think about what chunking does to a 50-page financial report. You get maybe 300 chunks, each with zero awareness of whether it’s from the executive summary or a footnote on page 47. The embedder treats them all equally.
The PageIndex tree, on the other hand, knows that node 0012 is the “Revenue Breakdown” subsection under the “Q3 Financial Results” section under “Annual Report 2024.” That structural awareness is enormously valuable when you’re trying to find something specific.
Here’s the other thing that makes PageIndex special. The search step is not a mathematical operation. It’s a cognitive operation performed by an LLM.
When you ask, “What were the main risk factors disclosed in this report?”, the LLM doesn’t measure cosine distance. It reads the tree, recognizes that the “Risk Factors” section is exactly what’s needed, and selects those nodes, just like you would.
This means PageIndex naturally handles the semantic mismatch that kills vector search. The document calls it “Risk Factors.” Your question calls it “main dangers.” A vector search might miss it. An LLM reading the tree structure will not.
PageIndex powered Mafin 2.5, VectifyAI’s financial RAG system, which achieved 98.7% accuracy on FinanceBench. For those unaware, this is a benchmark specifically designed to test AI systems on financial document questions, where the documents are long, complex, and full of tables and cross-references. That’s the hardest environment for traditional RAG. It’s where PageIndex shines most.
PageIndex is particularly powerful for long, structured documents: financial reports and filings, legal contracts, technical manuals full of cross-references, policy handbooks, and medical records.
Basically: anywhere your document has real structure that chunking would destroy.
And the really exciting thing? You can use it with any LLM. OpenAI, Anthropic, Gemini — the tree search and answer generation steps are just prompts. You’re in full control.
Okay. You now know the theory. You know why PageIndex exists, what it does, and how it works under the hood. Now let’s actually build something with it.
I’m going to open a Jupyter notebook and walk you through the complete PageIndex pipeline: uploading a document, getting the reasoning tree back, navigating it with an LLM, and asking questions. Every line of code is explained. No hand-waving.
%pip install -q --upgrade pageindex
First things first. We install the pageindex Python library. One line, done. No vector database to set up. No embedding model to download. This is already simpler than any traditional RAG setup.
import os
import time
import json
from pageindex import PageIndexClient
import pageindex.utils as utils
from dotenv import load_dotenv
load_dotenv()
PAGEINDEX_API_KEY = os.getenv("PAGEINDEX_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)
We import the PageIndexClient. This is our connection to the PageIndex API. All the heavy lifting of building the tree happens on their end, so we don’t need a beefy machine. We also load API keys from a .env file — always keep your keys out of your code.
import openai

async def call_llm(prompt, model="gpt-4.1-mini", temperature=0):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
Here we define our LLM helper function. We’re using GPT-4.1-mini for cost efficiency — but this works with any OpenAI model, and you could swap in Claude or Gemini with a one-line change. Temperature zero keeps the answers factual and consistent.
pdf_path = "/Users/soumil/Desktop/PageIndex/HR Policies-1.pdf"
doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)
This is the magic line. We point to our PDF — in this case an HR policy document — and submit it. PageIndex takes the file, reads its structure, and starts building the reasoning tree in the background. We get back a doc_id, a unique identifier for this document that we’ll use in every subsequent call. Notice there’s no chunking code, no embedding call, no vector database connection.
while not pi_client.is_retrieval_ready(doc_id):
    print("Still processing... retrying in 10 seconds")
    time.sleep(10)
tree = pi_client.get_tree(doc_id, node_summary=True)['result']
utils.print_tree(tree)
PageIndex processes the document asynchronously — we just poll every 10 seconds until it’s ready. Then we call get_tree() with node_summary=True, which gives us the full tree structure including summaries.
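If you want that loop to be a bit more defensive in production, you can add a timeout. The wait_for_ready helper below is my own addition, not part of the PageIndex SDK; it takes the readiness check as a callable so it works with any client:

```python
import time

def wait_for_ready(is_ready, interval=10, timeout=600, sleep=time.sleep):
    """Poll is_ready() until it returns True or `timeout` seconds elapse."""
    waited = 0
    while not is_ready():
        if waited >= timeout:
            raise TimeoutError(f"Document not ready after {timeout}s")
        sleep(interval)
        waited += interval
    return waited

# Usage with the client from above:
# wait_for_ready(lambda: pi_client.is_retrieval_ready(doc_id))
```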
Look at this output. This is the reasoning tree. You can see the hierarchy — the top-level HR Policies node, then Electronic Communication Policy, Sexual Harassment Policy, Grievance Redressal Policy, each branching into its subsections. Every node has an ID, a title, and a summary of what’s in it.
This is what traditional RAG throws away. The structure. The relationships. The hierarchy. PageIndex keeps all of it.
query = "What are the key HR policies and employee guidelines?"
tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])
search_prompt = f"""
You are given a question and a tree structure of a document...
Question: {query}
Document tree structure: {json.dumps(tree_without_text, indent=2)}
Reply in JSON: {{ "thinking": "...", "node_list": [...] }}
"""
tree_search_result = await call_llm(search_prompt)
Now we search. For this, we build a prompt that includes the question and the entire tree — but crucially, without the full text content of each node. Just the titles and summaries. This keeps the prompt manageable while giving the LLM everything it needs to navigate.
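That remove_fields call is doing roughly this under the hood. Here is my own equivalent for illustration; the library's real implementation may differ:

```python
def strip_fields(node, fields):
    # Recursively drop the given keys so only titles and summaries reach the LLM
    if isinstance(node, dict):
        return {k: strip_fields(v, fields) for k, v in node.items() if k not in fields}
    if isinstance(node, list):
        return [strip_fields(item, fields) for item in node]
    return node

slim = strip_fields(
    {"title": "Ch 1", "text": "long body...", "nodes": [{"title": "1.1", "text": "more"}]},
    fields={"text"},
)
```

Dropping the text bodies is what keeps a 300-page document's tree small enough to fit in a single prompt.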
The LLM is instructed to return a JSON object with two things: its thinking process and the list of relevant node IDs.
Look at the output. The LLM tells us exactly why it chose each section. It reasoned through the tree like a human would. And it gave us a list of 30 node IDs — every section of this HR document, because the question is broad.
This transparency is something you simply can’t get with cosine similarity.
tree_search_result_json = json.loads(tree_search_result)
node_list = tree_search_result_json["node_list"]

# Flatten the tree into a node_id -> node lookup
# (assumes each node carries "node_id", "text", and child "nodes" keys)
def index_nodes(node):
    yield node["node_id"], node
    for child in node.get("nodes", []):
        yield from index_nodes(child)
node_map = dict(index_nodes(tree))

relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)
answer_prompt = f"""Answer the question based on the context:
Question: {query}
Context: {relevant_content}"""
answer = await call_llm(answer_prompt)
utils.print_wrapped(answer)
Step two. Now that we know which nodes are relevant, we fetch their full text — only those nodes, nothing else. We join the text and build a clean context prompt. One more LLM call, and we get our answer.
Look at this answer. Detailed, structured, accurate. And every single claim can be traced back to a specific node in the tree, which maps to a specific page in the PDF. Full audit trail. Full explainability.
async def ask(query):
# Full pipeline: tree search → text retrieval → answer generation
...
user_query = input("Enter your query: ")
await ask(user_query)
Now we package the entire pipeline into a single ask() function. Submit a question, get an answer — the tree search, retrieval, and generation all happen under the hood. Let me show you a couple of live examples.
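Here is one way ask() might be composed. This is a sketch with the three steps passed in as callables so the flow itself is easy to test; the actual notebook wires these directly to call_llm and the node lookup shown earlier:

```python
async def ask(query, search, fetch, generate):
    # 1. Tree search: ask the LLM which nodes to read
    node_ids = await search(query)
    # 2. Retrieval: pull the full text of just those nodes
    context = "\n\n".join(fetch(node_id) for node_id in node_ids)
    # 3. Generation: answer grounded entirely in that context
    return await generate(query, context)
```

Keeping the steps separate like this also makes the audit trail explicit: you can log node_ids and context alongside every answer.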
Type a question: e.g., “What are the penalties for sexual harassment?”
Watch what happens. It searches the tree, identifies the Sexual Harassment Policy nodes specifically, pulls their text, and gives us a precise, cited answer in seconds. This is the experience you want to deliver to your users.
Another one. Again, it finds exactly the right section. No confusion, no noise, no hallucination. Just the answer, from the document, with a clear trail showing where it came from.
Let’s bring this together. Traditional RAG finds text that looks similar to a question. But the real goal is to find the right answer in a structured document. PageIndex solves this better. It builds a reasoning tree and lets the model navigate it intelligently. The result is accurate and explainable answers, with up to 98.7% accuracy on FinanceBench. It is not perfect for every use case. Vector search still works well for large scale semantic search. But for long, structured documents, PageIndex is a stronger approach. You can find all the code in the description. Add your API keys and get started.