
RAG Doesn't Solve Hallucination. What Actually Does.

Kenotic Labs · April 7, 2026 · 8 min read


Legal AI tools using RAG hallucinate 17-33% of the time. The problem isn't the model. It's what retrieval actually returns, and what it can't.

RAG reduces hallucination compared to base models, but it doesn't solve it. RAG retrieves similar text chunks. It doesn't know what's current vs. outdated, can't disambiguate overlapping contexts, and can't reconstruct the state of a situation. The actual fix is deterministic reconstruction: a write-path architecture that structures information at storage time so the right context can be rebuilt, not searched for.

RAG was supposed to fix hallucination. Give the model access to real documents. Ground its responses in retrieved facts. Problem solved.

Except it didn't solve it. Stanford's 2025 study found that LexisNexis and Westlaw, two of the most sophisticated RAG-based legal research tools on the market, hallucinate between 17% and 33% of the time. Westlaw's AI-Assisted Research is accurate on just 42% of queries.

These aren't toy demos. These are production tools used by lawyers making decisions that affect people's lives. And between one in six and one in three responses is wrong.

RAG helped. It didn't solve the problem. Understanding why requires looking at what retrieval actually does and what it doesn't.

What Does RAG Actually Do?

RAG works in three steps:

  1. Chunk: Split documents into pieces (typically 200-500 tokens each)
  2. Embed: Convert each chunk into a vector (a numeric representation of its meaning)
  3. Retrieve: When a query comes in, find the chunks whose vectors are closest to the query vector, and feed them to the model as context

The model then generates a response using those retrieved chunks as grounding.
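The three steps above can be sketched in a few lines. This is a toy illustration, not any real library's API: the bag-of-words "embedding" stands in for a trained encoder, and the helper names (`chunk`, `embed`, `retrieve`) are invented here.

```python
# Minimal sketch of the chunk/embed/retrieve pipeline. A real system
# would use a neural embedding model and a vector index; a word-count
# vector with cosine similarity stands in for both here.
import math
from collections import Counter

def chunk(document: str, size: int = 40) -> list[str]:
    # Step 1: split the document into fixed-size pieces
    # (by words here, rather than tokens, for simplicity).
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Step 2: convert text into a vector representation.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Step 3: return the k chunks whose vectors are closest to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Even in this stripped-down form, the shape of the operation is visible: everything happens on the read path, and similarity is the only criterion.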

This works well for a specific class of question: "Find me something relevant to this query." If the answer exists in a single chunk and the embedding correctly captures its relevance, RAG does its job.

But that's a narrow class of question.

Why Does RAG Still Hallucinate?

RAG fails in predictable ways. Recent research catalogs multiple distinct root causes. The most structural:

1. Chunk splitting destroys context. If a fact spans two chunks, neither chunk contains the complete answer. One bad chunk split can ruin relevance. Documents get sliced in ways that break semantic units, split related concepts, and create fragments too small to be meaningful.

2. Semantic similarity isn't semantic correctness. Vector search finds text that sounds similar to the query. It doesn't verify that the retrieved text actually answers it. "Terminating an employee" and "terminating a software process" are semantically similar. They are not the same thing.

3. Lost in the middle. Even when RAG retrieves the right chunks, models struggle to use them. Research shows a U-shaped performance curve. Models attend to the beginning and end of context but degrade significantly on information in the middle. More retrieved chunks can actually make accuracy worse.

4. No temporal awareness. RAG retrieves chunks regardless of when they were written. If a fact was updated three times, RAG might return the outdated version because its embedding is closer to the query. There's no concept of "this supersedes that."

5. No disambiguation. If your system serves multiple users or contexts, vector search returns the closest vectors regardless of who they belong to. Unless retrieval is explicitly filtered by metadata, two users with similar situations can get each other's data.

6. Retrieval is not reconstruction. RAG answers "what text is similar to this query?" It cannot answer "what is the current state of this situation?" Those are fundamentally different operations.
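Failure mode 4 is easy to demonstrate with invented data. In this sketch, a simple word-overlap score stands in for vector similarity; the policy chunks and dates are hypothetical.

```python
# Similarity search has no notion of supersession: an outdated chunk
# wins whenever its wording happens to match the query better than
# the chunk that replaced it.

def overlap(query: str, text: str) -> int:
    # Stand-in for vector similarity: count shared words.
    return len(set(query.lower().split()) & set(text.lower().split()))

chunks = [
    ("2023-01", "remote work allowed two days per week"),
    ("2024-06", "policy update: office attendance now required daily"),
]

query = "how many days of remote work are allowed per week"
best = max(chunks, key=lambda c: overlap(query, c[1]))
# The 2023 chunk shares six words with the query; the 2024 update
# shares none. Retrieval confidently returns the superseded policy.
```

Nothing in the store says "the second entry replaces the first," so no amount of ranking tuning can recover that relationship.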

What's the Difference Between Retrieval and Reconstruction?

This is the core distinction:

Retrieval searches a corpus and returns similar chunks. It's a read-path operation. The data is stored however it was stored, and search happens at query time.

Reconstruction rebuilds the current state of a situation from structured traces. It's a write-path-first operation. Data is decomposed and structured at storage time so that the right context can be deterministically assembled later.

| | RAG (Retrieval) | Deterministic Reconstruction |
| --- | --- | --- |
| When structuring happens | Query time (search) | Write time (decomposition) |
| What it returns | Similar text chunks | The current state of a situation |
| Update handling | Old and new chunks coexist | Old state is superseded; current state is authoritative |
| Disambiguation | Returns all similar vectors regardless of source | Traces are scoped to each user/context |
| Temporal ordering | No awareness of sequence | Tracks what happened when and what's still active |
| Hallucination source | Wrong chunk retrieved; model confabulates | Deterministic: either the trace exists or it doesn't |
| Fails when | Query doesn't match available chunk embeddings | Nothing was stored (explicit failure, not silent) |

The difference: when RAG fails, it returns something that sounds right but isn't. When reconstruction fails, it returns nothing, because the data either exists in structured form or it doesn't. Silent hallucination vs. explicit absence.
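The contrast in failure behavior can be shown directly. Both functions below are hypothetical stand-ins: word overlap substitutes for vector similarity, and a keyed dictionary substitutes for a structured trace store.

```python
def word_overlap(query: str, text: str) -> int:
    # Stand-in for vector similarity: count shared words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_nearest(query: str, chunks: list[str]) -> str:
    # Read-path retrieval always returns *something*: the
    # least-dissimilar chunk, even when nothing in the corpus
    # actually answers the query.
    return max(chunks, key=lambda c: word_overlap(query, c))

def reconstruct(state: dict, key: str):
    # Write-path store: either the structured trace exists under
    # its key, or the caller sees an explicit absence.
    return state.get(key)

chunks = ["invoice 17 was paid in march"]
print(retrieve_nearest("refund status of order 99", chunks))
# -> "invoice 17 was paid in march" (sounds grounded, answers nothing)

state = {"invoice:17": {"status": "paid", "when": "march"}}
print(reconstruct(state, "order:99"))
# -> None (explicit absence, nothing to confabulate from)
```

The retrieval path hands the model plausible-but-irrelevant text to generate from; the reconstruction path hands it nothing, which the system can surface as "I don't know."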

Why Can't You Just Improve RAG?

The industry is trying. Semantic chunking. Re-ranking. Hybrid search. Agentic RAG. Better embeddings. Each iteration improves accuracy incrementally.

But these improvements are all on the read path. They're trying to get better at finding the right chunk at query time. The fundamental issue is that the data was stored as unstructured text, and no amount of search sophistication fully compensates for that.

Consider: if you store a user's situation as raw conversation logs and then try to retrieve the relevant pieces later, you're depending on embedding similarity to reconstruct meaning. That's probabilistic by nature. Sometimes it works. Sometimes it returns the wrong chunk. Sometimes it returns an outdated version. Sometimes it misses context that spans multiple chunks.

If instead you decompose the interaction at write time, extracting who was involved, what happened, when, what the emotional state was, and what's still active, then reconstruction at read time is deterministic. You're not searching for similar text. You're assembling structured traces that were explicitly stored for this purpose.
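A minimal sketch of that write-path shape, under stated assumptions: the trace fields mirror the who/what/when/emotional-state/active decomposition described above, but the storage layout and supersession rule here are invented for illustration, not DTCM's actual design.

```python
# Write-path-first store: structure is imposed when data is written,
# so the read path is a deterministic lookup, not a similarity search.
from dataclasses import dataclass

@dataclass
class Trace:
    who: str
    what: str
    when: str
    emotional_state: str
    active: bool = True

class TraceStore:
    def __init__(self) -> None:
        self.by_user: dict[str, list[Trace]] = {}

    def write(self, user: str, trace: Trace) -> None:
        # Decomposition happens here, at write time, scoped per user.
        self.by_user.setdefault(user, []).append(trace)

    def resolve(self, user: str, what: str) -> None:
        # Supersession: earlier traces on the same topic become inactive
        # instead of coexisting with their replacement.
        for t in self.by_user.get(user, []):
            if t.what == what:
                t.active = False

    def reconstruct(self, user: str) -> list[Trace]:
        # Read path: assemble the current state for this user only.
        # Either the traces exist or the result is explicitly empty.
        return [t for t in self.by_user.get(user, []) if t.active]
```

Usage: write a trace, resolve it when the situation changes, write the replacement; `reconstruct` then returns only what's still active, and an unknown user yields an empty list rather than someone else's data.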

Where Does Fine-Tuning Fit?

Neither RAG nor fine-tuning solves this problem, because they solve different problems:

  • Fine-tuning changes model behavior: how it writes, what style it uses, what domain it specializes in. It doesn't give the model access to external facts.
  • RAG gives the model access to external facts, but probabilistically, with all the failure modes above.
  • Longer context windows let the model hold more raw text, but performance degrades as context length increases (the same "lost in the middle" effect), and a longer window still resets every session.

None of these address the core issue: how do you maintain structured, updateable, living state across time?

That's not a retrieval problem. That's not a training problem. That's an infrastructure problem. It requires a dedicated layer.

What Is This Layer?

A continuity layer sits between the user and the model. It decomposes information into structured traces at write time (who, what, when, emotional state, active vs. resolved) and reconstructs the current situation from those traces at read time. Not "find similar chunks" but assemble the structured state.

This is the same architectural problem behind ChatGPT forgetting across sessions, Character AI losing your story, chatbots making you repeat yourself, and voice assistants that can't remember yesterday. RAG is deployed in all of them as the "memory" solution. It's insufficient in all of them for the same reasons.

What I Built

At Kenotic Labs, I built a write-path-first deterministic architecture called DTCM (Decomposed Trace Convergence Memory). Every interaction is decomposed into five structured traces at write time. At read time, the system reconstructs situational context from those traces. Deterministically, not probabilistically.

I tested it against ATANT, the first open evaluation framework for AI continuity. 250 narrative stories. 1,835 verification questions. 100% accuracy in isolated mode. 96% at 250-story cumulative scale, with 250 different contexts coexisting in one system, correctly disambiguated.

RAG finds similar chunks. DTCM reconstructs the current state. That's the architectural difference.

Follow the research at kenoticlabs.com

Samuel Tanguturi is the founder of Kenotic Labs, building the continuity layer for AI systems. ATANT v1.0, the first open evaluation framework for AI continuity, is available on GitHub.

The continuity layer is the missing layer between AI interaction and AI relationship.

Kenotic Labs builds this layer.

Get in touch