We Built the First Benchmark for AI Continuity: 250 Stories, 1,835 Questions
Every AI benchmark measures intelligence. None of them measure whether the system can maintain coherent context across time. So we built one.
ATANT (Automated Test for Acceptance of Narrative Truth) is the first open evaluation framework for AI continuity. It tests whether a system can persist, update, disambiguate, and reconstruct meaningful context across time. 250 narratives across 6 life domains, 1,835 verification questions, 10 checkpoints, 4 compliance levels. No LLM in the evaluation loop. The reference implementation scored 100% in isolated mode and 96% at 250-story cumulative scale.
Why Did We Build This?
AI benchmarks measure intelligence: MMLU, HumanEval, GSM8K, Chatbot Arena. They measure whether a model can answer questions, write code, solve math, generate coherent text.
None of them measure whether the system can remember what you told it yesterday. Or whether it can keep your sister's story separate from your coworker's. Or whether it updates correctly when the facts change. Or whether it can answer "summarize my current situation" instead of just "when is my interview?"
That's continuity. And until ATANT, there was no standard way to test it.
The industry has been building AI memory systems (Mem0, Zep, Letta, LangChain memory modules, vector databases) with no shared framework for evaluating whether they actually work. Each system reports its own metrics, on its own benchmarks, measuring its own definition of "memory."
We needed a shared standard. So we published one.
What Does ATANT Test?
ATANT tests AI continuity through narrative-based evaluation. Instead of synthetic fact pairs or single-turn Q&A, ATANT uses realistic multi-turn conversation narratives, the kind of context AI systems encounter in the real world.
The Test Corpus
| Metric | Value |
|---|---|
| Total narratives | 250 |
| Total verification questions | 1,835 |
| Life domains | 6 (Career, Relationships, Health, Learning, Daily Life, Life Events) |
| Testing phases | 5 rounds (50 stories each) |
| Question types | Fact retrieval, temporal ordering, update verification, disambiguation, reconstruction |
Each narrative is a multi-turn conversation that introduces facts, changes them, introduces overlapping entities, and tests whether the system maintains correct state through all of it.
Example: a story introduces a user with a job interview on Tuesday. Three turns later, the interview moves to Thursday. Five turns later, the user mentions their sister also has an interview. The verification questions test:
- Does the system know the interview is Thursday (not Tuesday)?
- Does the system distinguish between the user's interview and the sister's?
- Can the system reconstruct the current situation including both?
These aren't hard questions for a human. They're hard for systems that store raw text and retrieve by similarity.
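To make the shape of a narrative concrete, here is a minimal sketch of one story and its verification questions. The field names (`turns`, `questions`, `q`, `expected`) are illustrative, not the published story schema:

```python
# Hypothetical representation of one ATANT-style narrative.
# Field names are illustrative; the published schema may differ.
story = {
    "turns": [
        "I have a job interview on Tuesday.",
        "Actually, the interview got moved to Thursday.",
        "My sister also has an interview coming up, hers is on Friday.",
    ],
    "questions": [
        {"q": "When is the user's interview?", "expected": "Thursday"},
        {"q": "When is the sister's interview?", "expected": "Friday"},
    ],
}

# A naive "store raw text, retrieve by similarity" system would likely
# surface the first turn for any "interview" query and answer "Tuesday";
# a continuity system must return the superseding fact instead.
for item in story["questions"]:
    print(item["q"], "->", item["expected"])
```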
What Are the 10 Checkpoints?
ATANT evaluates continuity through a sequence of 10 checkpoints, each verifying a specific stage of the write path and read path:
| CP | Name | What It Tests |
|---|---|---|
| CP1 | Classification | Is the input correctly classified (personal fact, event, emotion, etc.)? |
| CP2 | Triple Storage | Are the expected facts stored in the correct structured form? |
| CP3 | Predicted Queries | Does the system generate the right query-answer pairs at write time? |
| CP4 | Object Type Tagging | Are entities correctly typed (person, place, organization, etc.)? |
| CP5 | Query Classification | Is the verification question correctly classified for retrieval? |
| CP6 | Structural Matcher | Does the question match to the correct stored triple? |
| CP7 | DTCM Convergence | Does the convergence gate activate and select the right traces? |
| CP8 | Final Combined | Is the answer correct? The headline metric. |
| CP9 | Temporal System | Are temporal facts (dates, sequences, active/resolved) correct? |
| CP10 | Adaptation Engine | Does the system detect emotional state and adjust? |
CP8 is what matters for compliance. The other checkpoints diagnose where failures occur in the pipeline, which is what makes ATANT a development tool, not just a scorecard.
What Are the 4 Compliance Levels?
| Level | Configuration | What It Proves |
|---|---|---|
| ATANT-Core | 50 stories, isolated mode | Basic continuity works |
| ATANT-Stress | 250 stories, isolated mode | Continuity generalizes across story types |
| ATANT-Cumulative | 50 stories, cumulative mode | Disambiguation works: multiple users' facts stay correctly separated |
| ATANT-Scale | 250 stories, cumulative mode | Disambiguation scales to production levels |
CP8 accuracy within each level determines the scoring tier: Gold (100%), Silver (95-99%), Bronze (90-94%).
The sequence matters. A system that passes ATANT-Scale has proven it can maintain correct, disambiguated, updateable context across 250 coexisting narratives. That's the bar for production continuity.
What Did the Reference Implementation Score?
The first system evaluated against ATANT is NURA, the reference implementation built on DTCM (Decomposed Trace Convergence Memory) at Kenotic Labs.
| Mode | Stories | Questions | CP8 Accuracy |
|---|---|---|---|
| Isolated (250) | 250/250 | 1,835/1,835 | 100% |
| Cumulative (50) | 50/50 | 304/304 | 100% |
| Cumulative (250) | ~210/250 | 1,761/1,835 | 96% |
What the Results Mean
Isolated 100%, ATANT-Stress: Gold. Every story, every question, every checkpoint. The write path and read path work correctly when each narrative is tested independently.
Cumulative 50 at 100%, ATANT-Cumulative: Gold. 50 different people's narratives coexisting in the same database. The system retrieves the right fact for the right person every time.
Cumulative 250 at 96%, ATANT-Scale: Silver. 250 narratives. The 4% gap comes from predicate disambiguation at extreme scale. When 250 stories coexist, similarly-named predicates from different stories can compete. The Predicate Lexicon and Inverted Scoring Formula have been reducing this steadily.
How We Got Here
The path was not smooth:
| Date | Architecture | Best Score |
|---|---|---|
| Jan 2026 | Legacy pipeline | 58% (50 stories, with LLM in loop) |
| Feb 2026 | Scoring optimizations | 72%, then regressed to 58% |
| Mar 8 | 594 Equation System + DTCM | 100% isolated (50 stories) |
| Mar 12 | 5 rounds complete | 100% isolated (250 stories) |
| Mar 14 | ParsedUtterance pipeline | 100% cumulative (50 stories) |
| Mar 16 | Garbage gate + explanation rescue | 100% cumulative (50), 96% cumulative (250) |
The legacy pipeline hit a ceiling at 58% and suffered from whack-a-mole regressions: fixing one story broke another. That forced the architectural rewrite to the 594 Equation System and DTCM. From that point, every test round passed on the first attempt.
What Makes ATANT Different From Existing Benchmarks?
| | Standard LLM benchmarks | Memory-specific benchmarks | ATANT |
|---|---|---|---|
| What it tests | Model intelligence | Fact retrieval from stored data | Full continuity: persist, update, disambiguate, reconstruct |
| Test format | Single-turn Q&A | Fact pairs or simple dialogues | Multi-turn narratives with updates, contradictions, overlapping entities |
| LLM in eval loop | Usually yes | Often yes | No: deterministic verification, no LLM judges |
| Disambiguation | Not tested | Rarely tested | Core requirement: 250 coexisting narratives |
| Update handling | Not tested | Sometimes | Required: facts change, old state must be superseded |
| Open standard | Varies | Usually proprietary | Open, published, system-agnostic |
| Compliance levels | Pass/fail | Score only | 4 levels with progression sequence |
The critical design decision: no LLM in the evaluation loop. ATANT uses deterministic verification. The expected answer is known, and the system's answer is compared directly. No "LLM-as-judge" subjectivity. A system either gets the right answer or it doesn't.
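A deterministic check of this kind reduces to normalizing and comparing strings. A minimal sketch; the normalization rules here (lowercasing, punctuation stripping) are an assumption, not the published protocol:

```python
import string

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation and surrounding whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return answer.translate(table).strip().lower()

def verify(system_answer: str, expected: str) -> bool:
    """Deterministic pass/fail: no LLM judge, no similarity threshold."""
    return normalize(system_answer) == normalize(expected)

print(verify("Thursday.", "thursday"))  # -> True
print(verify("Tuesday", "Thursday"))    # -> False
```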
How Can Other Systems Be Evaluated Against ATANT?
ATANT is system-agnostic. Any AI system claiming to maintain continuity can be evaluated:
- Ingest the narrative corpus (250 stories, each as a sequence of user utterances)
- Process each utterance through the system's write path
- Query with the verification questions
- Compare the system's answers to the expected answers
- Score against the checkpoint framework
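The five steps above amount to a small harness. A sketch, assuming the system under test exposes `write(utterance)` and `query(question)` methods (these interface names are hypothetical):

```python
def evaluate(system, corpus):
    """Minimal ATANT-style loop: ingest each story, then score its questions.

    `system` is assumed to expose write(utterance) and query(question);
    `corpus` is a list of stories shaped like
    {"turns": [...], "questions": [{"q": ..., "expected": ...}]}.
    """
    correct = total = 0
    for story in corpus:
        for utterance in story["turns"]:      # steps 1-2: write path
            system.write(utterance)
        for item in story["questions"]:       # steps 3-5: query, compare, score
            total += 1
            answer = system.query(item["q"])
            correct += answer.strip().lower() == item["expected"].strip().lower()
    return correct / total  # CP8-style accuracy

class TrivialSystem:
    """Toy stand-in for demonstration only."""
    def write(self, utterance): pass
    def query(self, question): return "Thursday"

corpus = [{"turns": ["My interview moved to Thursday."],
           "questions": [{"q": "When is the interview?", "expected": "Thursday"}]}]
print(evaluate(TrivialSystem(), corpus))  # -> 1.0
```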
The full specification, story format schema, and example stories are published on GitHub. The evaluation paper is published on arXiv.
We built ATANT because the industry needs a shared definition of what continuity means and a shared way to measure it. The standard is open specifically so it can be adopted, challenged, and improved by others.
What Comes Next
ATANT v1.0 is the foundation. Future versions will add:
- Reconstruction quality metrics: not just "is the answer correct" but "how complete and useful is the reconstructed situation"
- Multi-language narratives, testing continuity across languages
- Proactive behavior testing: does the system surface relevant context without being asked
- Decay validation: does the system correctly age and deprioritize stale information
- Cross-system evaluation: standardized comparison across Mem0, Zep, Letta, and others
The standard grows as the field grows.
Try It
The full framework is on GitHub: github.com/Kenotic-Labs/ATANT
The specification, story format, compliance levels, and evaluation protocol are all published. If you're building an AI memory system, test it against ATANT. If you can pass ATANT-Scale at Gold, you have production-grade continuity.
If you can't, now you know exactly where it breaks.
Follow the research at kenoticlabs.com
Samuel Tanguturi is the founder of Kenotic Labs, building the continuity layer for AI systems.