
We Built the First Benchmark for AI Continuity: 250 Stories, 1,835 Questions

Kenotic Labs · April 7, 2026 · 9 min read


Every AI benchmark measures intelligence. None of them measure whether the system can maintain coherent context across time. So we built one.

ATANT (Automated Test for Acceptance of Narrative Truth) is the first open evaluation framework for AI continuity. It tests whether a system can persist, update, disambiguate, and reconstruct meaningful context across time. 250 narratives across 6 life domains, 1,835 verification questions, 10 checkpoints, 4 compliance levels. No LLM in the evaluation loop. The reference implementation scored 100% in isolated mode and 96% at 250-story cumulative scale.

Why Did We Build This?

AI benchmarks measure intelligence: MMLU, HumanEval, GSM8K, Chatbot Arena. They measure whether a model can answer questions, write code, solve math, generate coherent text.

None of them measure whether the system can remember what you told it yesterday. Or whether it can keep your sister's story separate from your coworker's. Or whether it updates correctly when the facts change. Or whether it can answer "summarize my current situation" instead of just "when is my interview?"

That's continuity. And until ATANT, there was no standard way to test it.

The industry has been building AI memory systems (Mem0, Zep, Letta, LangChain memory modules, vector databases) with no shared framework for evaluating whether they actually work. Each system reports its own metrics, on its own benchmarks, measuring its own definition of "memory."

We needed a shared standard. So we published one.

What Does ATANT Test?

ATANT tests AI continuity through narrative-based evaluation. Instead of synthetic fact pairs or single-turn Q&A, ATANT uses realistic multi-turn conversation narratives, the kind of context AI systems encounter in the real world.

The Test Corpus

| Metric | Value |
| --- | --- |
| Total narratives | 250 |
| Total verification questions | 1,835 |
| Life domains | 6 (Career, Relationships, Health, Learning, Daily Life, Life Events) |
| Testing phases | 5 rounds (50 stories each) |
| Question types | Fact retrieval, temporal ordering, update verification, disambiguation, reconstruction |

Each narrative is a multi-turn conversation that introduces facts, changes them, introduces overlapping entities, and tests whether the system maintains correct state through all of it.

Example: a story introduces a user with a job interview on Tuesday. Three turns later, the interview moves to Thursday. Five turns later, the user mentions their sister also has an interview. The verification questions test:

  • Does the system know the interview is Thursday (not Tuesday)?
  • Does the system distinguish between the user's interview and the sister's?
  • Can the system reconstruct the current situation including both?

These aren't hard questions for a human. They're hard for systems that store raw text and retrieve by similarity.
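To make the structure concrete, here is a minimal sketch of what one such narrative and its verification questions might look like. The field names and the naive update logic are illustrative assumptions, not the published ATANT story schema:

```python
# Hypothetical sketch of an ATANT-style narrative: a fact is introduced,
# updated, and then an overlapping entity (the sister) appears.
# Field names are illustrative, not the published schema.
story = {
    "id": "career-017",
    "domain": "Career",
    "turns": [
        "I have a job interview on Tuesday.",
        "Actually, the interview got moved to Thursday.",
        "My sister also has an interview coming up.",
    ],
    "questions": [
        {"q": "When is the user's interview?", "expected": "Thursday"},
        {"q": "Whose interview is on Thursday?", "expected": "the user's"},
    ],
}

def latest_fact(turns):
    """Naive update handling: the most recent mention of the user's
    interview day wins; the sister's turn must not overwrite it."""
    day = None
    for turn in turns:
        for candidate in ("Tuesday", "Thursday"):
            if candidate in turn and "sister" not in turn:
                day = candidate
    return day
```

Even this toy version shows why similarity-based retrieval struggles: "Tuesday" and "Thursday" are both stored, and only update-aware state tracking returns the right one.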

What Are the 10 Checkpoints?

ATANT evaluates continuity through a sequence of 10 checkpoints, each verifying a specific stage of the write path and read path:

| CP | Name | What It Tests |
| --- | --- | --- |
| CP1 | Classification | Is the input correctly classified (personal fact, event, emotion, etc.)? |
| CP2 | Triple Storage | Are the expected facts stored in the correct structured form? |
| CP3 | Predicted Queries | Does the system generate the right query-answer pairs at write time? |
| CP4 | Object Type Tagging | Are entities correctly typed (person, place, organization, etc.)? |
| CP5 | Query Classification | Is the verification question correctly classified for retrieval? |
| CP6 | Structural Matcher | Does the question match to the correct stored triple? |
| CP7 | DTCM Convergence | Does the convergence gate activate and select the right traces? |
| CP8 | Final Combined | Is the answer correct? The headline metric. |
| CP9 | Temporal System | Are temporal facts (dates, sequences, active/resolved) correct? |
| CP10 | Adaptation Engine | Does the system detect emotional state and adjust? |

CP8 is what matters for compliance. The other checkpoints diagnose where failures occur in the pipeline, which is what makes ATANT a development tool, not just a scorecard.

What Are the 4 Compliance Levels?

| Level | Requirement | What It Proves |
| --- | --- | --- |
| ATANT-Core | 50 stories, isolated mode, 100% CP8 | Basic continuity works |
| ATANT-Stress | 250 stories, isolated mode, 100% CP8 | Continuity generalizes across story types |
| ATANT-Cumulative | 50 stories, cumulative mode, 100% CP8 | Disambiguation works: multiple users, correct separation |
| ATANT-Scale | 250 stories, cumulative mode, 100% CP8 | Disambiguation scales to production levels |

Scoring tiers within each level: Gold (100%), Silver (95-99%), Bronze (90-94%).

The sequence matters. A system that passes ATANT-Scale has proven it can maintain correct, disambiguated, updateable context across 250 coexisting narratives. That's the bar for production continuity.

What Did the Reference Implementation Score?

The first system evaluated against ATANT is NURA, the reference implementation built on DTCM (Decomposed Trace Convergence Memory) at Kenotic Labs.

| Mode | Stories | Questions | CP8 Accuracy |
| --- | --- | --- | --- |
| Isolated (250) | 250/250 | 1,835/1,835 | 100% |
| Cumulative (50) | 50/50 | 304/304 | 100% |
| Cumulative (250) | ~210/250 | 1,761/1,835 | 96% |

What the Results Mean

Isolated 100%, ATANT-Stress: Gold. Every story, every question, every checkpoint. The write path and read path work correctly when each narrative is tested independently.

Cumulative 50 at 100%, ATANT-Cumulative: Gold. 50 different people's narratives coexisting in the same database. The system retrieves the right fact for the right person every time.

Cumulative 250 at 96%, ATANT-Scale: Silver. 250 narratives. The 4% gap comes from predicate disambiguation at extreme scale. When 250 stories coexist, similarly-named predicates from different stories can compete. The Predicate Lexicon and Inverted Scoring Formula have been reducing this steadily.

How We Got Here

The path was not smooth:

| Date | Architecture | Best Score |
| --- | --- | --- |
| Jan 2026 | Legacy pipeline | 58% (50 stories, with LLM in loop) |
| Feb 2026 | Scoring optimizations | 72%, then regressed to 58% |
| Mar 8 | 594 Equation System + DTCM | 100% isolated (50 stories) |
| Mar 12 | 5 rounds complete | 100% isolated (250 stories) |
| Mar 14 | ParsedUtterance pipeline | 100% cumulative (50 stories) |
| Mar 16 | Garbage gate + explanation rescue | 100% cumulative (50), 96% cumulative (250) |

The legacy pipeline hit a ceiling at 58% and suffered from whack-a-mole regressions: fixing one story broke another. That forced the architectural rewrite to the 594 Equation System and DTCM. From that point, every test round passed on the first attempt.

What Makes ATANT Different From Existing Benchmarks?

| | Standard LLM benchmarks | Memory-specific benchmarks | ATANT |
| --- | --- | --- | --- |
| What it tests | Model intelligence | Fact retrieval from stored data | Full continuity: persist, update, disambiguate, reconstruct |
| Test format | Single-turn Q&A | Fact pairs or simple dialogues | Multi-turn narratives with updates, contradictions, overlapping entities |
| LLM in eval loop | Usually yes | Often yes | No: deterministic evaluation, no LLM judges |
| Disambiguation | Not tested | Rarely tested | Core requirement: 250 coexisting narratives |
| Update handling | Not tested | Sometimes | Required: facts change, old state must be superseded |
| Open standard | Varies | Usually proprietary | Open, published, system-agnostic |
| Compliance levels | Pass/fail | Score only | 4 levels with progression sequence |

The critical design decision: no LLM in the evaluation loop. ATANT uses deterministic verification. The expected answer is known, and the system's answer is compared directly. No "LLM-as-judge" subjectivity. A system either gets the right answer or it doesn't.

How Can Other Systems Be Evaluated Against ATANT?

ATANT is system-agnostic. Any AI system claiming to maintain continuity can be evaluated:

  1. Ingest the narrative corpus (250 stories, each as a sequence of user utterances)
  2. Process each utterance through the system's write path
  3. Query with the verification questions
  4. Compare the system's answers to the expected answers
  5. Score against the checkpoint framework
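The five steps above can be sketched as a minimal harness. This is a hypothetical outline under stated assumptions: `write` and `query` stand in for the system under test's actual API, the corpus loader is elided, and the normalization shown is illustrative. Note the deterministic comparison in step 4; no LLM judge is involved:

```python
# A minimal, hypothetical ATANT-style harness. `system` is any object
# exposing write(utterance) and query(question) -- placeholder names
# for whatever API the system under test provides.
def evaluate(system, stories):
    """Ingest each story's utterances through the write path, then score
    verification questions with a deterministic string comparison."""
    correct = total = 0
    for story in stories:
        for utterance in story["turns"]:
            system.write(utterance)            # step 2: write path
        for q in story["questions"]:
            answer = system.query(q["q"])      # step 3: read path
            total += 1
            # step 4: deterministic check, no LLM-as-judge
            correct += answer.strip().lower() == q["expected"].strip().lower()
    return correct / total                     # CP8-style headline accuracy
```

In cumulative mode, the same `system` instance would ingest all 250 stories before any querying, which is exactly what makes disambiguation the binding constraint at scale.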

The full specification, story format schema, and example stories are published on GitHub. The evaluation paper is published on arXiv.

We built ATANT because the industry needs a shared definition of what continuity means and a shared way to measure it. The standard is open specifically so it can be adopted, challenged, and improved by others.

What Comes Next

ATANT v1.0 is the foundation. Future versions will add:

  • Reconstruction quality metrics: not just "is the answer correct" but "how complete and useful is the reconstructed situation"
  • Multi-language narratives, testing continuity across languages
  • Proactive behavior testing: does the system surface relevant context without being asked
  • Decay validation: does the system correctly age and deprioritize stale information
  • Cross-system evaluation: standardized comparison across Mem0, Zep, Letta, and others

The standard grows as the field grows.

Try It

The full framework is on GitHub: github.com/Kenotic-Labs/ATANT

The specification, story format, compliance levels, and evaluation protocol are all published. If you're building an AI memory system, test it against ATANT. If you can pass ATANT-Scale at Gold, you have production-grade continuity.

If you can't, now you know exactly where it breaks.

Follow the research at kenoticlabs.com

Samuel Tanguturi is the founder of Kenotic Labs, building the continuity layer for AI systems.

The continuity layer is the missing layer between AI interaction and AI relationship.

Kenotic Labs builds this layer.
