
We Built the First Benchmark for AI Continuity: 250 Stories, 1,835 Questions

Kenotic Labs · April 7, 2026 · 9 min read


Every AI benchmark measures intelligence. None of them measure whether the system can maintain coherent context across time. So we built one.

ATANT (Automated Test for Acceptance of Narrative Truth) is the first open evaluation framework for AI continuity. It tests whether a system can persist, update, disambiguate, and reconstruct meaningful context across time. 250 narratives across 6 life domains, 1,835 verification questions, 10 checkpoints, 4 compliance levels. No LLM in the evaluation loop. The reference implementation scored 100% in isolated mode and 96% at 250-story cumulative scale.

Why Did We Build This?

AI benchmarks measure intelligence: MMLU, HumanEval, GSM8K, Chatbot Arena. They measure whether a model can answer questions, write code, solve math, generate coherent text.

None of them measure whether the system can remember what you told it yesterday. Or whether it can keep your sister's story separate from your coworker's. Or whether it updates correctly when the facts change. Or whether it can answer "summarize my current situation" instead of just "when is my interview?"

That's continuity. And until ATANT, there was no standard way to test it.

The industry has been building AI memory systems (Mem0, Zep, Letta, LangChain memory modules, vector databases) with no shared framework for evaluating whether they actually work. Each system reports its own metrics, on its own benchmarks, measuring its own definition of "memory."

We needed a shared standard. So we published one.

What Does ATANT Test?

ATANT tests AI continuity through narrative-based evaluation. Instead of synthetic fact pairs or single-turn Q&A, ATANT uses realistic multi-turn conversation narratives, the kind of context AI systems encounter in the real world.

The Test Corpus

| Metric | Value |
| --- | --- |
| Total narratives | 250 |
| Total verification questions | 1,835 |
| Life domains | 6 (Career, Relationships, Health, Learning, Daily Life, Life Events) |
| Testing phases | 5 rounds (50 stories each) |
| Question types | Fact retrieval, temporal ordering, update verification, disambiguation, reconstruction |

Each narrative is a multi-turn conversation that introduces facts, changes them, introduces overlapping entities, and tests whether the system maintains correct state through all of it.

Example: a story introduces a user with a job interview on Tuesday. Three turns later, the interview moves to Thursday. Five turns later, the user mentions their sister also has an interview. The verification questions test:

  • Does the system know the interview is Thursday (not Tuesday)?
  • Does the system distinguish between the user's interview and the sister's?
  • Can the system reconstruct the current situation including both?

These aren't hard questions for a human. They're hard for systems that store raw text and retrieve by similarity.
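To make the structure concrete, here is a minimal sketch of what one such narrative and its verification questions might look like. The field names and the naive update logic are illustrative assumptions, not the published ATANT story schema:

```python
# Hypothetical sketch of an ATANT-style narrative: a fact is introduced,
# updated, and then an overlapping entity (the sister) appears.
# Field names are illustrative, not the published schema.
story = {
    "id": "career-017",
    "domain": "Career",
    "turns": [
        "I have a job interview on Tuesday.",
        "Actually, the interview got moved to Thursday.",
        "My sister also has an interview coming up.",
    ],
    "questions": [
        {"q": "When is the user's interview?", "expected": "Thursday"},
        {"q": "Whose interview is on Thursday?", "expected": "the user's"},
    ],
}

def latest_fact(turns):
    """Naive update handling: the most recent mention of the user's
    interview day wins; the sister's turn must not overwrite it."""
    day = None
    for turn in turns:
        for candidate in ("Tuesday", "Thursday"):
            if candidate in turn and "sister" not in turn:
                day = candidate
    return day
```

Even this toy version shows why similarity-based retrieval struggles: "Tuesday" and "Thursday" are both stored, and only update-aware state tracking returns the right one.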

What Are the 10 Checkpoints?

ATANT evaluates continuity through a sequence of 10 checkpoints, each verifying a specific stage of the write path and read path:

| CP | Name | What It Tests |
| --- | --- | --- |
| CP1 | Classification | Is the input correctly classified (personal fact, event, emotion, etc.)? |
| CP2 | Triple Storage | Are the expected facts stored in the correct structured form? |
| CP3 | Predicted Queries | Does the system generate the right query-answer pairs at write time? |
| CP4 | Object Type Tagging | Are entities correctly typed (person, place, organization, etc.)? |
| CP5 | Query Classification | Is the verification question correctly classified for retrieval? |
| CP6 | Structural Matcher | Does the question match to the correct stored triple? |
| CP7 | DTCM Convergence | Does the convergence gate activate and select the right traces? |
| CP8 | Final Combined | Is the answer correct? The headline metric. |
| CP9 | Temporal System | Are temporal facts (dates, sequences, active/resolved) correct? |
| CP10 | Adaptation Engine | Does the system detect emotional state and adjust? |

CP8 is what matters for compliance. The other checkpoints diagnose where failures occur in the pipeline, which is what makes ATANT a development tool, not just a scorecard.

What Are the 4 Compliance Levels?

| Level | Requirement | What It Proves |
| --- | --- | --- |
| ATANT-Core | 50 stories, isolated mode, 100% CP8 | Basic continuity works |
| ATANT-Stress | 250 stories, isolated mode, 100% CP8 | Continuity generalizes across story types |
| ATANT-Cumulative | 50 stories, cumulative mode, 100% CP8 | Disambiguation works: multiple users, correct separation |
| ATANT-Scale | 250 stories, cumulative mode, 100% CP8 | Disambiguation scales to production levels |

Scoring tiers within each level: Gold (100%), Silver (95-99%), Bronze (90-94%).

The sequence matters. A system that passes ATANT-Scale has proven it can maintain correct, disambiguated, updateable context across 250 coexisting narratives. That's the bar for production continuity.

What Did the Reference Implementation Score?

The first system evaluated against ATANT is NURA, the reference implementation built on DTCM (Decomposed Trace Convergence Memory) at Kenotic Labs.

| Mode | Stories | Questions | CP8 Accuracy |
| --- | --- | --- | --- |
| Isolated (250) | 250/250 | 1,835/1,835 | 100% |
| Cumulative (50) | 50/50 | 304/304 | 100% |
| Cumulative (250) | ~210/250 | 1,761/1,835 | 96% |

What the Results Mean

Isolated 100%, ATANT-Stress: Gold. Every story, every question, every checkpoint. The write path and read path work correctly when each narrative is tested independently.

Cumulative 50 at 100%, ATANT-Cumulative: Gold. 50 different people's narratives coexisting in the same database. The system retrieves the right fact for the right person every time.

Cumulative 250 at 96%, ATANT-Scale: Silver. 250 narratives. The 4% gap comes from predicate disambiguation at extreme scale. When 250 stories coexist, similarly-named predicates from different stories can compete. The Predicate Lexicon and Inverted Scoring Formula have been reducing this steadily.

How We Got Here

The path was not smooth:

| Date | Architecture | Best Score |
| --- | --- | --- |
| Jan 2026 | Legacy pipeline | 58% (50 stories, with LLM in loop) |
| Feb 2026 | Scoring optimizations | 72%, then regressed to 58% |
| Mar 8 | 594 Equation System + DTCM | 100% isolated (50 stories) |
| Mar 12 | 5 rounds complete | 100% isolated (250 stories) |
| Mar 14 | ParsedUtterance pipeline | 100% cumulative (50 stories) |
| Mar 16 | Garbage gate + explanation rescue | 100% cumulative (50), 96% cumulative (250) |

The legacy pipeline hit a ceiling at 58% and suffered from whack-a-mole regressions: fixing one story broke another. That forced the architectural rewrite to the 594 Equation System and DTCM. From that point, every test round passed on the first attempt.

What Makes ATANT Different From Existing Benchmarks?

| | Standard LLM benchmarks | Memory-specific benchmarks | ATANT |
| --- | --- | --- | --- |
| What it tests | Model intelligence | Fact retrieval from stored data | Full continuity: persist, update, disambiguate, reconstruct |
| Test format | Single-turn Q&A | Fact pairs or simple dialogues | Multi-turn narratives with updates, contradictions, overlapping entities |
| LLM in eval loop | Usually yes | Often yes | No: deterministic evaluation, no LLM judges |
| Disambiguation | Not tested | Rarely tested | Core requirement: 250 coexisting narratives |
| Update handling | Not tested | Sometimes | Required: facts change, old state must be superseded |
| Open standard | Varies | Usually proprietary | Open, published, system-agnostic |
| Compliance levels | Pass/fail | Score only | 4 levels with progression sequence |

The critical design decision: no LLM in the evaluation loop. ATANT uses deterministic verification. The expected answer is known, and the system's answer is compared directly. No "LLM-as-judge" subjectivity. A system either gets the right answer or it doesn't.

How Can Other Systems Be Evaluated Against ATANT?

ATANT is system-agnostic. Any AI system claiming to maintain continuity can be evaluated:

  1. Ingest the narrative corpus (250 stories, each as a sequence of user utterances)
  2. Process each utterance through the system's write path
  3. Query with the verification questions
  4. Compare the system's answers to the expected answers
  5. Score against the checkpoint framework
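The five steps above can be sketched as a minimal harness. This is a hypothetical outline under stated assumptions: `write` and `query` stand in for the system under test's actual API, the corpus loader is elided, and the normalization shown is illustrative. Note the deterministic comparison in step 4; no LLM judge is involved:

```python
# A minimal, hypothetical ATANT-style harness. `system` is any object
# exposing write(utterance) and query(question) -- placeholder names
# for whatever API the system under test provides.
def evaluate(system, stories):
    """Ingest each story's utterances through the write path, then score
    verification questions with a deterministic string comparison."""
    correct = total = 0
    for story in stories:
        for utterance in story["turns"]:
            system.write(utterance)            # step 2: write path
        for q in story["questions"]:
            answer = system.query(q["q"])      # step 3: read path
            total += 1
            # step 4: deterministic check, no LLM-as-judge
            correct += answer.strip().lower() == q["expected"].strip().lower()
    return correct / total                     # CP8-style headline accuracy
```

In cumulative mode, the same `system` instance would ingest all 250 stories before any querying, which is exactly what makes disambiguation the binding constraint at scale.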

The full specification, story format schema, and example stories are published on GitHub. The evaluation paper is published on arXiv.

We built ATANT because the industry needs a shared definition of what continuity means and a shared way to measure it. The standard is open specifically so it can be adopted, challenged, and improved by others.

What Comes Next

ATANT v1.0 is the foundation. Future versions will add:

  • Reconstruction quality metrics: not just "is the answer correct" but "how complete and useful is the reconstructed situation"
  • Multi-language narratives, testing continuity across languages
  • Proactive behavior testing: does the system surface relevant context without being asked
  • Decay validation: does the system correctly age and deprioritize stale information
  • Cross-system evaluation: standardized comparison across Mem0, Zep, Letta, and others

The standard grows as the field grows.

Try It

The full framework is on GitHub: github.com/Kenotic-Labs/ATANT

The specification, story format, compliance levels, and evaluation protocol are all published. If you're building an AI memory system, test it against ATANT. If you can pass ATANT-Scale at Gold, you have production-grade continuity.

If you can't, now you know exactly where it breaks.

Follow the research at kenoticlabs.com

Samuel Tanguturi is the founder of Kenotic Labs, building the continuity layer for AI systems.

The continuity layer is the missing layer between AI interaction and AI relationship.

Kenotic Labs builds this layer.
