Claude Memory Architecture

How Claude combines a 200k-token context window with multi-layered memory and RAG to deliver accurate, personalised, context-aware responses at scale.

Claude does not just read your last message and reply. It runs a multi-layered memory system — short-term context window, semantic vector store, tiered memory layers, and long-term persistent profiles. All of these are blended together before it generates a single token of response.

This article explains each layer, how they connect, and why the design matters.

What you will learn

  • How Claude's 200k-token context window works and what happens when it fills up
  • When and why Claude triggers external memory retrieval
  • How RAG (Retrieval-Augmented Generation) turns a user query into a precise memory lookup
  • What the four memory tiers are and how long each one lasts
  • How retrieved memory is fused with the active conversation before the model generates a reply

Background

Most large language models have one form of memory: the context window. Everything the model "knows" during a conversation is whatever fits in that window right now. Once it overflows, older content is lost.

Claude extends this with a retrieval layer — a separate vector database that stores past conversations, user preferences, project notes, and world knowledge. When the context window alone is not enough, Claude pulls relevant information from this store and blends it with the live conversation.

This approach is called RAG (Retrieval-Augmented Generation). Think of the context window as your short-term working memory. The vector store is a well-organised filing cabinet you can search in milliseconds.


Layer 1 — Context Window (Short-Term Working Memory)

The context window holds everything active in the current conversation. Claude uses a 200,000-token window, enough to hold roughly 150,000 words of English text (about 0.75 words per token).

Three things happen as the window fills:

  1. Prioritisation — The system scores every item by importance. System instructions, recent messages, the user's stated intent, and key named entities all rank high. Generic filler ranks low.
  2. Compression — Older messages are summarised rather than dropped outright. The meaning and intent survive; the word-for-word detail does not.
  3. Eviction — When compression is not enough, the least relevant content is removed to make room for new input.
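The three steps above can be sketched as a small budget-fitting routine. This is a minimal illustration, not Claude's actual implementation: the `Item` class, the `importance` scores, the one-token-per-word count, and the "keep the first half" summariser are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    importance: float  # hypothetical score; higher means more worth keeping
    tokens: int        # toy token count: one token per word


def summarise(item: Item) -> Item:
    """Stand-in for compression: keep only the first half of the words."""
    words = item.text.split()
    kept = words[: max(1, len(words) // 2)]
    return Item(" ".join(kept), item.importance, len(kept))


def fit_to_budget(items: list[Item], budget: int) -> list[Item]:
    """Return items that fit within budget, preserving conversation order."""
    kept = dict(enumerate(items))

    def total() -> int:
        return sum(item.tokens for item in kept.values())

    # 1. Prioritisation: visit the least important items first.
    order = sorted(kept, key=lambda i: kept[i].importance)
    # 2. Compression: summarise low-priority items while over budget.
    for i in order:
        if total() <= budget:
            break
        kept[i] = summarise(kept[i])
    # 3. Eviction: drop the least important items if still over budget.
    for i in order:
        if total() <= budget:
            break
        del kept[i]
    return [kept[i] for i in sorted(kept)]
```

In this sketch, low-importance items are shortened first and removed only if shortening is not enough, which mirrors the compress-then-evict order described above.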

Recent messages rank higher

The context window uses recency weighting. Messages from the last few turns carry more weight than messages from early in a long conversation.
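One simple way to model recency weighting is exponential decay. The source does not state which weighting function is used, so the decay form and the `half_life` parameter below are assumptions chosen for illustration.

```python
def recency_weight(turns_ago: int, half_life: float = 4.0) -> float:
    """Weight of a message that is `turns_ago` turns old.

    The weight halves every `half_life` turns (a hypothetical setting).
    """
    return 0.5 ** (turns_ago / half_life)
```

With these assumed settings, the latest message has weight 1.0 and a message four turns back has weight 0.5.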


Layer 2 — Retrieval Trigger

Claude does not retrieve from external memory on every turn. It checks three conditions first:

  • Insufficient context — the current window does not contain what the user needs
  • User query signals external knowledge — the question refers to facts, past sessions, or domain knowledge not present in the window
  • Relevance scoring clears a threshold — the system estimates whether retrieval would actually improve the response quality

If any of these conditions is met, the RAG pipeline starts.
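The trigger logic can be expressed as a single boolean check. The sketch below assumes the three conditions arrive as precomputed signals; the `TriggerSignals` fields and the `threshold` default are hypothetical names and values, not part of any published API.

```python
from dataclasses import dataclass


@dataclass
class TriggerSignals:
    context_is_insufficient: bool          # the window lacks what the user needs
    query_needs_external_knowledge: bool   # the query points outside the window
    estimated_relevance_gain: float        # hypothetical score in [0, 1]


def should_retrieve(signals: TriggerSignals, threshold: float = 0.5) -> bool:
    """Start retrieval if any one of the three conditions holds."""
    return (
        signals.context_is_insufficient
        or signals.query_needs_external_knowledge
        or signals.estimated_relevance_gain >= threshold
    )
```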


Layer 3 — Retrieval Pipeline (RAG)

RAG has five steps. Each step refines the signal from a raw user question into a ranked, ready-to-use block of context.

Step 1 — Query Understanding

The model analyses the user's intent. It extracts the core question, strips away filler, and identifies the key concepts that need to be matched in memory.

Step 2 — Query Embedding

The extracted query is converted into a high-dimensional vector — a list of numbers like [-0.24, -0.80, ..., 0.31]. This numeric representation captures the semantic meaning of the query, not just the words.

Two sentences that mean the same thing will produce similar vectors even if they share no words.
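Closeness between two vectors is commonly measured with cosine similarity. The sketch below uses hand-written three-dimensional toy vectors in place of real embeddings, which typically have hundreds or thousands of dimensions; the variable names and values are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of three phrases.
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

similar = cosine_similarity(car, automobile)
dissimilar = cosine_similarity(car, banana)
```

Here `similar` comes out much larger than `dissimilar`, even though "car" and "automobile" share no characters as words.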

Step 3 — Vector Search

The query vector is compared against all vectors stored in the vector database. The search finds the chunks of stored memory that are semantically closest to the query.

This is similarity search — it finds meaning, not keywords.
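A brute-force version of similarity search fits in a few lines. This is a sketch under simplifying assumptions: the store is an in-memory dictionary with made-up two-dimensional vectors, and every vector is compared exhaustively. Large deployments usually rely on approximate nearest-neighbour indexes instead.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2):
    """Return the k (score, chunk_id) pairs closest to the query, best first."""
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in store.items())
    return heapq.nlargest(k, scored)


# Hypothetical store: chunk id -> embedding.
store = {"A": [1.0, 0.0], "B": [0.7, 0.7], "C": [0.0, 1.0]}
results = top_k([1.0, 0.1], store, k=2)
```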

Step 4 — Re-ranking

The top results from vector search are re-scored using a more precise relevance model. A typical result set might look like:

Result     Score
Chunk A    0.92
Chunk B    0.81
Chunk C    0.75

Only the highest-scoring chunks move forward.
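Re-ranking can be sketched as re-scoring each candidate with a second function and keeping the best few. The source does not describe the relevance model, so `overlap_score` below is a deliberately simple stand-in (word overlap), and `keep` is a hypothetical cut-off.

```python
from typing import Callable


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance model: Jaccard overlap between the two word sets."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    keep: int = 2,
) -> list[tuple[float, str]]:
    """Re-score every candidate and return the `keep` best, highest first."""
    scored = sorted(((score_fn(query, c), c) for c in candidates), reverse=True)
    return scored[:keep]
```

A real system would likely swap `overlap_score` for a learned scoring model; the surrounding sort-and-truncate logic would stay the same.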

Step 5 — Context Assembly

The final chunks are assembled into a structured block of context. This block merges with the live conversation in the context window.
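Context assembly can be illustrated as packing ranked chunks into a size-limited text block. The label format, the word-based budget, and the function name below are assumptions for the example, not a documented format.

```python
def assemble_context(chunks: list[tuple[float, str]], max_words: int) -> str:
    """Pack the highest-scoring chunks into a block of at most max_words words."""
    lines: list[str] = []
    used = 0
    for score, text in sorted(chunks, reverse=True):
        n = len(text.split())
        if used + n > max_words:
            continue  # skip chunks that would exceed the budget
        lines.append(f"[memory score={score:.2f}] {text}")
        used += n
    return "\n".join(lines)
```

The sketch skips a chunk that does not fit and still tries smaller, lower-ranked ones, which is one of several reasonable packing policies.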


Layer 4 — Vector Memory (Semantic Knowledge Store)

Vector memory is the database that makes Step 3 possible. It has four components:


Metadata filtering matters for privacy. Claude restricts retrieval to only the content the current user is allowed to see, even before the similarity search runs.
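Filtering before searching can be sketched as follows. The `Record` layout and the single `owner_id` field are hypothetical; real access rules are usually richer than one owner per record.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    owner_id: str          # hypothetical access-control metadata
    vector: list[float]
    text: str


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search_for_user(
    user_id: str, query: list[float], records: list[Record], k: int = 3
) -> list[Record]:
    """Filter by metadata first, then rank only the visible records."""
    visible = [r for r in records if r.owner_id == user_id]
    visible.sort(key=lambda r: cosine(query, r.vector), reverse=True)
    return visible[:k]
```

Because the filter runs first, a record owned by another user can never appear in the results, no matter how similar it is to the query.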


Layer 5 — Memory Tiers

Not all memory is equal. Claude organises stored information into four tiers based on how specific, how detailed, and how long-lasting each type is.

Episodic Memory

Stores individual interactions in full detail. This is where "what did we discuss last Tuesday" lives. It is the most granular and the most time-limited — detail fades as time passes.

Semantic Memory

Stores facts, concepts, and world knowledge extracted from many conversations. Less tied to a single event, more about what is generally true. Medium retention.

Procedural Memory

Stores learned patterns — how to solve a type of problem, how a user prefers code to be formatted, the steps in a recurring workflow. Abstracted from any single conversation. Long-term.

Foundational Memory

Stores core values, alignment principles, and the fundamental rules Claude operates by. This tier never changes during a session. It is the stable base that all other memory sits on.
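The four tiers can be written down as a small data structure. The field names, the `Retention` ordering, and the `changes_in_session` flag are assumptions that paraphrase the descriptions above; the source gives no concrete durations, so none are invented here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Retention(IntEnum):
    SHORT = 1    # most time-limited
    MEDIUM = 2
    LONG = 3
    FIXED = 4    # stable base


@dataclass(frozen=True)
class Tier:
    name: str
    stores: str
    retention: Retention
    changes_in_session: bool  # assumed True for all but the foundational tier


TIERS = [
    Tier("Episodic", "individual interactions in full detail", Retention.SHORT, True),
    Tier("Semantic", "facts and concepts from many conversations", Retention.MEDIUM, True),
    Tier("Procedural", "learned patterns and workflows", Retention.LONG, True),
    Tier("Foundational", "core values and operating rules", Retention.FIXED, False),
]
```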


Layer 6 — Persistent Context (Long-Term Continuity)

Persistent context is what makes Claude feel like it knows you across sessions. It has four stores:

  • User Profiles — preferences, background, communication style, domain expertise
  • Conversation History — a record of past sessions that can be retrieved when relevant
  • Project Context — ongoing work, domain-specific terminology, project-specific rules
  • Learning & Adaptation — how the model adjusts its behaviour based on accumulated feedback

Persistent context requires explicit storage

None of this happens unless the system is configured to write to persistent storage. In a default API session, memory resets when the conversation ends.
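Explicit storage can be as simple as writing a profile to disk at the end of a session and loading it at the start of the next. The JSON file layout and the function names below are assumptions for illustration; they are not part of any Anthropic API.

```python
import json
from pathlib import Path


def save_profile(path: Path, profile: dict) -> None:
    """Persist the profile so a later session can read it."""
    path.write_text(json.dumps(profile, indent=2), encoding="utf-8")


def load_profile(path: Path) -> dict:
    """Return the stored profile, or an empty one if nothing was saved."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

An application would call `load_profile` before building the prompt and `save_profile` after the conversation; without those calls, nothing carries over.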


Layer 7 — Context Integration (Blending Everything Together)

Before Claude generates a response, it combines three signals:

Context Fusion removes duplicates, resolves conflicts, and ranks content by relevance. The goal is a single, clean context block that gives the model exactly what it needs — no redundancy, no noise.
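The fusion step can be sketched as deduplicate, resolve, then rank. The `Fact` fields and the "newest value wins" conflict rule are assumptions; the source does not say how conflicts are resolved.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str          # what the fact is about, e.g. "preferred_language"
    value: str
    relevance: float  # hypothetical relevance score
    timestamp: int    # larger means newer


def fuse(facts: list[Fact]) -> list[Fact]:
    """Keep one fact per key (the newest), ordered by relevance."""
    newest: dict[str, Fact] = {}
    for fact in facts:
        current = newest.get(fact.key)
        # Duplicates and conflicts share a key; the newer entry replaces the older.
        if current is None or fact.timestamp > current.timestamp:
            newest[fact.key] = fact
    return sorted(newest.values(), key=lambda f: f.relevance, reverse=True)
```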

The final context is what the model actually reads before it writes the response.


Layer 8 — Infrastructure and Optimisation

The memory system runs on infrastructure built for performance, privacy, and cost control:

Component               What it does
Caching Layer           Stores frequent query results so repeated lookups are served without a new search
Index Optimisation      Keeps vector indexes tuned for fast retrieval across regions
Distributed Storage     Scales memory horizontally across multiple data centres
Monitoring & Metrics    Tracks retrieval quality, latency, and usage patterns
Privacy & Security      Encrypts data at rest and in transit; enforces access controls
Cost Optimisation       Balances retrieval depth against token budget and latency targets
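The caching idea can be shown with Python's standard `functools.lru_cache`. The `slow_lookup` function is a hypothetical stand-in for an expensive vector search, and the call counter exists only to make the cache's effect visible.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive path actually runs


def slow_lookup(query: str) -> tuple[str, ...]:
    """Stand-in for an expensive retrieval call."""
    CALLS["count"] += 1
    return (f"result for {query}",)


@lru_cache(maxsize=1024)
def cached_lookup(query: str) -> tuple[str, ...]:
    """Serve repeated queries from memory instead of repeating the search."""
    return slow_lookup(query)
```

Calling `cached_lookup` twice with the same query runs `slow_lookup` once. A production cache would also need an invalidation policy so that stale results are not served after the store changes.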

End-to-End Memory Flow

Every response the model produces can feed back into the memory layers, so later retrievals can draw on what earlier interactions stored.


Summary

  • Claude's context window holds up to 200k tokens and manages overflow via prioritisation, compression, and eviction.
  • A retrieval trigger checks whether external memory would improve the response before starting a lookup.
  • The RAG pipeline converts a user query into a vector, searches the store, re-ranks results, and assembles clean context.
  • Four memory tiers — Episodic, Semantic, Procedural, and Foundational — store information at different levels of detail and duration.
  • Persistent context maintains user profiles, history, and project knowledge across sessions.
  • Context fusion blends retrieved memory with the live conversation before generation, removing duplicates and noise.
  • Infrastructure handles caching, distributed storage, privacy enforcement, and cost control at scale.

