---
title: "Progressive Disclosure: How Production Skills Manage Token Economics"
description: "The three-layer loading model that keeps Claude Code skill libraries fast: what loads at startup, what loads on trigger, and what loads on demand."
pubDate: "2026-04-14"
category: skills
tags: ["claude-code-skills", "progressive-disclosure", "token-economics", "intermediate"]
cluster: 14
cluster_name: "Progressive Disclosure Architecture"
difficulty: intermediate
source_question: "Progressive Disclosure: How Production Skills Manage Token Economics"
source_ref: "Pillar.5"
word_count: 2890
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---

Progressive Disclosure: How Production Skills Manage Token Economics

TL;DR: Progressive disclosure is a three-layer loading model for Claude Code skills used in AEM production skill libraries. Layer 1 (skill descriptions) loads at every session start. Layer 2 (the SKILL.md body) loads when the skill is triggered. Layer 3 (reference files) loads only when the task needs them. Token costs stay flat as your library grows.


Most developers discover progressive disclosure by accident. They build three or four skills, everything works fine, then they add a tenth or a fifteenth and notice Claude getting slower, losing context mid-task, or forgetting instructions from earlier in the session. They blame the model. The pattern shows up repeatedly in developer forums and GitHub issues: context-management architecture, not model capability, is the underlying cause of most skill reliability failures in production libraries.

The real culprit is architecture.

A library of 15 skills doing naive startup loads is not a skill library. It's a context fire you don't know is burning.

Without progressive disclosure, a skill is binary: either its full content is in context or it isn't. Loading full SKILL.md bodies for 15 skills at session start burns 6,000-12,000 tokens before you've typed a single character. Add reference files and you're past 20,000 tokens before your first task. At that depth, Claude starts dropping instructions from the beginning of the window. That's the "Lost in the Middle" problem documented by Stanford's NLP Group: when relevant instructions appear in the middle of a long context rather than at the start, multi-document QA accuracy drops by up to 20 percentage points compared to instructions placed at position zero (Nelson Liu et al., ArXiv 2307.03172, 2023).

Progressive disclosure solves this by staging what loads and when.


What is progressive disclosure in Claude Code skill engineering?

Progressive disclosure is a loading architecture that splits skill content into three tiers — metadata, body, and references — where each tier loads only when its specific conditions are met, so Claude gains full skill capability without paying the full token cost until the moment a task actually requires it.

The principle comes from UI design: don't show users complexity they haven't asked for. Applied to context management, this becomes: don't load Claude with information it hasn't needed yet. The mechanics differ, but the core rule is identical.

In AEM production skill libraries, we use progressive disclosure as the default for any skill with reference files longer than 150 lines, or any library with more than 10 skills. For simple skills under 50 lines with no external references, the overhead isn't worth adding. For anything complex (a 600-line rubric, a domain-specific vocabulary list, a 20-page style guide), progressive disclosure is the difference between a skill that holds its instructions at turn 20 and one that forgets its own constraints by turn 8.

The architecture has three layers. Each has a defined trigger condition and a defined token cost.


What are the three layers of progressive disclosure?

Progressive disclosure splits skill content into three layers — metadata (always loaded, 50-100 tokens per skill), body (loaded on trigger, 400-1,200 tokens), and references (loaded on demand, 500-4,000 tokens) — each with a distinct loading trigger and cost profile that keeps startup overhead near zero for inactive skills.

  1. Layer 1: Metadata (the skill index) — Loaded at session start, always. This is the description field and the skill name only. For every skill in your library, Claude reads this layer at session start to know what skills exist and when to activate them. In a library of 20 skills, Layer 1 costs 1,000-2,000 tokens total. That's the full library, indexed.

  2. Layer 2: Skill body (the SKILL.md body) — Loaded when an incoming user message matches the skill's trigger condition. This is the main instruction set, the output contract, the step-by-step process. In our builds, SKILL.md bodies run 400-1,200 tokens. It loads once, when needed, and nothing more.

  3. Layer 3: Reference files — Loaded on demand during task execution, when the skill instructions call for them explicitly. A reference file is a rubric (2,000 tokens), a vocabulary list (500 tokens), or a domain style guide (4,000 tokens). These load only when the running task needs that specific file.

The distinction: Layer 1 is always present. Layers 2 and 3 are conditional.
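The three layers map directly onto the skill's file tree. A sketch of a typical layout (the skill and file names here are illustrative, not prescribed):

```
.claude/skills/
  commit-review/
    SKILL.md              # frontmatter description = Layer 1; body = Layer 2
    references/
      rubric.md           # Layer 3: loaded only when the body instructs a Read
      security-rubric.md  # Layer 3: loaded only for security-review tasks
```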

For a full walkthrough of the SKILL.md structure that supports this architecture, see What Goes in a SKILL.md File?.


How does the metadata layer work at session start?

The metadata layer is what Claude reads at session start to know your skill library exists: it consists entirely of the description field from each SKILL.md file, loaded as a lightweight index of 50-100 tokens per skill so Claude can match incoming prompts to the right skill without pulling any full skill bodies into context.

At session start, Claude Code reads all SKILL.md files in your .claude/skills/ directory and loads their description fields into context. Not the full files. Just the descriptions.

This is why the description field is the single most load-bearing line in your entire skill. It's the only content that's always in context. Everything else loads conditionally. The description must do two jobs simultaneously: serve as the trigger condition (precise enough to fire on the right prompts and not on the wrong ones) and serve as the 50-token summary of what the skill does. The RULER benchmark (Hsieh et al., ArXiv 2404.06654, 2024) found that LLM performance on multi-hop retrieval tasks drops sharply as effective context length increases — models that claim 128K context windows scored 20-30 percentage points lower on complex retrieval tasks at 64K-128K context versus 4K, suggesting the description's position at the front of context is not incidental.

A typical description runs 50-100 tokens. At 20 skills, that's roughly 1,000-2,000 tokens for the full library index, small enough to leave over 198,000 tokens for actual work in Claude's 200,000-token context window (Anthropic, 2024).
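Concretely, the metadata layer is just the SKILL.md frontmatter. A hypothetical Layer 1 entry, the only part of the file that is always in context:

```markdown
---
name: commit-review
description: Use when the user asks to review a pull request, audit a commit,
  or asks about code changes in a branch. Scores diffs against the team rubric.
---
```

Everything below that frontmatter block stays on disk until the skill triggers.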

For a deeper look at how descriptions control skill discovery, see The SKILL.md Description Field: The One Line That Makes or Breaks Your Skill.


When does the SKILL.md body load into context?

The SKILL.md body loads when Claude semantically matches an incoming user message against the skill's description and decides the skill should run — adding 400-1,200 tokens to context for the full instruction set, output contract, and step-by-step process the skill needs to execute correctly for that specific task.

The match is semantic, not keyword-based. Claude reads the description as a specification of intended trigger conditions and evaluates whether the user's message falls within that specification. A description that says "Use when the user asks to review a pull request or asks about code changes in a branch" will match "can you look at my PR" but not "can you look at my Python file."

Once matched, Claude loads the full SKILL.md body into context. The user experiences this as normal response latency. The body contains the actual instructions, output format, step-by-step process, and operating constraints.

The body only loads once per trigger. If the task runs across multiple turns, the body stays in context for the full task. It does not reload on every message.
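A sketch of what a Layer 2 body might contain, continuing the hypothetical commit-review skill (section names and steps are illustrative):

```markdown
# Commit Review

## Process
1. Read the diff for the branch under review.
2. Before scoring, read references/rubric.md in full.
3. Score each rubric category 1-5, quoting the rubric line that applies.

## Output contract
- One table: category | score | evidence.
- No prose outside the table.
```

Note that the body carries the load instruction for Layer 3 (step 2) rather than the rubric content itself.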

SKILL.md bodies that exceed 1,500 tokens start causing compliance issues in our builds. Long bodies dilute the instruction signal. If your body is growing past 1,200 tokens, that's a sign you're fitting reference-tier content into the instruction tier. Move it to a reference file. IFEval (Zhou et al., ArXiv 2311.07911, 2023), which measures verifiable instruction-following accuracy, illustrates why this matters: instruction-following performance varies significantly with prompt length and instruction density — more instructions competing for attention in a fixed context lowers per-instruction compliance rates, which tracks directly with what we observe when SKILL.md bodies grow too long. This is consistent with findings from LongBench (Bai et al., ArXiv 2308.14508, 2023), a multi-task long-context benchmark across 21 datasets: average model performance dropped 13-18 percentage points when input length shifted from under 8K tokens to over 32K tokens, even on tasks the model could otherwise handle correctly at short context.


How are reference files loaded on demand?

Reference files load when the SKILL.md body explicitly instructs Claude to read them — triggered by a precise Read directive inside the process steps, such as "Before scoring, read references/rubric.md in full" — so each file's 500-4,000 tokens enters context only at the exact task step that requires it.

The standard pattern is a line in the process section: "Before scoring, read references/rubric.md in full." Or: "Load references/vocabulary.md before generating output." Claude treats this as an instruction and executes it as a Read tool call.

The skill does not automatically pull reference files. It loads them in response to an explicit instruction inside the body. You control what loads and at what point in the task.

Why this matters: a skill with five reference files doesn't have to load all five for every task. A commit-review skill with three rubrics (code quality, security patterns, documentation) can include conditional logic in the body: "If the user asks for a security review, also load references/security-rubric.md." A quick commit message check loads one rubric. A full PR audit loads all three.

Most skills in our builds load one or two reference files per task. The conditional loading pattern matters most for skills that handle multiple task types from a single trigger.
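The conditional pattern can be written directly into the body as a short loading section. A hypothetical example (file names are illustrative):

```markdown
## Loading references
- Always: read references/code-quality-rubric.md before scoring.
- If the user asks for a security review: also read references/security-rubric.md.
- If the user asks about documentation: also read references/documentation-rubric.md.
```

Each line gives Claude a concrete file path and a concrete trigger condition, which is what makes the Read tool call reliable.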

"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)

This applies directly to reference loading. If the SKILL.md body doesn't specify exactly when to load which reference, Claude won't load it reliably. The instruction has to be explicit. ToolLLM (Qin et al., ArXiv 2307.16789, 2023), which benchmarks LLM tool-use across 16,000+ real-world APIs, found that tool selection accuracy drops sharply when the task description is underspecified — the model defaults to no-tool responses rather than inferring an unstated tool call. The pattern translates directly: implicit "use the brand guidelines" leaves Claude with no file path and no trigger condition, so the load doesn't happen.


What is the real token cost difference between each layer?

A 20-skill library without progressive disclosure burns approximately 76,000 tokens at startup; the same library with progressive disclosure costs 3,800-5,300 tokens per task, leaving 194,000+ tokens free for actual work — these numbers are from AEM production skill libraries, not estimates. The gap compounds across every task in a session, because each new task starts from the same bloated baseline.

A library of 20 skills, no progressive disclosure (naive approach):

  • All SKILL.md bodies loaded at startup: 20 files x 800 tokens average = 16,000 tokens
  • Reference files loaded at startup: 20 files x 2 reference files x 1,500 tokens average = 60,000 tokens
  • Total startup cost: approximately 76,000 tokens
  • Remaining context window for the actual task: 124,000 tokens

The same library with progressive disclosure:

  • Layer 1 (all descriptions): 20 files x 75 tokens average = 1,500 tokens
  • Layer 2 (one triggered skill body): 800 tokens
  • Layer 3 (one or two reference files for that task): 1,500-3,000 tokens
  • Total cost for one task: 3,800-5,300 tokens
  • Remaining context window: 194,000-196,000 tokens
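The arithmetic behind the two lists can be reproduced in a few lines; the per-item averages below are the same assumptions the lists use, and the function names are mine:

```python
# Token budget for a 20-skill library, using the per-item averages above.
CONTEXT_WINDOW = 200_000

def naive_startup_cost(skills=20, body=800, refs_per_skill=2, ref=1_500):
    """Everything loads at session start: all bodies plus all reference files."""
    return skills * body + skills * refs_per_skill * ref

def progressive_cost(skills=20, desc=75, body=800, refs_loaded=2, ref=1_500):
    """Layer 1 for every skill; Layers 2-3 only for the one triggered skill."""
    return skills * desc + body + refs_loaded * ref

naive = naive_startup_cost()    # 76,000 tokens before the first task
staged = progressive_cost()     # 5,300 tokens for a worst-case two-reference task
print(naive, staged, CONTEXT_WINDOW - staged)
```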

The gap is 71,000 tokens per task. Over a full session, that's the difference between Claude reliably holding your instructions versus forgetting your system prompt by turn 8.

Stanford's "Lost in the Middle" paper showed that instruction retrieval accuracy drops significantly when instructions appear in the middle of a long context (Nelson Liu et al., ArXiv 2307.03172, 2023). At 76,000 tokens of startup overhead, your task-specific instructions don't appear until position 76,000. With progressive disclosure, they appear at position 3,000.

The difference is 71,000 tokens of context position. That changes what Claude can reliably do.


How do you design your skill library to exploit progressive disclosure?

Three design decisions determine whether your library fully benefits from progressive disclosure: how tightly you write skill descriptions, whether heavy content lives in reference files rather than skill bodies, and whether load instructions inside the body are explicit enough that Claude executes them reliably at the right task step.

  1. Keep descriptions short and precise — The description is Layer 1. It's always loaded. Every word costs tokens across every session. A 300-character description for a skill used twice a week pays a higher context tax than a 100-character description with the same trigger precision. Trim descriptions to the minimum that makes triggers reliable.

  2. Move heavy content to reference files — If your SKILL.md body is growing past 1,000 tokens, audit it for content that belongs in Layer 3. Content that should move to reference files: rubrics, checklists with more than 8 items, domain vocabulary lists, brand guidelines, style guides, example libraries, comparative tables. Any content the model reads but doesn't act on in every task run is a reference file candidate. The original RAG architecture (Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020) demonstrated exactly this: retrieving relevant document chunks on demand outperforms pre-loading full documents for knowledge-intensive tasks, because the model attends to a smaller, more relevant context at the point of need. On open-domain QA benchmarks, RAG achieved 44.5% exact match on Natural Questions versus 29.6% for the best closed-book baseline of the time — a 50% relative improvement from retrieval alone. The mechanism is the same: less irrelevant content in context, higher accuracy on what matters. Prompt compression research reinforces this further: LLMLingua (Jiang et al., Microsoft Research, ArXiv 2310.05736, 2023) found that typical production prompts contain up to 80% tokens that do not contribute to the answer, and that compressing prompts to remove low-information tokens reduced latency by 3-5x while maintaining over 97% of task accuracy on benchmarks including GSM8K and BBH.

  3. Write explicit load instructions in the body — Progressive disclosure only works if the SKILL.md body contains clear instructions about when and what to load. "Read references/brand-voice.md before writing the first draft" is a correct instruction. "Use the brand guidelines" is not. Claude won't know where to find them or when to load them.
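As an illustration of decision 1, two hypothetical descriptions with the same trigger scope but very different Layer 1 costs (token counts are rough estimates):

```markdown
# ~55 tokens, paid at every session start:
description: This skill should be used whenever the user is working with pull
  requests of any kind, including reviewing them, commenting on them,
  summarising them, or asking general questions about code changes in a branch.

# ~20 tokens, same trigger precision:
description: Use when the user asks to review, summarise, or comment on a
  pull request or branch diff.
```

The shorter form saves tokens in every session, for every skill in the library, whether or not this skill ever triggers.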


When is progressive disclosure not worth the added complexity?

Progressive disclosure adds structural overhead that is not worth the cost for simple skill libraries: if your skill has no reference files longer than 150 lines and your library has fewer than 10 skills, the three-layer architecture adds maintenance complexity without delivering a meaningful token saving in return.

The threshold from our builds: build with progressive disclosure if the skill has reference files with 150+ lines of content, or if you have 10+ skills in your library. Below those thresholds, the architecture overhead outweighs the token savings.
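The threshold reduces to a one-line check. A sketch of that decision rule (the function name is mine, not an API):

```python
def should_use_progressive_disclosure(num_skills, longest_reference_lines):
    """Threshold from our builds: 150+ line references, or 10+ skills."""
    return longest_reference_lines >= 150 or num_skills >= 10

should_use_progressive_disclosure(5, 400)   # True: one heavy rubric crosses the line
should_use_progressive_disclosure(4, 80)    # False: small, simple library
```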

A skill that fits in 400 lines of SKILL.md with no external references doesn't need a three-layer structure. Splitting it artificially into body and references adds maintenance complexity without meaningful benefit.

A library of 3-5 simple skills also doesn't need it. At that scale, 5 SKILL.md bodies loaded at startup cost 3,000-4,000 tokens total, which is negligible. Long-context benchmarks consistently show a performance cliff rather than a gradual slope: instruction-following and retrieval accuracy hold near their short-context baseline until the effective context depth crosses a threshold, at which point degradation is rapid. The ZeroSCROLLS benchmark (Shaham et al., ArXiv 2305.14196, 2023) confirmed this pattern on summarisation and QA tasks across long documents — models that were within the top performance tier at under 10K tokens fell to near-random performance on identical task types at 100K+ token inputs, with the sharpest drop occurring between 16K and 32K tokens. For Claude's 200K window, 3,000-4,000 tokens of startup load keeps you well below any such threshold.

This architecture works for single-skill activation on a given task. For cross-domain orchestration where three or four skills need to run in sequence, you need a multi-agent architecture instead, and the token economics shift significantly.


How do you know if progressive disclosure is working?

Three signs tell you whether your library is using progressive disclosure correctly: response latency and quality hold steady as the library grows, Read tool calls appear only at the task steps the body instructs rather than at session start, and instruction compliance stays consistent at turn 20 the same as at turn 2.

  1. Claude doesn't slow down as your library grows — Adding skill 20 should not change response latency or quality for tasks that trigger skill 1. If it does, something in your library is loading too much at startup.

  2. Claude executes reference reads as explicit tool calls — Watch the tool calls in the session. If you see Read calls appearing exactly where the SKILL.md body instructs them, on-demand loading is working. If you see Read calls at session start, something is triggering early loads.

  3. Instruction compliance holds at turn 20 — A properly configured library with progressive disclosure maintains the same instruction compliance at turn 20 as at turn 2. If Claude is forgetting skill rules mid-session, the startup token load is too high. The degradation pattern is well-established in long-context evaluation literature: as the context window fills, models lose reliable access to instructions placed early in the sequence. The "Lost in the Middle" effect (Nelson Liu et al., ArXiv 2307.03172, 2023) documents this directly — compliance with instructions positioned far from the query drops substantially as surrounding content grows, which is exactly what happens when 76,000 tokens of startup overhead push your skill instructions away from position zero.

For troubleshooting when skills aren't behaving as expected, see Why Your Claude Code Skill Isn't Triggering (and How to Fix It).


Frequently asked questions

The questions below cover the decision points developers hit most often when building production skill libraries with progressive disclosure: how many tokens descriptions actually consume at startup, what loads and when, how to tell loading is working correctly at turn 20, and when the three-layer structure is simply more overhead than it's worth.

How many tokens does Claude use to store my skill descriptions at startup? Each skill description runs 50-100 tokens depending on length. A library of 20 skills costs 1,000-2,000 tokens at startup for the full description index. This is the fixed, unavoidable cost of knowing your skills exist. For context on token budgeting at scale: the GPT-4 technical report (OpenAI, ArXiv 2303.08774, 2023) documented that instruction-following quality begins to degrade measurably when system and tool-context tokens consume more than 15-20% of the model's effective context budget — at 200K tokens, that ceiling is 30,000-40,000 tokens of overhead before quality degrades, meaning a 1,000-2,000 token description index is well within safe bounds.

Does Claude read all my skill files every time I start a session? Claude reads the description field of every SKILL.md file at session start. It does not read the full body of each file. The full body loads only when a skill is triggered. Reference files load only when an active skill's instructions call for them explicitly. This selective loading pattern mirrors findings from research on retrieval-augmented systems: Izacard et al. ("Few-Shot Learning with Retrieval Augmented Language Models," JMLR 2023) showed that retrieving and loading only the two or three most relevant document chunks at inference time matched or exceeded the accuracy of loading the entire document corpus across open-domain QA benchmarks — demonstrating that selective loading is not a compromise, it is the higher-accuracy approach.

My skills are making Claude slow and forgetful. Is progressive disclosure the fix? Yes, if your skills are loading full bodies at startup, or if reference files are loading unconditionally. You're consuming tens of thousands of tokens before your task begins. Check your SKILL.md bodies for any instructions that trigger file reads at load time rather than during task execution. The forgetfulness pattern is consistent with what the SCROLLS benchmark (Shaham et al., ArXiv 2201.03533, 2022) documented on long-document tasks: even models with formally sufficient context windows produced summaries and answers that omitted information from the beginning of long inputs when the total input length pushed earlier content toward the middle of the window. Loading full skill bodies at startup produces exactly this structure — your task instruction arrives mid-window, after 10,000-76,000 tokens of skill content.

What's the difference between loading a skill and loading one of its reference files? Loading a skill means the SKILL.md body has been added to context, because the trigger condition matched. Loading a reference file means a specific file inside the skill's references/ directory has been read into context by an explicit instruction in the body. These are separate events with separate costs. A skill can be active without any of its reference files loaded.

Can I have too many skills for progressive disclosure to help? Yes, but the threshold is higher than most developers hit. At 50 skills, the description index costs 3,000-5,000 tokens, which remains workable. The real problem at scale is skill collision: multiple skills activating on the same prompt because their descriptions are too broad. That's a trigger design problem, not a token problem. ToolLLM (Qin et al., ArXiv 2307.16789, 2023), which evaluated LLM tool selection across 16,000+ real-world APIs, found that tool selection accuracy fell by 30-40% in conditions with 20+ available tools when tool descriptions overlapped in scope — the model selected the wrong tool or no tool at the same rate it failed to select the right one. Tight, non-overlapping descriptions are the primary defence. See Why Your Claude Code Skill Isn't Triggering (and How to Fix It) for the collision diagnosis process.

Is progressive disclosure still relevant as context windows grow? Context window size doesn't eliminate the "Lost in the Middle" problem. It changes the scale at which it occurs. A 1M token window loaded with 200K tokens of skill content before your task still places your task-specific instructions far from position zero. The mitigation stays the same: load only what's needed, when it's needed. The Gemini 1.5 technical report (Reid et al., ArXiv 2403.05530, 2024) shows near-perfect single-document retrieval in needle-in-haystack tests across a 1M token window, but needle-in-haystack is a synthetic single-fact retrieval task — not multi-hop instruction-following under a competing context load. The RULER benchmark results (Hsieh et al., 2024) on harder multi-hop tasks tell a different story even at 128K context. A larger window is not a substitute for architectural discipline.

What's the right structure for a reference file? A reference file is a markdown document in the references/ directory of your skill folder. There's no prescribed structure beyond being readable. A rubric is a numbered checklist. A vocabulary list is a two-column table. A style guide uses H2/H3 sections. Match the structure to how Claude needs to use the content. Structure matters more than most developers expect: research on in-context document structure (Shi et al., "Large Language Models Can Be Easily Distracted by Irrelevant Context," ICML 2023) found that models perform 20-30% worse when the relevant portion of a document is surrounded by structurally similar but irrelevant text. A reference file that leads with the task-relevant content and keeps sections clearly delimited performs measurably better than an unstructured dump of the same information.
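For example, a minimal rubric-style reference file (contents illustrative) might lead with the criteria and keep each section clearly delimited:

```markdown
# Commit message rubric

## Criteria (score each 1-5)
1. Subject line under 72 characters, imperative mood.
2. Body explains why the change was made, not just what changed.
3. References the issue or ticket where one exists.

## Scoring notes
- A missing body caps the overall score at 3.
```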

Why does the metadata layer have to be a description and not a separate file? The description field in SKILL.md is Claude Code's native index mechanism. It reads descriptions at startup because that's how the tool is designed. The description is both the index entry and the trigger specification simultaneously. You can't replace it with a separate metadata file without losing automatic trigger-detection behavior.


Last updated: 2026-04-14