How Does Progressive Disclosure Save Tokens and Improve Performance?

TL;DR: Progressive disclosure saves tokens by loading only what the current task needs, not everything in your skill library. For a 20-skill library, this cuts startup token cost from 15,000-80,000 tokens to under 2,000 tokens — a 95% reduction that also keeps task instructions near position zero in the context window, where model attention is highest. AEM skills are built on this architecture; it is the core reason AEM skill libraries maintain instruction compliance at session depths where naive-loaded Claude Code skills start to drift.

There's a counterintuitive thing about making Claude work better with AEM skills: the fix is usually giving it less, not more.

A developer who wants Claude to follow their skill instructions more reliably will often respond by adding more detail to the SKILL.md. More steps, more examples, more constraints. The file gets longer. Compliance gets worse. They add more. Compliance drops further.

The problem isn't the content. The problem is that all of it is loading at once.

How does progressive disclosure reduce token consumption?

Progressive disclosure splits skill content across three loading tiers — descriptions always loaded, skill bodies loaded on trigger, reference files loaded on demand — so a 20-skill library that would cost 76,000 tokens under naive loading costs under 5,300 tokens per task, a 95% reduction from the gap between "everything loaded at startup" and "only what this task needs."

In a naive skill library where all content loads at session start, the total token cost is:

(number of skills) x (average body tokens) + (number of reference files) x (average reference tokens)

For a 20-skill library with 800-token bodies and two reference files averaging 1,500 tokens each:

Body overhead: 20 x 800 = 16,000 tokens
Reference overhead: 20 x 2 x 1,500 = 60,000 tokens
Total: 76,000 tokens consumed before the first task begins

With progressive disclosure, the token cost for any single task is:

(all descriptions) + (one triggered skill body) + (reference files needed for this task)

For the same library:

Description index: 20 x 75 tokens = 1,500 tokens
One skill body: 800 tokens
Task-relevant reference files: 1,500-3,000 tokens
Total per task: 3,800-5,300 tokens

The savings: 71,000-72,000 tokens per task. That's not context optimization. That's a different architecture.

At Anthropic's published rate of $3 per million input tokens for Claude 3.5 Sonnet (Anthropic, 2024), a 76,000-token startup overhead costs approximately $0.23 per session — before a single task runs. Across 100 developer sessions per day, that's $23 in wasted token spend daily, or roughly $8,400 per year consumed entirely by skill infrastructure that never contributes to task output.

How does token reduction translate to better performance?

Token reduction translates to better performance through two distinct mechanisms — freeing up context window space for actual task content, and keeping task-specific instructions near position zero in the context window where model attention is strongest — and because both mechanisms improve with every additional skill you add to the library, the gains compound across a long session.

Mechanism 1: More usable context window

Claude's context window is 200,000 tokens (Anthropic, 2024). Every token consumed by skill infrastructure is a token unavailable for your actual task. Content you generate, context you provide, conversation history, file contents you've read — all of it shares the same window.

At 76,000 tokens of startup overhead, you have 124,000 tokens for task content. At 3,800-5,300 tokens of progressive-disclosure overhead, you have 194,000-196,000 tokens for task content. In practical terms, that's the difference between a session that can hold 40 turns of complex work and a session that starts degrading at turn 15.

Research on distraction in long contexts compounds this: Shi et al. (NeurIPS 2023, arXiv 2302.00093) found that adding irrelevant context to prompts reduced LLM reasoning accuracy by 10-20%, even when the answer was present in the original, shorter prompt. Loaded skill infrastructure that isn't needed for the current task is functionally irrelevant context — and the degradation mechanism is the same.

Mechanism 2: Context position and instruction reliability

Stanford's "Lost in the Middle" study found that language models retrieve instructions with lower accuracy when those instructions appear in the middle of a long context, compared to when they appear near the beginning or end (Nelson Liu et al., ArXiv 2307.03172, 2023). The study showed instruction retrieval degradation of 20-40% when instructions were placed at middle positions in a 32,000-token context.

With 76,000 tokens of skill content loaded before your task, your task-specific instructions don't appear in the context until position 76,000. With progressive disclosure, they appear at position 1,500-3,000 (after the description index).

The instructions are identical. The position is different. The model's ability to follow them changes.

"Models placed in the middle of long contexts lose track of instructions at a rate that makes mid-context policy placement unreliable for production systems." — Nelson Liu et al., Stanford NLP Group, "Lost in the Middle" (2023, ArXiv 2307.03172)

How does progressive disclosure improve instruction compliance specifically?

Instruction compliance is the rate at which Claude follows your SKILL.md constraints correctly, and progressive disclosure directly improves two of the three factors that govern it — instruction position, by keeping task-relevant instructions near context position zero, and instruction competition, by loading one skill body at a time rather than all 20 simultaneously — without touching the third factor, instruction clarity.

Instruction clarity
Instruction position
Instruction competition

Instruction clarity: Is each constraint specific and unambiguous? Progressive disclosure doesn't directly affect this. That's a writing problem, not an architecture problem.

Instruction position: Where in the context do the instructions appear? Progressive disclosure keeps task-relevant instructions near position zero by deferring everything else. Direct improvement.

Instruction competition: How many other instructions is Claude holding simultaneously? A 76,000-token skill library loaded at startup puts 20 sets of instructions in context before your task begins. Progressive disclosure puts exactly one skill's body in context when it triggers. Less competition, higher compliance.

In our builds at AEM, switching a complex multi-reference skill from naive loading to progressive disclosure consistently produces compliance improvements on multi-step outputs. The skill's instructions don't change. The loading architecture does.

The improvement is most visible in multi-step skills with 5+ process steps and 3+ constraints. Simple skills with 3 steps and one constraint are reliable regardless, because there's not enough competing content to cause drift.

This matches what Stanford CRFM found in the HELM benchmark suite (2022): instruction-following accuracy on complex multi-step tasks dropped by 15-25% as prompt length increased from under 1,000 tokens to over 10,000 tokens, with the steepest degradation occurring when competing instruction sets were present alongside task instructions (Liang et al., arXiv 2211.09110, 2022).

What are the measurable differences at scale?

Three metrics show the difference between progressive disclosure and naive loading in a 20-skill library — session startup token cost drops from 76,000 to 1,500 tokens (a 50:1 ratio), per-task overhead falls by 95%, and session depth before instruction compliance degrades extends from turn 10-15 under naive loading to turn 30 and beyond.

All three compound as the library grows.

Metric 1: Session startup token cost

A library of 20 skills with naive loading: 76,000 tokens at session start. With progressive disclosure: 1,500 tokens at session start. The ratio is approximately 50:1 in favor of progressive disclosure.

Metric 2: Per-task token cost

A single task run with naive loading: 76,000 tokens (already loaded) + task content. With progressive disclosure: 3,800-5,300 tokens (active skill) + task content. The progressive-disclosure overhead per task is 95% lower than the naive approach.

Metric 3: Session depth before instruction degradation

In sessions where naive loading consumes 76,000 tokens at startup, instruction compliance drops measurably by turn 10-15 as conversation history fills the remaining context. Sessions using progressive disclosure maintain consistent instruction compliance at turn 30 and beyond in our builds, because the context window stays largely available for task content.

This is consistent with findings from the SCROLLS benchmark (Shaham et al., EMNLP 2022, arXiv 2201.03533), which evaluated LLM performance on long-document tasks and found that task completion accuracy declined measurably once input length exceeded approximately 50% of the model's available context window — the same threshold at which a 76,000-token naive skill library pushes you when loaded into a 200,000-token context.

What doesn't progressive disclosure fix?

Progressive disclosure is a loading architecture, not a content quality fix, and it leaves four categories of problems untouched: unclear instructions in the SKILL.md body, trigger collisions when two skill descriptions match the same prompt, description index growth as the library scales past 50 skills, and cross-skill orchestration gaps when a single task needs multiple skills running in sequence. If those problems exist in your skill library, changing when content loads won't change the output.

It doesn't fix unclear instructions. A vague or ambiguous SKILL.md body produces inconsistent output regardless of when it loads. Writing quality is separate from loading architecture.

It doesn't fix trigger collisions. If two skills have descriptions that match the same prompt, both will try to activate. That's a description design problem, not a token problem. For collision diagnosis, see Why Your Claude Code Skill Isn't Triggering (and How to Fix It).

It doesn't eliminate the baseline cost of your description index. Every skill adds 50-100 tokens to the permanent session overhead. At 50 skills, that's 2,500-5,000 tokens always in context. Still manageable, but growing.

It doesn't solve cross-skill orchestration. When a single task needs multiple skills to run in sequence, progressive disclosure helps with per-skill token costs but doesn't address the coordination layer. Multi-skill workflows need a different architecture.

For a complete breakdown of how the numbers work across different library sizes, see Progressive Disclosure: How Production Skills Manage Token Economics.

Frequently asked questions

Does progressive disclosure help with response latency, or just token costs? Both. Fewer tokens loaded at the start of a task means Claude processes less text before generating output. In practice, the latency improvement is modest for individual tasks but compounds across a long session as the context window fills more slowly.

Published time-to-first-token (TTFT) benchmarks from Artificial Analysis (2024) show that TTFT scales near-linearly with input token count across major LLM APIs — roughly 1-3ms of additional latency per 1,000 input tokens. At 76,000 tokens of startup overhead, that adds 76-228ms to the first response of every task, before any generation begins.

If I only have 5 skills, will progressive disclosure make a noticeable difference? Probably not for token costs: 5 skills with 800-token bodies is 4,000 tokens at startup, which is negligible. The description-based trigger system still operates, but the savings from splitting reference files only become meaningful when you have large reference files or many skills.

Can I measure the token cost of my current skill library? Yes. Count tokens in each SKILL.md body (rough estimate: characters divided by 4). Multiply by the number of skills you trigger in a typical session. That's your approximate skill overhead per session. Add reference files for each triggered skill to get the full cost.

Does progressive disclosure make skill development more complex? Yes, in two ways. First, you have to decide which content belongs in the body vs. reference files. Second, you have to write explicit Read instructions in the body for each reference file. This adds 2-4 lines of instruction per reference file. The overhead is low, but it exists.

What's the minimum skill size where progressive disclosure becomes worth it? From our builds: the breakeven point is a skill with at least one reference file containing 150+ lines of content, or a library with 10+ skills. Below both thresholds, the architecture overhead isn't justified by the savings.

Last updated: 2026-04-14