---
title: "Progressive Disclosure: What Are the Three Layers in Claude Code Skills?"
description: "The three layers are metadata (always loaded), skill body (loaded on trigger), and reference files (on-demand). Each has a distinct trigger and a token cost."
pubDate: "2026-04-14"
category: skills
tags: ["claude-code-skills", "progressive-disclosure", "skill-architecture", "intermediate"]
cluster: 14
cluster_name: "Progressive Disclosure Architecture"
difficulty: intermediate
source_question: "What are the three layers of progressive disclosure — metadata, body, and references?"
source_ref: "14.Intermediate.1"
word_count: 1560
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---

What Are the Three Layers of Progressive Disclosure in Claude Code Skills?

TL;DR: The three layers are: metadata (the description field, always loaded at session start), skill body (the full SKILL.md body, loaded when the skill triggers), and reference files (external markdown files, loaded on demand via explicit Read instructions). Each layer loads at a different time, costs different tokens, and carries different types of content.


The metadata layer is the bouncer. The body layer is the bar. The reference files are the back room. Most tasks only see the first two.

This article draws on production patterns from AEM (Skill-as-a-Service), a Claude Code skill library built for repeatable agentic workflows. Understanding what each layer carries and when it loads is the practical foundation of progressive disclosure architecture. Most developers who've heard of progressive disclosure can name the three layers but can't answer a precise question: what exact file path does Layer 3 load from, and what instruction triggers it? The details matter here, because vague instructions produce inconsistent loading. A 2025 Berkeley study of multi-agent LLM systems found that failing to follow task requirements was the single most common system design failure mode, accounting for 11.8% of all recorded failures — ahead of step repetition, context loss, and role misalignment (Cemri et al., "Why Do Multi-Agent LLM Systems Fail?", arXiv:2503.13657, 2025).


What is the metadata layer and what does it contain?

The metadata layer is the description field in each SKILL.md file: a single string that loads at session start, every session, whether or not the skill ever triggers. It controls both when the skill activates and what Claude knows it can do, at a cost of roughly 40-80 tokens per skill.

This is the only layer with zero conditional loading. It's always there. That makes it both the cheapest and the most expensive layer: cheap because it's a single short string per skill, expensive because it accumulates across your entire library.

At session start, Claude Code reads every SKILL.md file in your .claude/skills/ directory and adds each skill's description to context. A library of 20 skills with 80-character descriptions contributes roughly 800-1,200 tokens total. A library of 50 skills is 2,000-3,000 tokens. Research on agentic software engineering systems found that input tokens account for 54% of total token usage — what the authors call a "communication tax" from agents repeatedly passing full contexts back and forth (Salim et al., "Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering", MSR 2026, arXiv:2601.14470).
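As a back-of-the-envelope check on those library-level numbers, here is a minimal sketch; the four-characters-per-token ratio and the per-skill overhead figure are assumptions for illustration, not measured values:

```python
# Rough estimate of the always-loaded Layer 1 cost for a skill library.
# The 4-characters-per-token ratio and the ~30-token per-skill overhead
# (skill name, frontmatter keys, formatting) are heuristic assumptions.

def metadata_cost(num_skills, desc_chars=80, overhead_tokens=30):
    desc_tokens = desc_chars // 4  # ~4 characters per token, on average
    return num_skills * (desc_tokens + overhead_tokens)

print(metadata_cost(20))  # 1000 -- inside the 800-1,200 range above
print(metadata_cost(50))  # 2500 -- inside the 2,000-3,000 range
```

The point of the sketch is the shape of the curve, not the exact figures: the cost is linear in library size, and every skill pays it on every session whether it triggers or not.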

The description field has to answer two questions with a single string of text:

  1. When should this skill activate? (trigger condition)
  2. What does this skill do? (index entry)

A description that handles both correctly looks like: "Use when the user asks to review a pull request, examine code changes in a branch, or check a diff for quality or security issues." That's a trigger condition with three activation patterns and an implicit capability summary.

A description that fails one of the two jobs creates problems. Too narrow and the skill won't trigger on legitimate requests. Too broad and it fires on prompts that belong to a different skill. The balance is the hardest part of Layer 1 design.
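To make that concrete, a hypothetical frontmatter carrying the review description above might look like this (the skill name and exact field layout are illustrative, not a schema guarantee; check the Claude Code docs for the current format):

```yaml
---
name: pr-review
description: >-
  Use when the user asks to review a pull request, examine code changes
  in a branch, or check a diff for quality or security issues.
---
```

The too-broad counterpart, `description: "Use for anything involving code."`, is valid YAML but fails the routing job: it collides with every other code-adjacent skill in the library.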

The stakes of getting the description right extend beyond individual skills. Cemri et al. found that fixing a single agent role description in ChatDev — ensuring the CEO agent had the final say in conversations — produced a +9.4% increase in overall task success rate, with no other changes to the system (Cemri et al., "Why Do Multi-Agent LLM Systems Fail?", arXiv:2503.13657, 2025). The description field is the Layer 1 equivalent: one string, session-wide scope, no fallback if it's wrong.

For a deep analysis of how to write descriptions that pass the production bar, see The SKILL.md Description Field: The One Line That Makes or Breaks Your Skill.


What is the skill body layer and what does it contain?

The skill body is everything in the SKILL.md file below the description: the step sequence, output contract, constraints, and reference Read instructions. It loads in full, at 600-1,000 tokens, the moment an incoming message matches the description's trigger condition.

The body carries the working instructions Claude executes from:

Body content that belongs here:

  • The step sequence: "Step 1: Read the input. Step 2: Identify the relevant section. Step 3: Generate output in the specified format."
  • The output contract: field names, structure, required sections, word count targets.
  • Universal constraints: rules that apply to every single run of the skill, with no exceptions.
  • Reference file Read instructions: "Before scoring, read references/rubric.md in full."

Body content that doesn't belong here:

  • Rubrics, scoring criteria, or evaluation checklists longer than 8 items.
  • Any content that's only needed for a subset of task types this skill handles.
  • Reference material the model reads but doesn't act on in every run.
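Assembled, a body that follows these rules might read like the following hypothetical sketch (the skill, steps, and file names are illustrative):

```markdown
## Steps
1. Read the pull request diff.
2. Read `references/code-quality-rubric.md` in full before scoring.
3. Score each changed file against the rubric.
4. Output the review in the format below.

## Output contract
Sections, in order: Summary (under 150 words), Findings (one bullet per
issue, with file and line), Score (1-10 with a one-sentence justification).

## Constraints
- Never approve a change that deletes tests without replacement.
- Flag any new dependency in the Findings section, every run.
```

The rubric itself lives in Layer 3; the body carries only the instruction to read it.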

The production threshold from AEM's builds: a SKILL.md body should fit in 600-1,000 tokens. At 1,500 tokens, instruction compliance starts dropping. Past 2,000 tokens, the body is carrying content that should live in Layer 3. This matches the broader research pattern: GPT, Claude, and Gemini models all show an average 39% performance drop when instructions are spread across extended context rather than delivered as a tight, complete specification (Liu et al., "LLMs Get Lost In Multi-Turn Conversation", Microsoft Research, arXiv:2505.06120, 2025).

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)

A body that stays tight and explicit loads faster, holds attention better, and produces more consistent output than a body padded with reference content.


What is the reference layer and what does it contain?

The reference layer is a set of markdown files stored in the references/ subdirectory inside your skill folder, each ranging from 500 to 5,000 tokens, and none of them load unless the skill body contains an explicit Read instruction naming that file's exact path.

A reference file holds any heavyweight content that the skill needs for some tasks but not every task. Common examples:

  • A scoring rubric with 20 criteria (2,000-3,000 tokens)
  • A domain vocabulary list with 150 terms (1,500-2,500 tokens)
  • A brand voice and style guide (1,000-4,000 tokens)
  • A library of approved examples (1,000-5,000 tokens)
  • A comparison table between options (500-1,500 tokens)

Keeping reference files out of unconditional context load is directly supported by performance research: models exhibit a 15-20% accuracy drop on retrieval tasks when the relevant passage is placed in the middle of a long context versus at the start — a degradation pattern that applies equally whether the middle-context content is user data or pre-loaded reference material (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts", TACL 2024).

The loading trigger is an explicit Read instruction in the SKILL.md body. The instruction must name the file path exactly:

Before evaluating quality, read `references/quality-rubric.md` in full.

Claude executes this as a file Read tool call. The file's content enters context at that point in the task. Nothing happens before then. The on-demand trigger is what keeps the reference layer from becoming a liability: without it, every session start would add thousands of tokens before the first user message arrives. Salim et al. found that the Design phase — where agent instructions and role definitions are established — accounts for only 2.4% of total token consumption in an end-to-end agentic workflow, while the iterative Code Review phase alone accounts for 59.4%; the gap exists precisely because the design phase loads a small, defined set of instructions rather than accumulating context through repeated full-context passes (Salim et al., "Tokenomics", MSR 2026, arXiv:2601.14470).

Conditional loading uses straightforward conditional syntax in the body:

```markdown
If the user requests a security review, read `references/security-rubric.md` before scoring.
If the PR includes documentation changes, read `references/documentation-rubric.md`.
For all reviews, read `references/code-quality-rubric.md`.
```

This lets a single skill serve multiple task types without loading all reference content for every run. A standard review loads one reference file. A full audit loads three. The same principle drives retrieval-augmented architectures at scale: the Tokenomics study found that in agentic software engineering systems, the primary cost lies not in initial code generation but in automated refinement and verification — stages where context accumulates unnecessarily if content is pre-loaded rather than pulled on demand (Salim et al., MSR 2026, arXiv:2601.14470).
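A toy model of that branching, assuming illustrative token counts for each rubric (only the file names come from the example above):

```python
# Toy model of conditional Layer 3 loading. File names mirror the example
# above; the token counts are illustrative assumptions, not measurements.

REF_TOKENS = {
    "references/code-quality-rubric.md": 1200,   # every review
    "references/security-rubric.md": 2500,       # security reviews only
    "references/documentation-rubric.md": 900,   # doc changes only
}

def refs_for(review):
    """Return the reference files a given review would load."""
    refs = ["references/code-quality-rubric.md"]
    if review.get("security"):
        refs.append("references/security-rubric.md")
    if review.get("docs_changed"):
        refs.append("references/documentation-rubric.md")
    return refs

def reference_cost(review):
    return sum(REF_TOKENS[path] for path in refs_for(review))

print(reference_cost({}))                                        # 1200
print(reference_cost({"security": True, "docs_changed": True}))  # 4600
```

A standard review pays for one rubric; a full audit pays for three; nothing loads before its branch is actually taken.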


How do the three layers interact during a task?

The three layers fire in sequence: the description loads at session start (roughly 60 tokens), the body loads on trigger (roughly 800 tokens), and reference files load mid-task only when the body instructs it. In the worked example below, that brings the total to around 3,860 tokens, against 6,260 if every layer, including all three reference files, had loaded unconditionally.

Suppose you have a writing-review skill with this structure:

```text
.claude/skills/writing-review/
  SKILL.md                         (description + body)
  references/
    brand-voice.md                 (1,800 tokens)
    readability-rubric.md          (1,200 tokens)
    example-library.md             (2,400 tokens)
```
Session start: Claude reads the description field from SKILL.md. Cost: approximately 60 tokens. The body and references don't load.

User asks for a writing review: Claude matches the request against the description. The full SKILL.md body loads. Cost: approximately 800 tokens. The body's instructions say: "Read references/brand-voice.md before evaluating tone. Load references/readability-rubric.md for all reviews. Load references/example-library.md only if the user asks for examples."

Task execution: Claude reads brand-voice.md (1,800 tokens) and readability-rubric.md (1,200 tokens). Total reference load: 3,000 tokens. If the user didn't ask for examples, example-library.md never loads.

Total context cost for this task: 60 (description) + 800 (body) + 3,000 (two reference files) = 3,860 tokens. If all three reference files had been loaded unconditionally at session start, the references alone would have cost 5,400 tokens before any task began. Keeping context lean matters: Liu et al. found that GPT-3.5-Turbo's multi-document QA performance drops by more than 20% in worst-case middle-of-context scenarios, with accuracy in the 20- and 30-document settings falling below the model's closed-book baseline entirely. Performance peaks when relevant information sits at the start of the input and degrades when the model must retrieve instructions from the middle of a densely packed context: the "Lost in the Middle" effect that unconditional preloading directly triggers (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts", TACL 2024).
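The arithmetic above can be sketched in a few lines (every number is the illustrative one from this example skill):

```python
# The worked example as plain arithmetic. All token counts are the
# illustrative figures from the writing-review example, not measurements.

description = 60   # Layer 1: loaded at session start
body = 800         # Layer 2: loaded on trigger
loaded_refs = {"brand-voice.md": 1800, "readability-rubric.md": 1200}
skipped_refs = {"example-library.md": 2400}  # never loaded this task

on_demand_total = description + body + sum(loaded_refs.values())
unconditional_refs = sum(loaded_refs.values()) + sum(skipped_refs.values())

print(on_demand_total)     # 3860
print(unconditional_refs)  # 5400, before any task begins
```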

At scale across a library of 15-20 skills, these per-task savings compound into a context window that stays clean and keeps instruction compliance high through a full session. Cemri et al. (arXiv:2503.13657, 2025) identified context loss as a distinct failure mode in multi-agent systems, separate from instruction mis-specification — meaning even well-written skills fail if the context they depend on is crowded out by unconditionally loaded material that should have been kept in Layer 3.

For a detailed comparison of the actual token numbers at each tier across a 20-skill library, see Progressive Disclosure: How Production Skills Manage Token Economics.


Frequently asked questions

Do the three layers have to map to exactly one file each? Layer 1 (metadata) is always one field in one file. Layer 2 (skill body) is always one SKILL.md file per skill. Layer 3 (references) can be any number of files in the references/ directory. A complex skill might have 5-6 reference files. A simple skill might have none.

Can I nest reference files in subdirectories? Yes. references/rubrics/security.md is a valid path. The Read instruction in the body just needs to specify the correct relative path.

What format should reference files use? Whatever format makes the content readable and usable by Claude. Rubrics work well as numbered lists. Vocabulary lists work well as tables. Style guides work well with H2/H3 sections. There's no required schema.

If Claude loads a reference file, does it stay in context for the rest of the session? Yes. Once a reference file is read into context, it stays there for the session duration. If multiple skills use the same reference file (unusual but possible), it only needs to load once.

Can I put code scripts in reference files? Yes. A bash script, Python function, or JSON schema can live in a reference file. The skill body instructs Claude to read it, and Claude can then use the code as a template, run it as a shell command, or apply it as a format spec.

What happens if Claude can't find a reference file? The Read call returns a file-not-found error and Claude continues execution. The skill body's instructions should account for this: either treat the reference as optional or include a fallback. A skill that silently skips a missing rubric produces inconsistent output; a skill that errors on a missing rubric tells you the architecture is broken.


Last updated: 2026-04-14