Twenty reference files isn't thorough. It's just twenty files. AEM builds Claude Code skills in production, and this pattern appears consistently: when Claude loads even a third of those files on a typical invocation, the context pressure from the reference content overwhelms the skill's actual instructions. Output quality drops. The skill starts producing the average of its reference files instead of executing the skill's core process.

TL;DR: Progressive disclosure works only when most skill invocations load 0–2 reference files. At 20 reference files, a complex query triggers 8–12 simultaneous loads. The resulting context density leaves insufficient attention budget for the SKILL.md instructions that govern behavior. Fix it with an audit: consolidate always-loaded files into SKILL.md, evaluate rarely-loaded ones for removal, and delete anything that never loads.

Why does progressive disclosure break down at scale?

At scale, progressive disclosure breaks down because simultaneous reference loads generate more token pressure than the skill's core instructions can compete with. A skill with 20 reference files and a complex query loads 8–12 files at once, injecting thousands of tokens of competing content before SKILL.md instructions even register.

Progressive disclosure is the loading model that keeps Claude Code skills fast and focused. At startup, Claude reads only the skill metadata (the description field and frontmatter). When a skill is invoked, Claude reads SKILL.md. Reference files load only when a step in SKILL.md explicitly calls for them.

The model assumes that most invocations need most of the SKILL.md body and a small subset of the reference files. A skill with 3 reference files and a complex query might load all 3. A skill with 20 reference files and a complex query might load 8–12.

Each reference file is a full document read: every line enters the context window as token pressure competing with the skill's core instructions. A single 300-line reference file at a conservative 4 tokens per line comes to roughly 1,200 tokens; prose-heavy lines often run higher, so treat that figure as a floor. Eight files is 9,600 tokens of reference content. Claude's standard context window is 200,000 tokens (Anthropic, 2024), but on top of the SKILL.md body (another 800–1,500 tokens), conversation history, and the user prompt, the skill's behavioral instructions quickly become minority content in a complex session.
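The arithmetic above can be sketched directly. The per-line token rate, file sizes, and file counts are the illustrative estimates from this section, not measured values:

```python
# Back-of-envelope context pressure from simultaneous reference loads.
# All constants are illustrative estimates from the surrounding text.
TOKENS_PER_LINE = 4        # conservative per-line token rate
SKILL_MD_TOKENS = 1_200    # mid-range of the 800-1,500 SKILL.md estimate

def reference_pressure(files_loaded: int, lines_per_file: int = 300) -> int:
    """Tokens injected by reference files on a single invocation."""
    return files_loaded * lines_per_file * TOKENS_PER_LINE

ref_tokens = reference_pressure(8)   # 8 files x 300 lines x 4 tokens
print(ref_tokens)                    # 9600
share = SKILL_MD_TOKENS / (SKILL_MD_TOKENS + ref_tokens)
print(round(share, 2))               # 0.11: instructions are ~11% of skill content
```

Even before conversation history is counted, the instructions that govern behavior are a small minority of the skill-related tokens.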

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, creator of Claude Code, Anthropic (2024)

Twenty reference files is the opposite of a closed spec. It's an open library. And libraries require the reader to navigate, synthesize, and prioritize — tasks that degrade accuracy when the reader is a language model with a fixed attention budget. Research confirms the mechanism: performance drops by roughly 20 percentage points when relevant information sits in the middle of a long context rather than at the start or end (Liu et al., Stanford/TACL, "Lost in the Middle," 2023, arXiv:2307.03172). A separate study of 18 frontier models found 20–50% accuracy drops as input length scaled from 10K to 100K tokens (Chroma, 2025).

What does reference overload actually produce?

Reference overload degrades skill output in three distinct ways, each rooted in the same cause: competing reference content crowds out the SKILL.md instructions that govern behavior. The failures are measurable and compound each other. A skill that hits all three simultaneously produces output that is thin, incomplete, and slow.

  • Diluted output. The skill produces outputs that blend content from multiple reference files without correctly weighting any of them. The output looks comprehensive but feels thin: it covers all the reference material at surface level instead of applying the most relevant content with precision. In our builds, we've found that skills with 10+ reference files see output quality plateau or decline relative to equivalent skills with 3–5 well-structured files. The marginal reference file beyond the sixth adds noise faster than it adds signal.

  • Instruction displacement. When reference content saturates the context window, the model's effective attention on the SKILL.md instructions drops. Steps that were followed reliably at 3 reference files start being skipped or simplified at 10. The skill's process becomes a suggestion rather than a procedure. Claude Code documentation notes that an activated skill uses under 5,000 tokens for full content load (Anthropic Claude Code docs, 2024). That budget disappears fast when 8 reference files add 9,600 tokens of competing content on top of it.

  • Slow invocation. Each reference file is a file read operation. Twenty potential reads (even if only 8 fire on a given invocation) means the startup sequence involves navigating a large file set. This adds measurable latency to skill response time in production environments. That latency is compounded by a structural problem: explicit output formats with examples push model consistency from roughly 60% to over 95% in benchmarks (Addy Osmani, Engineering Director, Google Chrome, 2024), but that output format structure is exactly what gets displaced when reference content crowds out the SKILL.md instructions that define it.
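The instruction-displacement math in the second bullet can be made concrete. The 5,000-token figure is the budget cited from the Claude Code docs, and the per-file size is this section's 300-line, 4-tokens-per-line estimate:

```python
# How far reference content overshoots the nominal skill budget.
SKILL_BUDGET = 5_000       # cited budget for a fully activated skill
TOKENS_PER_FILE = 1_200    # 300 lines x 4 tokens/line, per the estimate above

def budget_overshoot(files_loaded: int) -> float:
    """Ratio of injected reference tokens to the nominal skill budget."""
    return (files_loaded * TOKENS_PER_FILE) / SKILL_BUDGET

for n in (3, 8, 12):
    print(n, round(budget_overshoot(n), 2))
# 3 files fit inside the budget (0.72); 8 files nearly double it (1.92)
```

The crossover point, on these estimates, sits at four files: the fifth file pushes reference content past the entire budget intended for the skill.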

How do you audit an over-referenced skill?

Audit by load frequency, not by file count. Map how often each reference file actually loads during invocations, identify files that never load and delete them, consolidate files that load on 80% or more of runs into SKILL.md, and trim domain knowledge files to the actionable content Claude needs. Four steps, applied in that order.

  1. Map load frequency. For each reference file, estimate or measure how many invocations trigger its load. A file loaded on every invocation is operating like part of SKILL.md — it should probably be folded into the skill body. A file loaded on fewer than 1 in 5 invocations is a genuine reference file, used only when needed.

  2. Identify the never-loaded files. At least 3–5 of your 20 files probably never load. They were created to cover edge cases that don't arise in practice, or they're holdovers from an earlier version of the skill. Delete them. File presence creates loading pressure even when the files aren't loaded, because SKILL.md often references them in "if X, read Y" conditionals — Claude evaluates those conditions every time.

  3. Consolidate the always-loaded files. Files that load on 80%+ of invocations belong in the SKILL.md body or in a single consolidated reference file. Two or three small always-loaded files can merge into one well-organized document without losing any information. The token cost of a single 400-line file is lower than the coordination cost of three 150-line files that always load together.

  4. Trim the domain knowledge files. Reference files often accumulate domain knowledge that's accurate but low-value: historical context, deprecated patterns, extensive background on how the domain works. Trim each file to the actionable content Claude needs to produce correct output. Remove background explanations that don't change Claude's behavior.
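The four steps reduce to a classification pass over measured load frequencies. A minimal sketch, assuming you have per-file load rates from testing; the file names and rates below are hypothetical, and the cutoffs are the 80% and 1-in-5 thresholds from the steps above:

```python
def audit_action(load_rate: float) -> str:
    """Map a reference file's observed load frequency to an audit action."""
    if load_rate == 0.0:
        return "delete"               # Step 2: never loads
    if load_rate >= 0.8:
        return "fold into SKILL.md"   # Step 3: effectively always loads
    if load_rate <= 0.2:
        return "keep as reference"    # Step 1: genuine on-demand file
    return "trim and keep"            # Step 4: review for low-value content

# Hypothetical measurements: file -> fraction of invocations that load it
observed = {"api-errors.md": 0.95, "legacy-auth.md": 0.0, "edge-cases.md": 0.1}
for name, rate in observed.items():
    print(name, "->", audit_action(rate))
```

The middle band (loads on 20–80% of runs) is where judgment is required; the extremes are mechanical.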

For the core principles behind how reference files should work, see What Are Reference Files in a Claude Code Skill? and What Is Progressive Disclosure in Claude Code Skills?. For how reference files become a chain problem when they reference each other, see What Happens When Reference Files Chain to Other Reference Files?. For the general context overload problem this creates, see What Is Context Bloat and How Does It Hurt Skill Performance?.

What's the right number of reference files?

For most production skills, the right number is 1–3. Above 3, each additional reference file needs a clear justification: a distinct loading condition, a domain that doesn't overlap with existing files, and a load frequency below 80%. At 7 or more, the skill is probably trying to cover multiple use cases that belong in separate skills.

  • 1–3 reference files: Lean and targeted. Each file serves a specific loading condition that SKILL.md can't accommodate inline. This is the right range for most production skills.

  • 4–6 reference files: Acceptable if the skill covers genuinely diverse domains. Map load frequency and loading conditions for each file to confirm distinct triggers and no domain overlap between files.

  • 7+ reference files: Review required. The skill is probably covering multiple distinct use cases that would be better served by separate skills. At this count, the progressive disclosure model is under stress.
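The tiers above collapse into a small decision function; the labels are the ones used in this section:

```python
def reference_count_verdict(count: int) -> str:
    """Translate a skill's reference file count into the review tier above."""
    if count <= 3:
        return "lean and targeted"
    if count <= 6:
        return "acceptable if domains are genuinely diverse"
    return "review required: likely multiple use cases, consider separate skills"

print(reference_count_verdict(2))
print(reference_count_verdict(20))
```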

The honest limitation: this framework applies to single-domain skills. Cross-domain orchestration skills that need to load context from multiple domains can justify more reference files, but should use a multi-agent architecture instead of a single-skill approach with 20 files.


Frequently asked questions

The reference file count problem reduces to token math and load frequency. Fewer, larger files outperform many small ones when they cover the same domain. Detecting which files actually load requires a diagnostic step in SKILL.md. Shared files at .claude/skills/shared/ eliminate duplication without reference chains. Safe deletion means archiving first, then deleting after two clean weeks.

Can I just increase the size of my reference files instead of having many small ones?

Yes, within limits. One consolidated reference file is better than six 100-line files covering the same domain, because loading one file is one read operation with one context injection. Keep the consolidated file under roughly 400–500 lines; beyond that, the file itself becomes too long for Claude to use effectively.
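The trade-off is visible in read-and-token terms, using the same illustrative 4-tokens-per-line rate as the rest of this article:

```python
TOKENS_PER_LINE = 4  # illustrative per-line token rate

def load_cost(file_line_counts: list[int]) -> tuple[int, int]:
    """Return (read operations, tokens injected) for a set of loaded files."""
    reads = len(file_line_counts)
    tokens = sum(lines * TOKENS_PER_LINE for lines in file_line_counts)
    return reads, tokens

print(load_cost([600]))      # (1, 2400): one read, one context injection
print(load_cost([100] * 6))  # (6, 2400): same tokens, six separate reads
```

The token totals are identical; what consolidation buys is fewer reads, fewer conditional triggers in SKILL.md, and one coherent document instead of six fragments.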

How do I know which reference files are actually loading during invocations?

In fresh-session testing, add a temporary diagnostic step to SKILL.md: "Before starting, list the reference files you are loading." Run several representative invocations. The files Claude lists are the ones loading. Files not listed in any test run are candidates for removal.
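The listed files from several diagnostic runs can then be tallied. A sketch with hypothetical run data:

```python
from collections import Counter

# Each inner list: the reference files Claude reported loading in one test run.
runs = [
    ["api-errors.md", "output-format.md"],
    ["output-format.md"],
    ["api-errors.md", "output-format.md", "edge-cases.md"],
]

loads = Counter(name for run in runs for name in run)
for name, count in loads.most_common():
    print(f"{name}: {count}/{len(runs)} runs")
# Reference files that never appear in any run are candidates for removal.
```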

My skill has 20 reference files but they're all small (under 50 lines each). Is that different?

Fifty lines at 4 tokens per line is 200 tokens per file. Twenty files is 4,000 tokens of reference content if all load simultaneously — plus whatever conditionally loads. The token math still matters. Small files don't eliminate the context pressure problem; they reduce it. The load frequency audit still applies.

Should I delete reference files I'm not sure I need?

Move them to an archive folder outside the skill directory rather than deleting them outright. Run your normal invocations for 2 weeks. If no invocations degrade, the files were not contributing; delete them from the archive. If quality drops, restore from the archive and investigate what those files were providing.
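A minimal sketch of the archive-first step, assuming reference files live in a references/ subfolder of the skill directory; the layout and paths here are assumptions for illustration, not a Claude Code requirement:

```python
import shutil
from pathlib import Path

def archive_reference(skill_dir: Path, archive_dir: Path, filename: str) -> Path:
    """Move a suspect reference file out of the skill folder, preserving it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    src = skill_dir / "references" / filename   # assumed skill layout
    dest = archive_dir / filename
    shutil.move(str(src), str(dest))  # the skill can no longer load the file
    return dest

# Usage (hypothetical paths):
# archive_reference(Path(".claude/skills/my-skill"),
#                   Path("archive/my-skill-refs"), "legacy-auth.md")
```

Because the file is moved, not copied, the skill genuinely stops loading it during the trial period, and restoring it is a single move back.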

Can I use shared reference files to reduce duplication across skills?

Yes. A shared reference file at .claude/skills/shared/ can be cited by multiple skills' SKILL.md files directly. This is better than duplicating the file in each skill folder and better than having one skill's reference file point to another skill's reference file (which creates a reference chain).


Last updated: 2026-04-18