title: "Why Can't I Just Put Everything in One Big SKILL.md File?" description: "A monolithic SKILL.md loads into context on every trigger, burns tokens whether they're needed or not, and degrades instruction compliance as it grows." pubDate: "2026-04-14" primary_keyword: "SKILL.md file" category: skills tags: ["claude-code-skills", "progressive-disclosure", "skill-design", "beginner"] cluster: 14 cluster_name: "Progressive Disclosure Architecture" difficulty: beginner source_question: "Why can't I just put everything in one big SKILL.md file?" source_ref: "14.Beginner.2" word_count: 1520 status: draft reviewed: false schema_types: ["Article", "FAQPage"]

Why Can't I Just Put Everything in One Big SKILL.md File?

TL;DR: You can, and for simple skills it's fine. The problem starts when your skill grows past 1,000-1,200 tokens. At that size, every task trigger loads the full file into context, including parts irrelevant to the current task. Token costs climb and instruction compliance drops.


A 4,000-token SKILL.md is not a skill. It's a small novel that Claude has to read every time you ask it to do anything.

That's the core problem with monolithic skill files: they don't distinguish between content that's needed on every run and content that's needed on some runs. Everything loads together, always. A rubric you use twice a week loads at the same time as the trigger logic that fires every session. A 200-line vocabulary list loads even when you're doing a task that never touches vocabulary.

The solution — what AEM calls progressive disclosure architecture — is to split skills into a lean always-loaded body and conditionally-loaded reference files. The bigger the file, the more you're paying for content you don't need.


What actually happens when you put everything in one SKILL.md file?

When your skill triggers, Claude reads the entire SKILL.md body into context in one operation: every token, every section, regardless of whether the current task needs it. Rubrics, vocabulary lists, and example libraries the task will never use load right alongside the instructions that do.

That's fine at 400 tokens. It gets expensive at 1,500 tokens. At 3,000 tokens, you're loading a file that likely contains three or four distinct types of content:

  • instructions
  • a rubric
  • a vocabulary list
  • examples

Your current task might only need the instructions and the rubric, but all four load anyway.

The second problem is instruction density. Every token you add to a SKILL.md body is one more thing Claude has to hold in working attention while executing the task. Stanford's NLP Group found that model accuracy on instruction-following drops when instructions are embedded in longer contexts ("Lost in the Middle," Nelson Liu et al., arXiv 2307.03172, 2023). A 400-token body keeps your instructions tight and salient. A 3,000-token body buries the key instructions in surrounding content.

We've seen this in practice at AEM. Skills that perform well at 600 tokens start showing instruction dropout when padded to 2,000. The model doesn't fail to read the file — it fails to weight the critical constraints correctly when they're competing with pages of secondary content.


What does this look like at scale?

At scale, monolithic skill files compound fast: a 10-skill library with 1,500-token bodies carries 15,000 tokens of potential skill overhead, and that figure doubles or triples the moment you add rubrics, vocabulary sections, or example libraries to each body, pushing total overhead past 30,000-40,000 tokens in sessions where most skills fire.

Imagine a library of 10 skills, each with a monolithic SKILL.md averaging 1,500 tokens. No reference files, just big flat files.

Every session start loads the description index for all 10 skills (approximately 500-1,000 tokens). Then every triggered skill loads 1,500 tokens. On a typical session where 3-4 skills fire across multiple tasks, you're consuming 4,500-6,000 tokens in skill-body loads, on top of the session index.

Now add reference-style content to those bodies. A skill with a built-in rubric, an example library, and a vocabulary section hits 3,000-4,000 tokens easily. That same library of 10 skills now loads 30,000-40,000 tokens per session, depending on which skills activate.

At 40,000 tokens of skill overhead, your actual task content doesn't start until position 40,000 in the context window. According to Anthropic's documentation, Claude Sonnet's context window is 200,000 tokens (2024). You've used 20% of it on skill infrastructure before writing a single line of your task. At Anthropic's published input rate of $3 per million tokens for Claude Sonnet (Anthropic pricing page, 2025), that 40,000-token overhead costs roughly $0.12 per session in input alone — before your actual task content has run a single instruction.

And it compounds: research by Freda Shi et al. found that model accuracy drops dramatically when irrelevant information is embedded in the input context, even when the model has access to all the information it needs ("Large Language Models Can Be Easily Distracted by Irrelevant Context," arXiv 2302.00093, 2023). A 2024 benchmark study found that most state-of-the-art LLMs — including models with 128K context windows — show measurable performance degradation on retrieval and instruction tasks once effective context utilization passes roughly 50% of the stated window size (Hsieh et al., "RULER: What's the Real Context Size of Your LLM?," arXiv 2404.06654, 2024).
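The arithmetic in this section can be checked directly. A minimal sketch, using illustrative figures from the scenario above (a 1,000-token description index, ten bodies bloated to 4,000 tokens each, and the quoted $3-per-million input rate; all values are examples, not measurements):

```python
# Rough session-overhead arithmetic for a monolithic 10-skill library.
# Figures mirror the worst-case scenario described above (illustrative only).

SKILLS = 10
TOKENS_PER_BODY = 4_000          # body bloated with rubric + examples + vocabulary
INDEX_TOKENS = 1_000             # description index loaded at session start
CONTEXT_WINDOW = 200_000         # Claude Sonnet context window, per Anthropic docs
INPUT_RATE_PER_MTOK = 3.00       # USD per million input tokens

overhead = INDEX_TOKENS + SKILLS * TOKENS_PER_BODY   # total skill overhead in tokens
window_share = overhead / CONTEXT_WINDOW             # fraction of the window consumed
cost = overhead / 1_000_000 * INPUT_RATE_PER_MTOK    # input cost per session, USD

print(f"{overhead:,} tokens = {window_share:.1%} of window, ${cost:.3f}/session")
# prints: 41,000 tokens = 20.5% of window, $0.123/session
```

Swap in your own body sizes and skill count; the point is that overhead scales linearly with body size, so every token moved out of the always-loaded body is saved on every session.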

"When you give a model an explicit output format with examples, consistency goes from ~60% to over 95% in our benchmarks." — Addy Osmani, Engineering Director, Google Chrome (2024)

The inverse also holds: when you give a model a 4,000-token file of mixed content instead of a focused 800-token instruction body, consistency drops. The format matters. So does the size.


How does splitting across files fix this?

Splitting moves content out of the always-loaded body and into conditionally-loaded reference files, so a security rubric only loads when the task calls for a security review, a vocabulary list only loads when the task requires terminology checks, and the base body stays under 900 tokens on every run.

The rule from progressive disclosure architecture:

  • Stay in the SKILL.md body: trigger logic, core instructions, output format, step sequence, constraints that apply to every run.
  • Move to reference files: rubrics, checklists with more than 8 items, vocabulary lists, style guides, brand guidelines, example libraries, comparison tables.

A well-split skill has a 600-900 token body with explicit Read instructions pointing to reference files. The body loads when the skill triggers. Reference files load when the body's instructions call for them.

A code-review skill with a built-in 1,800-token security rubric becomes a 700-token body with a line that says "Load references/security-rubric.md when the user requests a security review." The rubric loads only when needed. A standard code review never touches it.

This does require one additional discipline: the Read instructions in the body have to be explicit. "Load references/security-rubric.md before scoring security posture" is a correct instruction. "Use the security guidelines" is not. Claude won't know where to find them.
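As a sketch, a split body for that code-review example might look like this (the path references/security-rubric.md comes from the text above; the section wording and constraints are hypothetical):

```markdown
---
name: code-review
description: Reviews pull requests for correctness, style, and security. Trigger on any request to review code.
---

## Process
1. Read the diff and summarize the change in two sentences.
2. Check correctness and style against the constraints below.
3. Load references/security-rubric.md before scoring security posture,
   but only when the user requests a security review.

## Output format
Summary first, then findings as a bulleted list, each with file and line.

## Constraints
- Never approve a change that removes a failing test.
- Flag any new dependency for explicit user confirmation.
```

Note that the Read instruction in step 3 names the exact file and the exact condition — that's the discipline that makes conditional loading work.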

For a full walkthrough of what belongs in each SKILL.md section, see What Goes in a SKILL.md File?.


What should and shouldn't be in the main SKILL.md body?

The SKILL.md body should contain exactly what Claude needs on every single task run: trigger logic, operating steps, output format, universal constraints, and Read instructions pointing to reference files — nothing more, because every additional token inflates the cost of every trigger without improving the runs that don't need that content. Everything else — rubrics, checklists longer than 8 items, vocabulary lists, style guides, example libraries — belongs in a reference file.

Include in the body:

  • The description field (trigger condition and skill summary)
  • The operating steps or process
  • The output format specification
  • Hard constraints that apply universally
  • Read instructions for reference files (with conditional logic if needed)

Move to reference files:

  • Any checklist longer than 8 items
  • Rubrics used to score or evaluate output
  • Domain vocabulary, jargon, or terminology lists
  • Brand voice or style guidelines
  • Named example libraries
  • Comparative tables or decision matrices

The test: if a piece of content only affects some runs of the skill, it belongs in a reference file, not the body. Anthropic's prompt engineering documentation recommends placing the most important instructions at the start of the prompt, before supplementary content, to maximize instruction salience (Anthropic, "Prompt engineering overview," docs.anthropic.com, 2024) — a design constraint that a 3,000-token flat body structurally violates.

A practical threshold from our builds: if your SKILL.md body exceeds 1,200 tokens, audit it using the categories above. Every piece of content that doesn't belong in the body is adding dead weight to every trigger.
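That audit can start with a rough token count. A minimal sketch using the common ~4-characters-per-token heuristic (the heuristic and the helper names are assumptions for illustration, not an official tokenizer):

```python
# Rough SKILL.md body audit: estimate tokens below the frontmatter and
# flag the file if it exceeds the ~1,200-token threshold suggested above.
# Uses the common ~4 chars/token heuristic, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def audit_skill_body(skill_md: str, threshold: int = 1_200) -> tuple[int, bool]:
    """Return (estimated body tokens, needs_split?) for a SKILL.md string."""
    body = skill_md
    if skill_md.startswith("---"):
        # Strip YAML frontmatter: everything up to the closing '---' line.
        end = skill_md.find("\n---", 3)
        if end != -1:
            body = skill_md[end + len("\n---"):]
    tokens = estimate_tokens(body)
    return tokens, tokens > threshold

tokens, needs_split = audit_skill_body("---\nname: demo\n---\n" + "x" * 8_000)
print(tokens, needs_split)  # prints: 2000 True
```

A file that gets flagged is a candidate for the category audit above: walk its sections and move anything from the "reference files" list out of the body.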


When is a monolithic skill file fine?

A monolithic SKILL.md is the right choice when the body stays under 600 tokens, there's no content you'd load conditionally, and the skill does one focused thing without rubrics, vocabulary sections, or example libraries that would only apply to certain runs. Simple formatters, slug generators, and comment writers fit cleanly in a flat file — no reference files needed, no overhead worth adding.

A single flat SKILL.md file works well when:

  • The skill body is under 600 tokens
  • There's no content you'd want to load conditionally
  • The skill is simple enough that everything it needs is also short

A slug-generation skill, a file-naming formatter, a code comment writer. These are often 200-400 tokens total. No rubric, no vocabulary list, no style guide. A flat file is the right choice here.
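For scale, a complete flat skill at that size can be this short (a hypothetical slug-generation skill; the name and wording are illustrative):

```markdown
---
name: slug-generator
description: Converts titles into URL-safe slugs. Trigger when the user asks for a slug or a URL-safe name.
---

Lowercase the title, replace spaces and punctuation with hyphens, collapse
repeated hyphens, and trim hyphens from both ends. Output only the slug,
nothing else.
```

There's nothing here to load conditionally, so reference files would add maintenance overhead without saving a single token.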

The architecture overhead of reference files and explicit Read instructions adds maintenance complexity. Don't add it where the token savings are negligible.

The signal for when to split: you notice you've written a SKILL.md body that contains a long checklist, a rubric, or any block of content that reads more like documentation than instructions. That content belongs in a reference file. In our experience at AEM, this pattern appears in the majority of skills once a team has been building for more than a few weeks — the body grows incrementally, section by section, until the token cost and compliance problems become visible.

For the full picture of how progressive disclosure changes token economics as your library scales, see Progressive Disclosure: How Production Skills Manage Token Economics.


Frequently asked questions

The most common questions about SKILL.md file size cover three areas: whether Claude Code enforces a hard limit (it doesn't), whether a large context window solves the compliance problem (it doesn't), and how to tell in practice when a body has grown too large to perform reliably.

Is there a hard token limit for SKILL.md files? No hard limit from Claude Code. The limits are practical: bodies past 1,200 tokens start showing compliance degradation in our builds, and bodies past 2,000 tokens frequently cause instruction dropout for complex multi-step tasks.

What happens to a very long SKILL.md file in a large context window? A large context window doesn't fix the instruction compliance problem — it changes the scale at which it occurs. A 4,000-token skill body still dilutes the instruction signal compared to a focused 800-token body. Bigger window, same relative problem.

Can I use sections within one SKILL.md file instead of separate reference files? You can structure your SKILL.md with H2 sections to organize content. The file still loads in full on every trigger. Section headers improve readability for humans, but they don't change the token cost or the instruction density problem.

How do I know if my SKILL.md body is too long? Two practical signs: first, the file exceeds 1,000 tokens when you count the content below the description field; second, you notice Claude skipping or partially following constraints that are defined late in the file.

If my skill has no reference files, does progressive disclosure still apply? Partially. The three-tier architecture is most relevant when you have reference files to split out. Without them, you still benefit from the description-based trigger system (Tier 1 and Tier 2), but there's no Tier 3 in play.


Last updated: 2026-04-14