A skill is over-engineered when it has more structure than the task's fragility demands. The clearest signal: you spent two days building the skill and it performs the same as the 20-line version you started with. AEM's Claude Code skill framework adds structure to Claude Code agent tasks; the failure mode addressed here is adding too much of it. Over-engineering is not about length, since some complex tasks need 300-line skills. It is about whether the structure adds reliability or merely adds weight.
TL;DR: Test for over-engineering with the half-test: can you cut the skill in half and keep 80% of the output quality? In our bar checks, the answer is yes for roughly 40% of the skills we review. Four signals flag over-engineering: redundant reference files, over-specified process steps, compliance-only evals, and length disproportionate to task scope.
What does "over-engineered" mean in skill design?
Over-engineering is not the same thing as size. A skill with 8 H2 sections guiding Claude through a genuinely complex 8-step workflow is correctly specified, not over-engineered. A skill is over-engineered when structure adds weight without adding reliability: when the instructions constrain decisions Claude already makes correctly by default.
Any of these signals qualifies:
- The skill adds steps that Claude would perform correctly by default, without guidance
- Reference files encode knowledge Claude already has at training time
- Process steps constrain Claude's output in ways that produce worse results than less constrained output would
- The skill solves for failure modes that occur in fewer than 1% of real-world invocations
The core concept here is degrees of freedom. A fragile, high-stakes operation — publishing to production, sending irreversible communications, generating legal language — justifies tight constraints. A creative or subjective task — drafting a blog post, summarizing a document — works better with more latitude. Applying the tight-constraint model to a loose-latitude task is the definition of over-engineering.
"Developers don't adopt AI tools because they're impressive — they adopt them because they reduce friction on tasks they repeat every day." — Marc Bara, AI product consultant (2024)
An over-engineered skill adds friction: slower to load, more likely to conflict with adjacent skills, harder to maintain over time. One documented case: a CLAUDE.md that grew from 150 lines to 1,207 lines over nine months consumed 42,200 tokens per conversation. After restructuring as a skill set, the baseline dropped to 2,400 tokens per conversation, a 94% reduction (Cem Karaca, Medium, February 2026).
What are the specific signals that a skill is over-engineered?
Four signals identify an over-engineered Claude Code skill: reference files encoding knowledge Claude already has, process steps constraining decisions Claude makes correctly by default, evals testing format compliance rather than output quality, and skill length that exceeds what the task's failure modes actually demand.
Run this check against your skill:
Signal 1: Reference files for common knowledge — If a reference file contains general concepts Claude already knows (what a YAML file is, how REST APIs work, what a pull request does) that file adds token cost with no information gain. Reference files are for domain-specific knowledge Claude cannot derive from training: your company's taxonomy, a proprietary API, an internal naming convention.
Signal 2: Process steps constraining autonomous decisions — If a step says "analyze the document and identify the 3 most important themes," and the skill adds 4 sub-bullets specifying what "important" means in generic terms, those sub-bullets are over-engineering. Claude already knows what important means in most contexts. MIT Sloan research found that AI outputs depend as much on how instructions are framed as on the underlying model, which means over-specifying the frame often introduces more variance, not less (MIT Sloan Management Review, 2024). Add specificity only where "important" means something unusual in your domain.
Signal 3: Evals testing structural compliance — If your evals assert "output contains exactly 3 bullet points," that is a structural compliance assertion, not an output quality assertion. Evals that test compliance over quality produce skills that are structurally correct and substantively hollow. Giving a model an explicit output format with examples lifts consistency from roughly 60% to over 95% in controlled benchmarks (Addy Osmani, Engineering Director, Google Chrome, 2024); testing whether that format was produced, rather than whether the content was correct, wastes that gain.
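The contrast can be sketched with two toy checks run on the same output. The sample text, the theme list, and both check functions are illustrative placeholders, not part of any real eval suite:

```python
# Hypothetical skill output for a "summarize quarterly metrics" task.
output = """- Revenue grew 12% year over year
- Churn concentrated in the SMB segment
- Support backlog doubled in Q3"""

def structural_eval(text: str) -> bool:
    # Compliance-only: passes for any 3 bullets, even substantively empty ones.
    return len([l for l in text.splitlines() if l.startswith("- ")]) == 3

def behavioral_eval(text: str) -> bool:
    # Substance check: did the summary surface the facts that matter?
    required_themes = ["revenue", "churn", "backlog"]
    return all(theme in text.lower() for theme in required_themes)

hollow = "- Point one\n- Point two\n- Point three"
assert structural_eval(hollow)         # compliance passes on empty content
assert not behavioral_eval(hollow)     # the substance check catches it
assert structural_eval(output) and behavioral_eval(output)
```

A skill tested only with `structural_eval` can regress to `hollow` output and still pass every check.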
Signal 4: Length disproportionate to task scope — A skill that does one thing (formats a commit message, generates a PR description, converts a date format) should be under 80 lines. If it is over 200 lines, there is a sub-procedure that could be removed or moved to a reference file that loads on demand.
How does over-engineering hurt skill performance?
Over-engineering imposes three measurable costs: token overhead at load time (every unnecessary byte consumes context on each of hundreds of daily invocations), instruction conflict as the rule count grows (a 20-rule skill has 190 potential rule pairs vs. 6 for a 4-rule skill), and maintenance debt that compounds with every added reference file.
Three distinct performance costs:
Token cost at load — Every byte of SKILL.md that loads at skill invocation costs context window space. For a skill that runs hundreds of times per day, unnecessary reference files or redundant process steps translate to real token overhead. Progressive disclosure architectures recover 60–92% of session tokens by deferring content that isn't needed (Chudi Nnorukam, DEV Community, 2025). Over-engineering undermines that architecture by loading too much in the skill body itself.
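The overhead scales linearly with skill size and invocation count. A back-of-envelope sketch, using the common (and approximate) 4-characters-per-token heuristic rather than a real tokenizer:

```python
def estimated_daily_token_cost(skill_chars: int, invocations_per_day: int,
                               chars_per_token: float = 4.0) -> int:
    """Rough load-time overhead; chars/4 is a heuristic, not a tokenizer."""
    return int(skill_chars / chars_per_token) * invocations_per_day

# A skill at ~60 chars/line, loaded 300 times a day:
lean = estimated_daily_token_cost(80 * 60, 300)     # 80-line version
bloated = estimated_daily_token_cost(200 * 60, 300)  # 200-line version
print(bloated - lean)  # → 540000 tokens/day spent on the extra 120 lines
```

The absolute numbers are assumptions; the point is that every line you keep is multiplied by the invocation count.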
Instruction conflict — More instructions create more opportunities for contradiction. A 50-line skill with 4 rules has a conflict surface of roughly 6 rule pairs. A 200-line skill with 20 rules has a conflict surface of 190 rule pairs. Research on long-context instruction following found that accuracy drops by more than 20 percentage points when the relevant instruction sits in the middle of the context rather than at the start or end: GPT-3.5-Turbo fell from approximately 76% to 52.9% accuracy in multi-document QA tasks as the answer document moved from position 1 to position 10 in a 20-document context (Liu et al., Stanford NLP Group, TACL 2024, arXiv:2307.03172). Claude resolves conflicts by attending to whichever instruction appears most prominently in context, which is not always the instruction you consider most important.
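The conflict-surface figures above are just the unordered pair count, n(n-1)/2:

```python
from itertools import combinations

def conflict_surface(rule_count: int) -> int:
    # Every unordered pair of rules is a potential contradiction to audit.
    return rule_count * (rule_count - 1) // 2

assert conflict_surface(4) == 6      # the 50-line, 4-rule skill
assert conflict_surface(20) == 190   # the 200-line, 20-rule skill

# Equivalent to enumerating the pairs you would have to check by hand:
rules = [f"rule_{i}" for i in range(20)]
assert len(list(combinations(rules, 2))) == 190
```

Going from 4 rules to 20 multiplies the rule count by 5 but the audit surface by more than 30.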
Maintenance burden — Over-engineered skills accumulate technical debt faster. Adding a new use case requires touching more files. Edge cases interact with more rules. Updates to domain knowledge require propagating changes across multiple reference files instead of one.
For the relationship between skill complexity and context load, see What Is Context Bloat and How Does It Hurt Skill Performance?.
How do you simplify a skill without losing reliability?
Start with the half-test: remove half the lines and run the skill on 5 real inputs. If output quality drops significantly, the lines you removed were earning their place — restore them. If output quality stays the same, the lines were noise. Keep them removed.
For each section you consider keeping, apply the fragility test: would Claude produce wrong output for this task without this instruction? If the answer is "probably not," the instruction is optional. If the answer is "yes, we have seen it fail without it," the instruction is required.
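The half-test can be wrapped in a small harness. `run_skill` and `quality` are caller-supplied placeholders here, not real APIs: you plug in however you invoke the skill and however you score its output:

```python
def half_test(skill_lines: list[str], inputs, run_skill, quality,
              tolerance: float = 0.2) -> bool:
    """Return True if the halved skill keeps ~80% of output quality.

    Placeholders supplied by the caller:
      run_skill(skill_text, input) -> output
      quality(output) -> float score
    """
    full = "\n".join(skill_lines)
    halved = "\n".join(skill_lines[: len(skill_lines) // 2])
    full_score = sum(quality(run_skill(full, i)) for i in inputs) / len(inputs)
    half_score = sum(quality(run_skill(halved, i)) for i in inputs) / len(inputs)
    # Quality held up: the removed lines were noise, keep them removed.
    return half_score >= full_score * (1 - tolerance)
```

If `half_test` returns True, the bottom half of the skill was not earning its place on those inputs; if False, restore it.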
Specific simplification moves:
- Move generic knowledge out of SKILL.md into reference files, or delete it if Claude already knows it. Prompt caching cuts the cost of repeated context by 90% when content is stable across invocations (Anthropic, 2025); that saving only applies to content worth loading in the first place.
- Collapse process steps that chain to each other without a user decision point between them
- Replace structural evals with behavioral evals that test output substance, not format
- Shorten descriptions to their minimum viable trigger phrase, tested for activation reliability
For guidance on matching instruction specificity to task fragility, see How Specific Should My Skill Instructions Be?.
The half-test detects structural over-engineering only. A skill can be short and still fail by misspecifying Claude's decision criteria: if your skill passes the half-test, or is already the right complexity, yet still produces inconsistent output, the problem lies in the output contract or description, not in the skill's size.
FAQ
Is a 500-line skill always over-engineered?
Not necessarily. Pillar-level orchestration skills that coordinate multiple subagents, manage state, and handle dozens of edge cases can justifiably run long. The question is not length but ratio: what fraction of those 500 lines are earning their place? If you cannot answer that for each section, the skill has not been audited for over-engineering.
How do I know if a reference file is necessary?
Ask: does this file contain knowledge Claude cannot derive from its training? If yes, the file is necessary. If no — if the file explains what REST APIs are or what a PR description should contain in generic terms — delete it. Claude already knows. The file adds cost, not context.
My skill works perfectly. Why would I simplify it?
Because maintenance costs accumulate. A skill working perfectly today can stop working after one refactor and two new reference files. Simpler skills are more robust to change. They load faster, conflict less with other skills, and are easier for teammates to update. Working correctly is the baseline, not the ceiling.
What's the minimum viable skill — can a skill be too simple?
A skill can be too simple in one direction: if it is so lightweight that Claude produces acceptable output without it, there is no reason for the skill to exist. But "too simple" in terms of file length is not a failure. A 20-line skill that reliably constrains one specific behavior is exactly the right size for one specific behavior.
Should I merge small skills into one larger skill to reduce total complexity?
Only if the tasks are semantically related and activate on similar triggers. Merging unrelated skills creates trigger conflicts and instruction competition. Merging related sub-tasks — three stages of a single workflow — into a multi-phase skill can reduce total complexity by eliminating duplicate context across files.
Last updated: 2026-04-19