Most Claude Code skills don't fail because Claude is bad at following instructions. They fail because the instructions are structured in one of 20 predictable ways that break the skill at a seam nobody tested.

TL;DR: The 20 anti-patterns below cover every layer of skill design: the description, the instruction body, reference files, naming, output contracts, and self-improvement. Most first skills ship with 3 to 5 of these. Knowing each one by name is the first step to catching it in your own builds before production does it for you.

We've audited skills across hundreds of commissions at Agent Engineer Master (AEM). The same failures appear repeatedly, in the same combinations, for the same structural reasons. This is that list.

What qualifies as an anti-pattern in skill design?

An anti-pattern is not a typo or a formatting mistake. It is a structural choice that looks correct during design, passes basic tests, and only reveals its flaw under production conditions: when inputs deviate from the happy path, when a second skill is added, or when the user phrases a request differently from the designer.

That delay is the defining trait: the flaw stays invisible until the conditions that expose it arrive, and by then the skill has usually shipped and the failure looks like a model problem rather than a design problem.

This guide covers the 20 patterns we see most. Each has a name, a mechanism, and a fix. The scope is single-skill design only: multi-agent orchestration failures have their own taxonomy and require different mitigations. (For patterns specific to multi-agent architectures, see the advanced patterns cluster separately.)

What description mistakes break skill discovery?

The four description mistakes that kill skill discovery are: no trigger condition (#1), multi-line formatting that truncates the field (#2), scope too broad (#3), and scope too narrow (#4). Each has a distinct mechanism and a specific fix. Together they account for the majority of discovery failures we see across commissions.

#1: What breaks when a skill description has no trigger condition?

The description explains what the skill does but not when to use it. Claude cannot infer the trigger from capability alone. It reads a vague statement and skips to something more explicitly scoped. In a 2025 benchmark of real-world skill usage, only 49% of Claude trajectories loaded all available curated skills in a session, and when distractors were added, that dropped to 31% (Yoran et al., "How Well Do Agentic Skills Work in the Wild," ArXiv 2604.04323, 2025). A vague trigger condition is one of the primary drivers of that gap.

Fix: Add a WHEN condition to your description. State the exact user phrase or task type that activates the skill: "Triggers when the user asks to review code, refactor a function, or critique a method. Does not trigger for general coding questions."
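As a sketch, a description with an explicit WHEN condition might look like this in SKILL.md frontmatter; the skill name and exact wording are hypothetical:

```yaml
---
name: reviewing-code
description: Reviews code for bugs, style, and maintainability. Triggers when the user asks to review code, refactor a function, or critique a method. Does not trigger for general coding questions. Outputs a markdown findings table.
---
```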

#2: What happens when the description is longer than one line?

Multi-line SKILL.md frontmatter descriptions break skill discovery. The description field is parsed at startup as a single metadata string. Line breaks cause the parser to treat only the first line as the description, silently discarding the rest.

In one commission, fixing a 4-line description to a single line moved trigger accuracy from 40% to 94% on the same 50-prompt test set. The skill instructions hadn't changed at all. Only the description structure had.

Fix: One line. 1,024 characters maximum. Include the trigger condition and the output type. Nothing else.
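Both constraints are mechanical enough to lint before shipping. A minimal sketch of such a check (the function name and rules are ours, not part of any official tooling):

```python
# Hypothetical pre-ship lint: a SKILL.md description must be a single
# line and within the 1,024-character limit, or discovery degrades.
def check_description(description: str) -> list[str]:
    problems = []
    if "\n" in description:
        problems.append("description spans multiple lines; only the first will be parsed")
    if len(description) > 1024:
        problems.append(f"description is {len(description)} chars; limit is 1024")
    return problems

# A multi-line description fails the single-line check.
assert check_description("Reviews code.\nTriggers on review requests.")
# A one-line description under the limit passes.
assert check_description("Reviews code. Triggers on explicit review requests.") == []
```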

#3: What goes wrong when a description is too broad?

A pushy description activates the skill on tangentially related prompts. The skill fires when the user didn't ask for it, produces unwanted output, and trains the user to distrust automation. Research on tool-augmented LLMs found that models demonstrate a measurable "tendency to overestimate tool applicability": even large models consistently activated tools in incomplete or mismatched conditions that human evaluators correctly rejected (Yang et al., "Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?", ArXiv 2406.12307, 2024).

Fix: Add an explicit exclusion. "Does NOT trigger for general code questions, only for explicit review requests." See How Do I Write Trigger Phrases That Make My Skill Activate Reliably for the full mechanics, including negative trigger syntax.

#4: What goes wrong when a description is too narrow?

The opposite of #3. The skill misses the majority of relevant prompts. A user asks "can you check this function?" and the skill only activates on "please do a formal code review."

Fix: Test at least 10 natural phrasings of the user's likely request before shipping. If the skill misses 3 or more of them, the description is too narrow.
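The 3-miss threshold can be applied mechanically once you have trigger results for each phrasing. A sketch, with hypothetical phrasings:

```python
# Hypothetical pre-ship check: flag the description as too narrow
# if 3 or more natural phrasings fail to trigger the skill.
def too_narrow(trigger_results: dict[str, bool]) -> bool:
    misses = sum(1 for fired in trigger_results.values() if not fired)
    return misses >= 3

results = {
    "can you check this function?": False,
    "please do a formal code review": True,
    "look over this method for bugs": False,
    "review my diff": True,
    "is this code okay?": False,
}
assert too_narrow(results)  # 3 misses: the description needs broadening
```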

What instruction body mistakes make skills unreliable in production?

The four instruction body mistakes that make skills unreliable are: no structure in the body (#5), too many competing rules (#6), version-dependent conditionals Claude cannot evaluate (#7), and specificity mismatched to the task's fragility (#8). They all pass basic tests and fail in production, which is what makes them expensive to catch.

#5: What is a prompt in a trenchcoat and why does it fail?

A skill that looks like a skill (SKILL.md file, frontmatter, a name in the registry) but the body is one long paragraph of general instructions with no structure. No numbered steps. No output contract. No rules section. Vibes with a file extension.

These pass first tests because the happy path works. They fail on novel inputs, produce inconsistent formats across sessions, and cannot be debugged because there is no structure to inspect.

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)

Fix: Structure the skill body with explicit sections. Numbered steps for sequential processes. A rules block for constraints. A defined output format. See What Goes in a SKILL.md File for the standard section order.
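A minimal structured body might look like the sketch below; the headings and content are illustrative, not a prescribed schema:

```markdown
# reviewing-code

## Steps
1. Read the target file or diff.
2. Identify bugs, style issues, and maintainability risks.
3. Rank findings by severity.

## Rules
- Never rewrite code the user did not ask about.
- Flag uncertainty instead of guessing intent.

## Output
A markdown table with columns: Location, Severity, Finding, Suggested fix.
No prose before or after the table.
```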

#6: What happens when a skill accumulates too many rules?

Rules accumulate. Every edge case adds a rule. By version 3, the rules section is 40 lines long and covers every scenario the designer has ever encountered.

Claude processes rules as context, not as code. Too many rules compete for attention. The model satisfices: it follows the rules most recently seen in context and ignores the rest. In our builds, skills with 8 to 10 focused rules consistently outperform skills with 25 or more rules on identical tasks, using identical inputs. Independent benchmark data supports this pattern: across 20 frontier models, even the best performers achieve only 68% accuracy at 500-instruction density, with smaller models hitting accuracy floors as low as 7% (Jaroslawicz et al., IFScale, ArXiv 2507.11538, 2026).

Fix: Audit rules at the 3-month mark. If a rule exists because of one edge case that hasn't recurred, delete it. Keep rules that cover failure modes seen in at least 3 real runs.
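That audit policy can be expressed directly: keep a rule only if its failure mode recurred in at least 3 real runs, or if the rule is too new to have had the chance. A sketch, with hypothetical rule records:

```python
from datetime import date

# Hypothetical rule audit: prune rules born from one-off edge cases
# that never recurred, once they are past a grace period.
def prune_rules(rules, today, min_recurrences=3, grace_days=90):
    kept = []
    for rule in rules:
        age_days = (today - rule["added"]).days
        if rule["recurrences"] >= min_recurrences or age_days < grace_days:
            kept.append(rule)
    return kept

rules = [
    {"text": "Always escape HTML in output", "added": date(2025, 1, 10), "recurrences": 5},
    {"text": "Handle the one weird CSV from March", "added": date(2025, 1, 5), "recurrences": 1},
]
kept = prune_rules(rules, today=date(2025, 6, 1))
assert [r["text"] for r in kept] == ["Always escape HTML in output"]
```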

#7: Why do time-sensitive conditionals break skills?

"If the user is on Claude Sonnet 4, do X. If they're on Claude Opus, do Y." The model cannot reliably determine which version it is. The conditional confuses rather than guides.

Similar failure: "Before the v2 release, use the old format. After v2, use the new one." Models cannot determine "before" or "after" a version boundary without explicit invocation context.

Fix: Remove version-dependent logic from skill instructions. If behavior must vary by context, make that context explicit at invocation time, not in static instructions.

#8: What is instruction specificity mismatch and why does it matter?

The mismatch runs in two directions: high degrees of freedom (loose guidance) applied to a fragile task, or low degrees of freedom (a rigid step-by-step script) applied to a creative task.

A skill for publishing a Shopify product needs exact, deterministic steps: one wrong field breaks fulfillment routing. A skill for writing a blog introduction needs creative latitude with quality constraints, not a script. Applying step-by-step scripts to creative tasks produces robotic output. Applying loose guidance to operational tasks produces errors that are hard to reproduce and impossible to attribute.

Fix: Match specificity to fragility. The more irreversible the action, the more exact the instructions must be.

What reference file mistakes overload skill context?

The three reference file mistakes that overload skill context are: domain knowledge embedded directly in SKILL.md instead of a dedicated file (#9), reference chains that load multiple files before a single step executes (#10), and human documentation left in the skill folder where Claude scans it on startup (#11). Each inflates context load, and context load hurts reliability.

#9: What breaks when domain knowledge is buried in SKILL.md?

Embedding long reference material (product catalogs, style guides, API schemas) directly in the skill body forces Claude to load everything at once. SKILL.md is loaded into context every time the skill runs. Reference files are loaded only when explicitly referenced in the instruction body.

A skill body over 500 lines pushes other context out, and Claude's attention degrades over long contexts: information placed in the middle of a long context is recalled at a rate that makes mid-context policy placement unreliable for production systems (Liu et al., Stanford NLP Group, "Lost in the Middle," ArXiv 2307.03172, 2023). Instructions buried deep in a long SKILL.md body are less reliable than instructions near the start.

Fix: Move any reference material over 50 lines into a dedicated reference file. Link to it from SKILL.md with an explicit load instruction. See What Are Reference Files in a Claude Code Skill.
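An explicit load instruction in SKILL.md might look like this; the file path and step number are hypothetical:

```markdown
## References
Before step 3, read `references/shopify-field-schema.md` for the full
field list. Do not load it for review-only requests.
```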

#10: What is the cost of chaining reference files?

Reference A links to Reference B. Reference B links to Reference C. The skill loads, reads Reference A, triggers a load of Reference B, which triggers Reference C. Three files are in context before the skill has executed a single step.

In our builds, reference chains going 4 levels deep add 2,000 to 4,000 tokens of context overhead per invocation. Each additional file pushes final instructions closer to the attention boundary where recall degrades. Chroma's 2025 Context Rot study, which evaluated 18 frontier models, found consistent performance degradation across all models as input length increases, with LongMemEval showing significantly higher accuracy on focused prompts than on full-context prompts, confirming that more context reliably hurts retrieval, not helps it (Chroma Research, "Context Rot," 2025).

Fix: Reference files load one level deep only. If Reference A needs content from Reference C, copy the relevant excerpt into Reference A directly. Break the chain at design time.
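The one-level rule can be checked at design time from a map of which files link to which. A sketch, with hypothetical file names:

```python
# Hypothetical design-time check: given which files link to which others,
# flag every reference reachable more than one hop from SKILL.md.
def chained_refs(links: dict[str, list[str]], root: str = "SKILL.md") -> set[str]:
    first_hop = set(links.get(root, []))
    deeper = set()
    frontier = list(first_hop)
    seen = set(first_hop)
    while frontier:
        current = frontier.pop()
        for target in links.get(current, []):
            if target not in first_hop:
                deeper.add(target)
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return deeper

links = {
    "SKILL.md": ["ref-a.md"],
    "ref-a.md": ["ref-b.md"],
    "ref-b.md": ["ref-c.md"],
}
assert chained_refs(links) == {"ref-b.md", "ref-c.md"}  # break these at design time
```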

#11: Why do documentation files in the skill folder cause problems?

Files like README.md and CHANGELOG.md are human documentation. Claude scans all files in the skill folder at startup for metadata. A 200-line CHANGELOG.md in that folder burns token budget without contributing anything to skill execution. The performance cost is not theoretical: controlled experiments across open- and closed-source models found that MMLU accuracy dropped 24.2% when context extended to 30K tokens, even when the model could perfectly retrieve all relevant information. Context length itself degrades performance, independently of retrieval quality (Du et al., "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval", ArXiv 2510.05381, 2025).

Fix: Keep skill folders clean. SKILL.md, evals.json, and reference files only. Human documentation belongs in the repository root or a separate docs folder outside the skill directory.
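That allowlist is easy to enforce with a folder lint. A sketch, assuming the SKILL.md / evals.json / reference-files convention above and a `references/` subdirectory for reference material:

```python
# Hypothetical folder lint: anything outside the allowlist is human
# documentation that burns token budget at skill startup.
ALLOWED = {"SKILL.md", "evals.json"}

def flag_extras(folder_entries: list[str]) -> list[str]:
    return sorted(
        name for name in folder_entries
        if name not in ALLOWED and not name.startswith("references/")
    )

entries = ["SKILL.md", "evals.json", "references/schema.md", "CHANGELOG.md", "README.md"]
assert flag_extras(entries) == ["CHANGELOG.md", "README.md"]
```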

What naming mistakes prevent skill discovery?

The two naming mistakes that prevent skill discovery are: vague names like "helper" or "utils" that give Claude no routing signal (#12), and names that conflict with Claude Code's built-in slash commands and silently lose the collision (#13). Both failures show nothing in logs.

#12: How do vague skill names reduce discovery?

"helper," "utils," "tools," "assistant," "automation." These names tell Claude nothing about when to use the skill. Discovery works by matching the skill name and description to the user's request. Vague names reduce match probability. Research on skill retrieval in large agent repositories found that removing the skill body from retrieval signals caused 29 to 44 percentage-point drops in hit rate across all tested retrieval methods, and cross-encoder attention analysis showed 91.7% of model attention concentrating on the skill body, not the name or description (SkillRouter, ArXiv 2603.22455, 2026). In a small, well-named skill library, the name and description carry more weight; keep them specific enough to be unambiguous.

Fix: Use gerund form that names the task. "reviewing-code," "publishing-shopify-products," "writing-linkedin-posts." The name should answer the question "what does this skill do?" without requiring the description.
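A naming lint can catch both the vague names above and the built-in collisions covered in #13. A sketch; the vague-name list and gerund heuristic are our convention, not an official rule:

```python
# Hypothetical naming lint: reject vague or collision-prone names,
# prefer gerund-form names that name the task.
VAGUE = {"helper", "utils", "tools", "assistant", "automation",
         "search", "explain", "help"}

def name_problems(name: str) -> list[str]:
    problems = []
    if name.lower() in VAGUE:
        problems.append("name is vague or collides with a built-in command")
    if not name.split("-")[0].endswith("ing"):
        problems.append("prefer gerund form that names the task")
    return problems

assert name_problems("helper")            # vague, flagged
assert name_problems("reviewing-code") == []  # specific gerund form, clean
```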

#13: What happens when a skill name conflicts with built-in commands?

A skill named "search," "explain," or "help" will conflict with Claude Code's built-in slash commands. The skill fires when the user didn't intend it, or fails to activate because the built-in takes precedence. This failure is silent: nothing in the logs shows the conflict.

Fix: Name skills to reflect your specific workflow, not generic actions. "searching-the-codebase" beats "search." Specificity protects the namespace.

What output contract mistakes break downstream reliability?

The three output contract mistakes are: no defined output format (#14), which lets Claude pick a different structure every session; too few options at decision points (#15), which removes the human from the loop; and too many options without a recommended default (#16), which stalls the user. All three are cheap to catch at design time and expensive to catch in production.

#14: What goes wrong without an output contract?

A skill with no output contract produces whatever format seemed appropriate in the moment. Markdown on Monday, plain text on Wednesday, a JSON blob on Thursday when Claude was in a different part of the session context.

This inconsistency means downstream tools, other agents, or the user's copy-paste workflow breaks unpredictably. "When you give a model an explicit output format with examples, consistency goes from approximately 60% to over 95% in our benchmarks." (Addy Osmani, Engineering Director, Google Chrome, 2024)

Fix: Define the output format explicitly in the skill body. "Output: a markdown table with columns A, B, C. No prose before or after the table." See What Is an Output Contract in a Claude Code Skill.
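A contract section in the skill body might look like this sketch; the column names and fallback line are illustrative:

```markdown
## Output
A markdown table with columns: File, Line, Issue, Suggested fix.
No prose before or after the table.
If no issues are found, output exactly: "No issues found."
```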

#15: Why should skills offer multiple options at decision points?

A skill that produces one option at a decision point removes the human from the loop. Accept or discard. No selection, no refinement, no co-creation.

In our builds, skills that present 3 options at key decision points show 40% higher user satisfaction scores than equivalent skills producing single outputs, measured across 6 months of commission deliveries. Three options is enough to show creative range without triggering decision fatigue.

Fix: At any decision point where the correct answer is genuinely ambiguous, present 3 options. Mark one as recommended for most cases.

#16: What happens when a skill offers too many options?

Offering too many options without a recommended default is the inverse failure of #15. Seven variations give the user too much to evaluate. They stall. They ask Claude to pick for them. The skill that was supposed to save time now costs it.

Fix: If you offer multiple options, mark one as recommended. "Option 2 is recommended for most cases." Three options with a recommended default is the production pattern.

What self-improvement mistakes let skills degrade over time?

The four self-improvement mistakes that degrade skills over time are: a learnings file that grows past 80 lines and contradicts itself (#17), building SKILL.md before the workflow is validated in real use (#18), testing with only the designer's own prompts (#19), and over-engineering one skill to handle more scenarios than it can manage cleanly (#20).

#17: What breaks when the learnings file grows past 80 lines?

The learnings.md file accumulates feedback from real runs. Past 80 lines, it starts to contradict itself. A learning from 6 months ago that said "always use active voice" conflicts with a newer one that says "passive voice is appropriate for technical documentation." Claude attempts to resolve contradictions by averaging them. The result is instructions that follow neither rule reliably.

Fix: Consolidate learnings at 80 lines. Prune entries older than 3 months. Keep only learnings that reflect patterns seen in at least 3 separate runs. See Claude Code Skills That Get Better Over Time for the full consolidation protocol.

#18: Why is building before design approval the most expensive anti-pattern?

Writing SKILL.md before the workflow is validated is the most expensive anti-pattern because the fix is a complete rewrite. Skills built without a design review embed assumptions about the workflow that turn out to be wrong.

In our commissions, the first step is always a design review session. We watch the user attempt the task without AI assistance, note the actual steps they take, and only then write the skill. Skills built from watching real work outperform skills built from a brief alone. The brief tells you what the user thinks they want. Watching them work tells you what the skill actually needs to handle.

Fix: Validate the workflow by observing real work before writing a line of SKILL.md. Design from observed steps, not from the brief.

#19: What is Claude A bias and how does it corrupt skill testing?

The designer tests their own skill. They know what it's supposed to do. Their prompts already align with the trigger phrases. Their tests pass. Real users invoke the skill with different phrasing and different context, and the skill fails. This is not a Claude problem; it is a version of the well-documented confirmation bias in software testing. A 2022 field study published in Communications of the ACM found that approximately 70% of observed developer activities were associated with at least one cognitive bias, and developers lost roughly 25% of their total working time reversing biased decisions (Russo et al., Communications of the ACM, April 2022).

"The failure mode isn't that the model is bad at the task. It's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)

Fix: Test in a fresh session with no prior context. Have someone who didn't build the skill invoke it naturally. Record the 5 most unexpected things they tried. Redesign for those cases before shipping.

#20: What is an over-engineered skill and how does it fail?

A skill that tries to solve a problem too complex for a single skill. It has 12 reference files, 40 rules, 8 process steps, and handles 6 different output formats based on input type.

The skill is hard to debug, harder to maintain, and breaks at the seams between its modes. A user submits something that doesn't fit any of the 6 modes and the skill produces incoherent output while trying to pick the closest match. Analysis of multi-agent system deployments found failure rates of 41% to 86.7% in production environments, with over-scoped task definitions identified as a primary cause (Taskade, "AI Agents in Multi-Agent Production," 2025). Single-skill over-engineering is the same failure at a smaller scale.

Fix: Split complex workflows into multiple focused skills or a skill-plus-subagent architecture. A skill does one thing well. If your skill handles 6 distinct scenarios, you have 6 skills that don't know they're separate yet.

What is the four-checkpoint bar check?

Every skill delivered by Agent Engineer Master passes four checkpoints before we call it production-ready. The bar check is not optional and not negotiable: a skill that fails any checkpoint goes back to redesign, not to the client. The 20 anti-patterns in this guide map directly to failures across these four checks.

  1. Trigger accuracy: The skill activates on the right prompts and skips the wrong ones. We test at least 10 trigger cases and 5 explicit non-trigger cases before shipping.
  2. Output consistency: The skill produces the same format across 3 independent sessions, opened on separate days.
  3. Edge case behavior: The skill handles 3 inputs that weren't in the original brief, without producing incoherent output.
  4. Self-critique: The skill flags or handles its own error cases, rather than silently producing wrong output with no indication of failure.

A skill that fails any checkpoint is not production-ready.

What to check right now

Pick any skill you have built and run it through three questions. These three checks catch the majority of production failures without reading the full skill file: description scope, SKILL.md line count, and output format definition. If any of the three fail, you have found your first anti-pattern.

  1. Is the description a single line with a clear trigger condition?
  2. Is the skill body under 200 lines, with no reference material embedded directly?
  3. Does the skill have a defined output format?
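All three questions can be answered mechanically from the parsed skill. A sketch, assuming the description lives in frontmatter and the output contract uses an `## Output` heading or `Output:` prefix (both assumptions, not requirements):

```python
# Hypothetical quick audit implementing the three checks above.
def quick_audit(description: str, body: str) -> dict[str, bool]:
    return {
        "single_line_description_with_trigger": (
            "\n" not in description and "trigger" in description.lower()
        ),
        "body_under_200_lines": len(body.splitlines()) < 200,
        "output_format_defined": "## Output" in body or "Output:" in body,
    }

body = "## Steps\n1. Review the diff.\n\n## Output\nA markdown table.\n"
result = quick_audit("Reviews code. Triggers on explicit review requests.", body)
assert all(result.values())  # all three checks pass for this skill
```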

The structural difference between yes and no on those three questions is not minor. Across 86 tasks and 7,308 evaluation trajectories, curated skills with well-structured descriptions and clear instructions raised average pass rates by 16.2 percentage points over unstructured baselines, while self-generated skills with no structural discipline showed no improvement at all (SkillsBench, ArXiv 2602.12670, 2025).

For a deeper look at why trigger problems are the most common production failure, see Why Your Claude Code Skill Isn't Triggering (and How to Fix It).

Frequently asked questions

Most first Claude Code skills ship with 3 to 5 of the 20 anti-patterns, most commonly the missing trigger condition (#1), embedded domain knowledge (#9), and no output contract (#14). The questions below address specific diagnosis scenarios, the hardest patterns to fix after the fact, and where to go once the obvious failures are resolved.

How many of the 20 anti-patterns does a typical first skill have?

Three to five. The most common combination is a missing trigger condition (#1), domain knowledge embedded in SKILL.md (#9), and no output contract (#14). Most first skills pass basic tests with all three problems present. They only expose them under real-world conditions: novel inputs, multi-skill sessions, or users who phrase requests differently from the designer.

Which anti-pattern is hardest to fix after the skill is already built?

The over-engineered skill (#20). Once a skill accumulates 12 reference files and 40 rules, splitting it requires rewriting the description, redistributing the reference material, and retesting from scratch. Build narrow from day one. Adding complexity later is far cheaper than removing it.

Can a skill have anti-patterns and still work?

Yes. Anti-patterns reduce reliability, not capability. A skill with anti-patterns will work on easy inputs and fail on harder ones. It will work in isolated sessions and fail when other skills are present. The 20 patterns explain the gap between "works sometimes" and "works every time."

What is the four-checkpoint bar check?

AEM's internal quality review for every production skill. A skill must pass trigger accuracy, output consistency, edge case behavior, and self-critique before delivery. The anti-patterns in this guide are the specific failure modes that map to each checkpoint.

Is there a quick way to audit an existing skill for anti-patterns?

Read the description aloud. If it takes more than one sentence to explain when the skill activates, the description has a problem: either missing trigger (#1), multi-line parsing (#2), or wrong scope (#3, #4). Then count the lines in SKILL.md. Over 200 lines points to concentration problems (#5 or #9). Those two checks catch the majority of production failures without reading the full skill file.

Why do anti-patterns accumulate instead of getting caught early?

Happy-path testing. Most skills are tested by their creators, with inputs that match the creator's mental model. The inputs that expose anti-patterns are exactly the ones nobody thought to test. Claude A bias (#19) is on this list for that reason: the designer's tests are the last place you'll find the skill's real failure modes.

Where do I go after fixing anti-patterns?

Anti-patterns affect current skill quality. Self-improvement mechanisms affect future quality. See Claude Code Skills That Get Better Over Time for the learnings-and-edge-cases pattern that makes skills improve across real-world runs without manual redesign.

Last updated: 2026-04-18