Every Claude Code skill failure in an AEM production skill library starts at one of three layers: discovery (the skill never triggers), execution (the skill triggers but runs incorrectly), or reference loading (the skill runs correctly on simple inputs but fails when reference content is needed). Each layer has a specific test. AEM's three-layer diagnostic framework isolates the failure layer before any fix is attempted, preventing the most common mistake in skill debugging: fixing the wrong thing.

TL;DR: Start at the discovery layer. Invoke the skill explicitly with /skill-name and observe whether it runs. If it does not run, the problem is the description. If it runs but produces wrong output, the problem is the instructions. If it produces wrong output only on complex inputs requiring reference files, the problem is reference loading. Work top-down, not bottom-up.

What are the three diagnostic layers of a skill failure?

A Claude Code skill failure traces to exactly one of three layers: the description that triggers it, the instructions that run it, or the reference files it loads. Each layer produces a distinct symptom pattern. Identifying the layer before attempting any fix is the entire point of this framework.

In our diagnostic work, Layer 1 accounts for roughly 60% of reported skill failures. Most "broken skills" are working skills with underperforming descriptions.

  • Layer 1: Discovery. The skill is not being activated. Claude sees the user's prompt and does not recognize that this skill should handle it. Symptom: the skill never runs, or runs only when explicitly invoked with /skill-name but not on natural language prompts.
  • Layer 2: Execution. The skill activates but runs incorrectly, skipping steps, producing the wrong output format, or violating the output contract. Symptom: the skill runs, but the output is wrong.
  • Layer 3: Reference loading. The skill activates and executes its main steps correctly, but fails on inputs that require reference file content. Symptom: the skill works on simple inputs and fails on complex ones, consistently.

How do you diagnose a discovery problem?

Run the skill with an explicit /skill-name invocation first. If it executes correctly but never fires on natural language prompts, the problem is the description: the skill works, the trigger vocabulary does not. The SKILL.md description field has a maximum of 1,024 characters (Anthropic, Claude Code skill authoring documentation). Every character should match how users actually phrase requests.

  1. Explicit invocation test. Invoke the skill directly with /your-skill-name [sample input]. If the skill runs correctly with explicit invocation but never fires on natural language prompts, the problem is exclusively in the description. The skill works. The description does not match how users phrase requests for the skill.
  2. Description vocabulary audit. Read your description and ask: does it contain the vocabulary a user would naturally use when asking for this task? A skill for "reviewing code for security vulnerabilities" needs description language that matches how developers actually phrase that request: "check this for security issues," "audit this code for vulnerabilities," "look for security problems in this file." If the description uses vocabulary that does not appear in natural phrasing, the classifier will not match.
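As a sketch, here is what a description rewritten to carry natural trigger vocabulary might look like. The skill name, phrasing, and coverage list are hypothetical; the frontmatter fields follow Anthropic's SKILL.md format, and the whole description must stay under the 1,024-character limit noted above:

```markdown
---
name: security-review
description: >
  Review code for security vulnerabilities. Use when the user asks to
  "check this for security issues", "audit this code for
  vulnerabilities", or "look for security problems" in a file or diff.
  Covers injection, authentication, and secrets handling.
---
```

The quoted phrases are doing the work: they mirror how developers actually ask, so the classifier has literal vocabulary to match against rather than an abstract task summary.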

Properly optimized descriptions improve activation from roughly 20% to 50%; adding concrete trigger examples pushes activation further to around 90% (community research compiled from official Anthropic documentation, @mellanon, January 2026). A separate finding: in a production library with 63 installed skills, 33% of skills were completely hidden from the Claude Code agent due to the system prompt character budget. The agent had no knowledge they existed (GitHub issue #13099, Anthropic Claude Code repository, 2025). For the majority of discovery failures, adding 1-2 natural-language trigger phrases to the description resolves the issue. For a detailed guide to what these failures look like in practice, see Why Isn't My Claude Code Skill Working?.

How do you diagnose an instruction problem?

When explicit invocation produces wrong output, the failure is in the SKILL.md instructions. Isolate the failing step first, then determine whether the problem is ambiguity or a gap in the output contract. Specification and design issues account for 41.8% of LLM system failures in production (Cemri et al., NeurIPS 2025, arXiv 2503.13657).

  1. Step isolation. Run the skill with a minimal test input and observe which step produces the deviation. Where does the output first diverge from what you expected?
  2. Step ambiguity check. For the failing step, ask: could Claude interpret this instruction in more than one way? Ambiguous instructions produce inconsistent output because different sessions resolve the ambiguity differently. Adding an explicit output format with examples can improve output consistency from roughly 60% to over 95% (Addy Osmani, Engineering Director, Google Chrome, 2024). The fix is to make the instruction unambiguous: name the exact format, the exact field names, the exact length constraint.
  3. Output contract check. Compare the actual output against the "does NOT produce" list in your output contract. If the skill produces something it should not, or omits something it should include, the output contract is either missing the relevant constraint or contains a contradiction.
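A hedged sketch of what an unambiguous step plus an explicit output contract can look like inside SKILL.md. The step content, column names, and constraints are illustrative, not a prescribed schema:

```markdown
## Step 3: Summarize findings

Output a markdown table with exactly three columns: File, Issue,
Severity (one of: low, medium, high). Limit each Issue cell to one
sentence of at most 20 words.

## Output contract

This skill does NOT produce:
- Code rewrites or patches
- Severity values outside the low/medium/high scale
- Findings for files the user did not supply
```

The step names the exact format, field names, and length constraint, which removes the interpretation room that causes session-to-session drift; the "does NOT produce" list gives the output contract check in step 3 something concrete to compare against.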

"Models placed in the middle of long contexts lose track of instructions at a rate that makes mid-context policy placement unreliable for production systems." — Nelson Liu et al., Stanford NLP Group, "Lost in the Middle" (2023, arXiv 2307.03172)

This applies to instruction placement within SKILL.md. Instructions placed early in the file are attended to more reliably than instructions buried mid-file. If a step is consistently skipped, check where it appears in the file and whether it is preceded by a large block of context that pushes it toward the middle of the loaded content.

How do you diagnose a reference loading problem?

Reference loading failures are conditional: the skill works on inputs that do not need the reference and fails on inputs that do. Run the same test input twice, once with reference files active and once with them suppressed. If only the reference-dependent output is wrong, the problem is reference loading. If both runs are wrong, the problem is in the instructions.

  1. Reference file isolation. Run the skill on an input that requires reference content. Then run the skill on an identical input and explicitly tell Claude to ignore the reference files in that test. If both outputs are wrong, the problem is in the instructions, not the references. If only the reference-dependent output is wrong, the problem is in reference loading.
  2. Reference chain audit. Check whether your reference files point to other reference files. Chains (file A loads file B which loads file C) create loading failures because Claude does not reliably follow multi-hop reference instructions. The one-level-deep rule exists to prevent this failure mode. Anthropic's official skill authoring documentation states: keep references one level deep from SKILL.md (Anthropic, Claude Code skill authoring best practices, 2025).
  3. Reference file size check. Check whether any reference file exceeds 10,000 tokens (roughly 40,000 characters). Oversized reference files can push other instruction content out of the effective attention window. Performance degrades by more than 30% when relevant information shifts from the start or end of a context window to the middle (Liu et al., "Lost in the Middle", Stanford NLP Group, 2023, arXiv 2307.03172).
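A sketch of a reference section that satisfies both the one-level-deep rule and the size check. The filenames and token counts are hypothetical; the point is the flat structure:

```markdown
## References

Load these directly from SKILL.md, never from another reference file:

- references/owasp-top-10.md (~3,000 tokens, under the 10k ceiling)
- references/severity-rubric.md (~1,200 tokens)

Reference files must not instruct Claude to load further files.
```

Every reference is reachable in one hop from SKILL.md, and each file's approximate size is recorded so a later edit that balloons a file past the 10,000-token ceiling is visible in review.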

For a deeper look at how reference files are designed to load, see What Are Reference Files in a Claude Code Skill?.

This diagnostic framework covers single-skill failures. For failures that occur only when multiple skills are installed together, see My Skill Worked Fine Until I Added Another Skill to the Project — the diagnostic for interference is different.


FAQ

Most skill diagnostic questions resolve to one of the three layers above. Session inconsistency and model-tier failures trace to Layer 1 or Layer 2; multi-layer failures are real but uncommon. The questions below address the edge cases and decision points that come up most often when applying the three-layer framework to Claude Code skills in AEM production libraries.

If the skill works in one session but fails in another, which layer is that?

Session-to-session inconsistency is almost always a Layer 1 or Layer 2 problem. Discovery failures produce inconsistent activation because the classifier's confidence varies with how the user phrases the request. Instruction ambiguities produce inconsistent execution because Claude resolves ambiguous instructions differently across sessions. Reference loading failures tend to be consistent — they fail on the same input type every time.

How do I know if a step is being skipped versus producing wrong output?

Add a diagnostic marker: temporarily modify the step to produce a visible signal ("After completing this step, output the word CHECKPOINT"). If the checkpoint appears, the step ran. If it does not appear, the step was skipped. Once you know which category the failure falls into, diagnose within that category.
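The diagnostic marker described above can be sketched as a temporary addition to the suspect step. The step number and task wording are illustrative:

```markdown
## Step 2: Normalize headings

Rewrite every heading to sentence case.
After completing this step, output the word CHECKPOINT on its own
line, then continue. <!-- temporary diagnostic marker: remove after -->
```

If CHECKPOINT appears in the output, step 2 ran and the failure is wrong output within the step; if it never appears, step 2 is being skipped and the placement or preceding context is the suspect.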

What if the skill fails only with certain model tiers — Haiku but not Sonnet?

Model-tier failures are a Layer 2 problem: instructions that are clear enough for Sonnet but ambiguous for Haiku, which has less reasoning capacity to resolve ambiguity. The fix is to make the failing step more explicit — name the exact format, field names, and length constraint. Implicit reasoning tasks that Sonnet handles gracefully often need to be spelled out for Haiku.
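A before/after sketch of tightening one instruction for a smaller model. The step content is hypothetical; the pattern is what matters:

```markdown
<!-- Ambiguous: Sonnet usually infers a format, Haiku often does not -->
Summarize the changes briefly.

<!-- Explicit: behaves the same on both tiers -->
Summarize the changes as a bulleted list of at most 5 bullets.
Each bullet is one sentence of no more than 15 words, starting
with a verb.
```

The explicit version leaves nothing for the model to infer, which is exactly the property that makes an instruction tier-portable.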

Is the three-layer diagnostic the same as the five-phase skill engineering process?

No. The five-phase process is a build methodology (brief, evals, architecture, build, optimize). The three-layer diagnostic framework is for post-build failures. They operate at different stages: five-phase tells you how to build, three-layer tells you what went wrong after you built it.

Can a skill have failures at multiple layers simultaneously?

Yes. A skill with a mediocre description AND ambiguous instructions AND chain-loading reference files has all three problems at once. Start at Layer 1 and fix sequentially. Fixing Layer 1 first gives you reliable activation before you diagnose whether Layers 2 and 3 are also failing. Trying to fix all three at once makes it impossible to know which change fixed which problem.

Last updated: 2026-04-19