In AEM's Claude Code skill engineering work, first skills fail in five predictable ways. Each one is a structural mistake that looks reasonable during design and only reveals its cost in production.
TL;DR: Five mistakes account for over 80% of first-skill failures in our commissions:
- No trigger condition in the description
- Embedding reference material in SKILL.md
- Over-engineering the first version
- Testing only with self-crafted inputs
- Building before doing the task manually
This is the short version of a longer pattern catalog. See The Anti-Patterns Guide: 20 Mistakes That Kill Claude Code Skills for the full list.
What's the biggest mistake in a first Claude Code skill?
Skipping the trigger condition in the description. A skill without a trigger condition is a skill Claude doesn't know when to use. The Claude API caps skill descriptions at 1,024 characters, so every character must specify exactly when to invoke the skill, not just what it does (Anthropic, Claude API Docs, 2025).
The description field does two jobs: it tells Claude what the skill does and tells Claude when to activate it. Most first-time builders write a "what" description and skip the "when." The result is a skill that exists in the registry and rarely fires.
A complete description includes both: "Triggers when the user asks to review a code function, refactor a method, or check for bugs in a specific block of code. Does not trigger for general coding questions." That's the trigger condition. Without it, Claude has to guess.
Fix: Before writing anything else, answer this question: "What exact phrasing from a user should activate this skill?" Write that into the description first.
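As a sketch, here is what a trigger-bearing description can look like in SKILL.md frontmatter. The skill name and wording are hypothetical, invented for illustration:

```yaml
---
name: code-reviewer
description: >-
  Reviews a single function or code block for bugs, style issues, and
  refactoring opportunities. Triggers when the user asks to review a
  function, refactor a method, or check a specific block of code for
  bugs. Does not trigger for general coding questions or for writing
  new code from scratch.
---
```

Note the shape: what it does, when it fires, and when it does not. All three fit comfortably inside the 1,024-character cap.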
Why shouldn't I put everything in my SKILL.md file?
SKILL.md is loaded every time the skill runs. If your SKILL.md contains a 300-line style guide, a 200-line product catalog, and a 150-line API reference, Claude loads all 650 lines into context before executing a single step.
That's a problem for two reasons. First, it consumes token budget that could hold other skills or conversation context. Second, Claude's attention degrades in the middle of long context windows. Instructions buried at line 400 of a 650-line file are followed less reliably than instructions at line 10 (Liu et al., Stanford NLP Group, "Lost in the Middle," ArXiv 2307.03172, 2023). A 2025 study of 18 frontier models, including Claude, found that some dropped from 95% to 60% accuracy once input crossed a length threshold, with no warning and no error (Chroma, "Context Rot," 2025).
Reference material belongs in reference files, loaded on demand. SKILL.md should contain the skill's process steps, rules, and output contract. The reference material that supports those steps lives in separate files that the skill loads only when needed.
A well-built first skill has a SKILL.md under 100 lines. As the HumanLayer engineering team found in practice: "If it's over 300 lines, it's probably costing you tokens every single turn" (HumanLayer, 2025). The same applies to SKILL.md. See What Goes in a SKILL.md File for the correct section structure.
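One way to lay this out on disk, with illustrative file names:

```
code-review-skill/
├── SKILL.md               # process steps, rules, output contract (< 100 lines)
└── references/
    ├── style-guide.md     # loaded only when a style rule is in question
    └── api-reference.md   # loaded only when an API call is needed
```

Inside SKILL.md, a process step can name the reference explicitly ("Read references/style-guide.md before applying style rules"), so the file enters context only when that step actually runs.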
What does "over-engineered" mean for a first skill?
A first skill that handles 6 different input types, produces 4 different output formats, and has 30 rules to cover every edge case the designer imagined. The failure math compounds fast: a 95%-accurate agent on a 20-step task succeeds only 36% of the time (Towards Data Science, 2024). More rules mean more decision steps, and more steps multiply failure opportunities.
Over-engineering is a common pattern in first builds: you're building the skill from scratch, you know all the scenarios where it might be needed, and you want to handle them all. The result is a skill that fails in hard-to-debug ways because too many rules compete for Claude's attention. "When you give a model an explicit output format with examples, consistency goes from approximately 60% to over 95% in our benchmarks" (Addy Osmani, Engineering Director, Google Chrome, 2024). The implication: more output format options produce less consistent output.
A first skill should do one thing. One trigger condition. One output format. One clear purpose. Add scope after the simple version is working reliably, not before.
Fix: Write the simplest version of the skill that would actually be useful. Ship that. Extend it only after you've seen it work on real inputs.
How do I test my first skill correctly?
Not by testing it yourself with inputs you crafted while building it. That's the builder's bias problem: you know exactly what the skill expects, so your tests pass, but real users invoke it differently. In our commissions, a 20-to-30-minute session with a single outside tester consistently surfaces more edge cases than an hour of self-testing.
"The failure mode isn't that the model is bad at the task. It's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)
The correct test protocol for a first skill:
- Give someone who didn't build the skill a one-sentence description of what it does.
- Watch them invoke it without coaching them on phrasing.
- Record every input that produces wrong or unexpected output.
- Fix the instruction body to handle those inputs.
This process takes 20 to 30 minutes. It also reveals whether your description is too narrow (they invoke it differently than you expected) or too broad (it fires when they didn't intend it).
In our commissions, we never accept a skill as production-ready without this step. A skill tested only by its designer is a fair-weather skill.
Should I build the skill before I've done the task manually?
No. Building before doing is the most expensive first-skill mistake because it produces a skill you'll need to rewrite. A skill built from a brief embeds the designer's assumptions about how the task works, and in our builds those assumptions are wrong in ways that only become obvious when you watch someone actually do the task.
A brief or secondhand description locks in those assumptions before the first real test. What you imagined the task involves and what actually happens are rarely the same thing.
The correct sequence:
- Do the task manually at least once, noting every step you actually take.
- Compare those steps to the steps you imagined when designing the skill.
- Build the skill from the real steps, not the imagined ones.
In our builds, this observation phase happens before any SKILL.md gets written. Skills built after watching real work take half the iteration cycles to reach production quality compared to skills built from briefs alone.
What's the right scope for a first skill?
One task. One trigger. One output format. If your skill collection grows past 42 skills, one-third of them become invisible to Claude without warning, because the available_skills context budget runs out at roughly 16,000 characters (Anthropic, Claude Code, GitHub issue #13099, 2025). A narrow scope keeps each skill's description short enough to survive that cut.
If you catch yourself writing "or" in the trigger condition ("triggers when the user asks to review code or write tests or refactor a method"), that's three skills. Build one of them. When it works reliably, build the second.
The boundary test: if a real user can describe your skill's purpose in one sentence, it has the right scope. If they need two sentences, it's two skills.
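To make the "or" test concrete, here is a hypothetical before-and-after on the description field:

```yaml
# Too broad: three skills hiding in one trigger condition
description: >-
  Triggers when the user asks to review code or write tests or
  refactor a method.

# Scoped: build this one first; add the others once it works
description: >-
  Triggers when the user asks to review a specific function or code
  block for bugs. Does not trigger for writing tests or for
  refactoring.
```

The second version is narrower, but it is also the one Claude can match reliably against a user's phrasing.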
This single-trigger approach is designed for single-domain skills. For skills that need to coordinate across multiple tools or output types, you need a multi-agent architecture rather than a single expanded SKILL.md.
See What Is a Claude Code Skill for the foundational scope guidelines, and How Do I Create My First Claude Code Skill for the step-by-step build process.
Frequently asked questions
Is it okay to build a skill for something I only need occasionally?
Yes, but keep the scope tight. Skills for rare tasks fail for the same reasons as skills for frequent tasks, just slower. The less you use a skill, the less feedback you get for improving it. Start with a skill for a task you do at least weekly, where you'll notice quickly if it breaks.
How long should my first skill file be?
Under 100 lines for SKILL.md. A well-structured first skill file has five sections:
- Description: 1 line
- Trigger/output summary: 2 to 3 lines
- Process section: 10 to 20 numbered steps
- Rules section: 5 to 10 rules
- Output format definition: 3 to 5 lines
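Put together, a first SKILL.md following those proportions might be sketched like this. The skill, its steps, and its rules are all hypothetical:

```markdown
---
name: changelog-writer
description: >-
  Drafts a changelog entry from a merged pull request. Triggers when
  the user asks to write a changelog entry for a PR. Does not trigger
  for release notes or commit messages.
---

## Process
1. Read the PR title and description.
2. Identify whether the change is Added, Changed, or Fixed.
3. Draft one entry in the output format below.

## Rules
- One entry per PR; never merge multiple PRs into one entry.
- Past tense, user-facing language, no internal ticket numbers.

## Output format
**Category:** Added | Changed | Fixed
**Entry:** one sentence, past tense, no trailing period
```

The whole file stays well under 100 lines, and anything longer (a full style guide, a category taxonomy) would move to a reference file.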
If you're over 100 lines, you're either embedding reference material that belongs in a separate file or writing a skill that's too complex for a first build.
What if my first skill already exists and has these problems?
Fix the trigger condition first: is it a single line with both a WHEN and a NOT condition? That's the highest-leverage change. Then check the SKILL.md length: anything over 200 lines needs reference files pulled out. Those two changes fix the majority of first-skill failures.
I built my skill and it activates, but the output is wrong. Is that an anti-pattern problem?
Probably an instruction body problem, not a description problem. The trigger is working (the skill activates) but the instructions are ambiguous (the output is wrong). Add a specific output format definition to the instruction body: the exact structure, fields, and format of what the skill should produce. Explicit format instructions move output consistency from roughly 60% to over 95% (Addy Osmani, Google Chrome, 2024).
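As an example of what "specific output format definition" means, a hypothetical review skill's instruction body might end with a section like this:

```markdown
## Output format
Return exactly this structure and nothing else:

**Summary:** one sentence describing the overall state of the code
**Issues:** numbered list, one line per issue, ordered by severity
**Verdict:** APPROVE or REQUEST_CHANGES
```

The point is that every field, its order, and its allowed values are spelled out, so the model has nothing left to guess about the shape of the answer.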
Should I add evals before or after fixing these first-skill mistakes?
After. Fix the structural problems first: single-line description with trigger, lean SKILL.md, one output format. Then write evals to verify the happy path is solid. Evals on a poorly structured skill measure the wrong thing.
What happens if I ignore these and just ship the skill anyway?
The skill will work sometimes and fail without explanation at other times. You'll get frustrated because it "worked before." You'll add more rules trying to fix symptoms, which makes the problem worse. The correct fix is always structural: description, body, references. More rules don't solve structural problems.
Last updated: 2026-04-18