---
title: "What Is a 'Fair-Weather Skill' That Only Works on Easy Inputs?"
description: "A fair-weather Claude Code skill works on ideal inputs but breaks on real-world edge cases. Learn what causes it and how to build production-ready skills."
pubDate: "2026-04-14"
category: skills
tags: ["claude-code-skills", "anti-patterns", "skill-engineering", "testing"]
cluster: 22
cluster_name: "Anti-Patterns & Failure Modes"
difficulty: beginner
source_question: "What is a 'fair-weather skill' that only works on easy inputs?"
source_ref: "22.Beginner.2"
word_count: 1420
status: draft
reviewed: false
schema_types: ["Article", "FAQPage"]
---
# What Is a "Fair-Weather Skill" That Only Works on Easy Inputs?
Quick answer: A fair-weather skill is a Claude Code skill that performs correctly on ideal inputs during development but breaks on real-world edge cases, ambiguous requests, incomplete inputs, unusual formatting, or scenarios the developer didn't consider when writing the instructions. The cause is almost always that the skill was tested only with inputs the developer controlled.
Every skill looks good on the prompts you write yourself. You know what the skill needs. You frame the request correctly. You provide complete context. The skill produces exactly the right output.
Then a teammate uses it for the first time, or you invoke it on a messy real-world task, and the output breaks.
The skill didn't fail because the instructions were wrong. It failed because the instructions were only right for the narrow range of inputs you tested with. That is a fair-weather skill.
At AEM, we track fair-weather patterns as the most common cause of Claude Code skill failures in production. The majority of skills submitted for audit pass their developer's own test suite but fail immediately when handed to a colleague or placed into a real workflow, because the developer's prompts had silently compensated for gaps the instructions never covered.
## What Does "Fair-Weather Skill" Mean in Skill Engineering?
A fair-weather skill works when the inputs are ideal and the conditions match development closely — and breaks when they don't. The term names a skill that performs on the narrow range of inputs the developer tested with, and fails on any input that deviates from those conditions, even when the deviation is minor or entirely predictable in production.
The name captures the pattern: a sailor who can only navigate in calm water is not a navigator. A skill that only works on clean, developer-crafted inputs is not a production skill.
Fair-weather skills share three characteristics:
- The instructions handle the default path. The happy path, the most expected input in the most expected format, works every time.
- The instructions don't handle deviations. When the input is incomplete, the user phrases the request differently than the developer expected, or the data arrives in an unusual format, the skill produces wrong output or no output at all.
- The developer doesn't know this. Because testing happened in controlled conditions with controlled inputs, the failure modes are invisible until production.
## Why Do Fair-Weather Skills Get Built?
Because developers test skills with inputs they wrote themselves — a pattern called Claude A bias, where the person building the skill also controls every test prompt, meaning the skill is only ever validated against scenarios the developer already understood and expected. This is the same bias that affects all AI development when the developer and the tester are the same person.
Claude A is the session where you build and test the skill. You type: "Write a technical blog post about authentication in Python." The skill activates. The output is correct. You refine the instructions and test again. The skill still works. You ship it.
Claude B is the fresh session a colleague uses when they type: "can you help with a blog post, something about auth, I'm not sure of the title yet, it's for a Python tutorial series we're building." The input is real, messy, and incomplete. The skill either:
- Produces an article with a placeholder title and wrong scope
- Generates something plausible but off-brief
- Activates a different skill entirely because the prompt doesn't match the description cleanly
None of these are the fault of the colleague. The skill was not built to handle inputs like theirs. It is a fair-weather skill.
## What Does a Fair-Weather Skill Look Like vs a Production Skill?
Here's the same skill built two ways — the fair-weather version handles only the ideal input path, while the production version explicitly anticipates missing data, ambiguous phrasing, and deviations from the expected format before any output is generated.
Fair-weather version (handles ideal input only):
```markdown
## Step 3 — Write the post

Write a technical blog post based on the topic the user provided. Include an introduction,
3-5 main sections, and a conclusion.
```
This instruction works when the user provides a fully specified topic. It breaks when the user provides:
- a vague topic
- a topic question instead of a title
- multiple possible topics
- no topic at all
Production version (handles realistic input variation):
```markdown
## Step 3 — Clarify scope before writing

If the user has not provided: (a) a specific title or topic question, (b) a target audience, and
(c) a desired length or depth — ask for these before starting. Do not assume defaults for missing
information. If all three are provided, proceed to writing.

## Step 4 — Write the post

Write a technical blog post based on the confirmed topic and audience. Minimum structure:
H1 title, 40-60 word TL;DR paragraph, 3-5 H2 sections, FAQ block with 3+ Q&As.
If the user asks for a different structure, confirm before deviating from the minimum.
```
The production version handles each of these cases:
- missing titles
- missing audience specification
- alternative structures
- ambiguous input
It takes more instruction lines. It produces reliable output on realistic inputs.
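The minimum structure in Step 4 can also be checked mechanically after each run. Here's a minimal sketch in Python: the contract fields (one H1 title, a TL;DR paragraph, 3-5 H2 sections, 3+ FAQ questions) come from the production skill above, while the regex heuristics and the function name `check_output_contract` are illustrative assumptions, not part of any skill API.

```python
import re

def check_output_contract(post: str) -> list[str]:
    """Return a list of contract violations; an empty list means the post
    meets the minimum structure from Step 4 of the production skill."""
    problems = []
    # Exactly one H1 title.
    h1_titles = re.findall(r"(?m)^# [^#]", post)
    if len(h1_titles) != 1:
        problems.append(f"expected 1 H1 title, found {len(h1_titles)}")
    # A TL;DR paragraph (the 40-60 word count check is omitted for brevity).
    if not re.search(r"(?m)^\**TL;DR", post):
        problems.append("missing TL;DR paragraph")
    # 3-5 H2 sections.
    h2_sections = re.findall(r"(?m)^## [^#]", post)
    if not 3 <= len(h2_sections) <= 5:
        problems.append(f"expected 3-5 H2 sections, found {len(h2_sections)}")
    # FAQ block with 3+ question headings.
    faq_questions = re.findall(r"(?m)^#{2,3} .*\?\s*$", post)
    if len(faq_questions) < 3:
        problems.append(f"expected 3+ FAQ questions, found {len(faq_questions)}")
    return problems
```

A checker like this turns "the output matches the output contract" from a judgment call into a repeatable test you can run on every adversarial input.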
## How Do I Make My Skill Production-Ready Instead of Fair-Weather?
Three techniques remove the fair-weather failure modes: adversarial input testing, explicit failure mode naming in the instructions, and a Claude B fresh-session test before shipping — each one is specific, independent of the others, and takes under an hour to implement for a typical single-domain skill.
### How do I test with adversarial inputs?
After writing the skill, test it with 5–10 inputs you did not craft specifically for the skill — the goal is to simulate what a real user types, not what you typed while building it, because those two prompt populations are consistently different in ways that expose fair-weather failure modes. Use:
- Incomplete prompts ("write a blog post" with no topic)
- Ambiguous prompts ("help me with content about authentication, it's complex")
- Off-format inputs (a bullet list of ideas instead of a title)
- Multi-part prompts ("I need a blog post and also a LinkedIn summary for it")
- Prompts with incorrect assumptions ("write a 10,000-word post about Python auth")
If the skill handles all of these correctly, it is not a fair-weather skill. If any of them break it, add instructions that handle the specific failure.
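One way to make this repeatable is a small harness that replays the same adversarial prompts after every instruction revision. A sketch under stated assumptions: you supply a `runner` callable that invokes the skill however your setup does (for example by shelling out to Claude Code in a fresh session), and the emptiness check stands in for your real output contract.

```python
# The five adversarial categories above, as concrete replayable prompts.
ADVERSARIAL_PROMPTS = [
    "write a blog post",                                        # incomplete: no topic
    "help me with content about authentication, it's complex",  # ambiguous
    "- oauth flows\n- token storage\n- session vs JWT",         # off-format: bullet ideas
    "I need a blog post and also a LinkedIn summary for it",    # multi-part
    "write a 10,000-word post about Python auth",               # incorrect assumption
]

def run_adversarial_suite(runner) -> dict[str, str]:
    """Replay each prompt through `runner` (prompt -> skill output) and
    return the prompts whose output came back empty. The emptiness check
    is a placeholder: substitute your own output contract check."""
    failures = {}
    for prompt in ADVERSARIAL_PROMPTS:
        output = runner(prompt)
        if not output.strip():
            failures[prompt] = output
    return failures
```

Keeping the prompt list in version control alongside the skill means every revision gets tested against the same realistic variation, not against whatever you happened to type that day.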
### How do I name and handle failure modes?
Fair-weather skills don't name what can go wrong — production skills do, by adding explicit conditional handling for each edge case directly in the instruction set, so Claude has a defined response path for every deviation rather than attempting to guess the right behaviour from context. For each step that has edge cases, add a conditional:
```markdown
## Step 2 — Validate input

If the user's prompt is missing a target topic, ask: "What specific topic should this article cover?"
If the user provides a topic question instead of a title ("how does OAuth work?"), convert it to
a working title before proceeding ("How OAuth Works: A Developer's Guide").
If the user provides multiple possible topics, ask them to select one before starting.
```
Named failure modes get handled. Unnamed failure modes produce inconsistent output. In AEM's production skill work, we've found that most edge-case failures trace back to a step that described what to do on the expected path but said nothing about what to do when the input deviated from it.
### What is the Claude B test and when do I run it?
Run the Claude B test before any skill reaches a team or production workflow: open a completely fresh Claude Code session with no context carried from development, type a natural prompt you'd realistically receive from a user who doesn't know the skill exists, and observe whether the skill activates correctly and produces output that matches the output contract. Observe specifically whether:
- The skill auto-activates correctly
- The output matches the output contract
- Edge cases in the input are handled appropriately
If the fresh-session test passes, the skill is production-ready. If it fails, the instructions are not yet complete enough to work without developer context.
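If you want to script the fresh-session step, Claude Code's non-interactive print mode can help. A minimal sketch, assuming the `claude` CLI is on your PATH and your installed version supports `-p` (print mode); the `command` parameter is my own addition so the function can be exercised without the CLI installed.

```python
import subprocess

def claude_b_test(prompt: str, workdir: str = ".", command: str = "claude") -> str:
    """Run one cold prompt in a fresh, non-interactive session started in
    `workdir` (a directory with the skill installed but no conversation
    history) and return the raw output for manual review."""
    result = subprocess.run(
        [command, "-p", prompt],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.stdout
```

Review the returned output by hand: the Claude B test is a judgment check on activation and contract fit, not an automated pass/fail assertion.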
> "The failure mode is not a crash. It is a quiet omission that looks like completed work." — Marc Bara, Project Management Consultant (March 2026, https://medium.com/@marc.bara.iniesta/claude-skills-have-two-reliability-problems-not-one-299401842ca8)
For a structured approach to testing and iteration, see The Complete Guide to Building Claude Code Skills in 2026.
## How Do I Tell If an Existing Skill Is Fair-Weather?
Three signals indicate a fair-weather skill in production: the skill works for you consistently but teammates report inconsistent results, it handles simple requests but fails on complex or multi-part ones, and it produces correct output structure but wrong content when the input is ambiguous — all three point to the same root cause.
Signal 1: The skill works consistently for you but team members report inconsistent results. Your prompts are shaped to fit the skill. Theirs are not.
Signal 2: The skill works on simple requests but breaks on complex or multi-part ones. Simple requests match the developer's test cases. Complex requests don't.
Signal 3: The skill produces correct output structure but wrong content when the input is ambiguous. It followed the steps but guessed on the gaps the instructions didn't cover.
All three signals point to the same root cause: the instruction set doesn't handle input variation. The fix is the same:
- test with adversarial inputs
- name the failure modes
- add conditional handling for each one found
For the broader pattern of anti-patterns and how to diagnose them, see What Are the Most Common Mistakes When Building Claude Code Skills?.
## Frequently Asked Questions
### How many adversarial test inputs do I need before a skill is production-ready?
Test until you find no new failure modes. In practice, 10-15 adversarial inputs covers most realistic variation for a single-domain skill. For multi-domain skills or skills with complex input handling, 20-30 inputs are appropriate. When 5 consecutive inputs produce correct output without any instruction revisions, the skill has passed the adversarial threshold.
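The stopping rule in that answer ("5 consecutive correct outputs with no revisions") is easy to encode. A minimal sketch; the function name and the pass/fail list representation are my own assumptions.

```python
def passed_adversarial_threshold(results: list[bool], streak: int = 5) -> bool:
    """`results` holds one pass/fail per adversarial input, in test order;
    a pass means correct output with no instruction revision needed.
    The skill clears the threshold once the last `streak` results all pass."""
    return len(results) >= streak and all(results[-streak:])
```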
### Can I build a fair-weather skill intentionally for controlled environments?
Yes. If your skill only ever receives developer-crafted inputs (for example, a skill that runs in an automated pipeline with validated inputs), you don't need to handle edge cases that can't appear. The fair-weather pattern is a problem in user-facing workflows. In tightly controlled pipelines, it's an acceptable scope limitation. Document the input constraints explicitly in the output contract.
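What that documentation might look like inside the skill file, following the same instruction style as the examples above (the wording is illustrative):

```markdown
## Input constraints (intentional scope limitation)

This skill runs only inside the automated publishing pipeline. Inputs are
always a fully specified topic title and target audience, validated upstream.
Edge-case handling for vague, multi-part, or off-format prompts is
deliberately out of scope. Do not invoke this skill interactively.
```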
### What's the difference between a fair-weather skill and an incomplete skill?
A fair-weather skill has complete instructions for its happy path. An incomplete skill is missing steps entirely. Fair-weather skills break on edge cases. Incomplete skills break on expected inputs too. Both are fixable, but they require different fixes: fair-weather skills need adversarial test coverage and conditional handling; incomplete skills need the missing steps written.
### Is the Claude A / Claude B problem specific to fair-weather skills?
It's the mechanism that creates fair-weather skills, but it affects any skill built without external testing. Even a skill with solid edge-case handling can have blind spots introduced by Claude A bias: the developer's prompt framing fills in gaps the instructions don't cover. The Claude B test is the standard check: test in a fresh session with a cold, natural prompt.
### How do I get useful adversarial test inputs if I can't predict what users will type?
Look at actual usage. If the skill has been in production, review the conversation logs for prompts where the skill produced wrong output or behaved unexpectedly. Those are your adversarial inputs: they already exist and have already found failure modes. For new skills, ask teammates to use the skill without any guidance from you and observe what they type. Their unguided prompts are the most realistic test cases you can get.
Last updated: 2026-04-14