Merging two skills in a Claude Code skill library reduces your context cost from two 100-token description slots to one. It also combines two precise trigger conditions into one broader one. That broader trigger is the tradeoff: merged skills activate on more prompts than either separate skill would, which is only a good outcome when the two tasks should always arrive together.
TL;DR: Merge Claude Code skills when they always activate on the same prompts, share more than 60% of their reference material, and produce a combined description under 700 characters. Keep skills separate when they serve different workflow phases, different users, or when one needs to be maintained or archived independently. The merge decision is a context-budget optimization, not a quality improvement.
What Are the Two Competing Costs in the Merge vs. Split Decision?
Two costs pull in opposite directions when you decide whether to merge skills: discovery budget cost, where each skill description occupies system prompt space at session startup, and trigger precision cost, where a merged description must cover two tasks and therefore matches a broader set of prompts than either separate skill would.
Discovery budget cost: Each skill description occupies approximately 100 tokens in the system prompt at session startup (source: Claude Code documentation, 2024). A library of 20 skills spends 2,000 tokens on discovery. Merging two skills into one saves 100 tokens per session. At the 30-skill degradation threshold, 100 tokens is the difference between a healthy library and one starting to slip. Skill budget research published in December 2025 found that Claude Code's undocumented system prompt metadata budget sits at approximately 15,500 characters: with 63 skills installed, 21 of them (33%) became completely invisible to the agent and could not be discovered or invoked (Alexey Pelykh, skill-budget-research, GitHub Gist, Dec 2025).
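The budget arithmetic above can be sketched as a quick self-check. This is a minimal sketch under the cited third-party figures; the ~15,500-character budget and ~109-character per-entry overhead are research estimates, not documented limits, and the function name is hypothetical:

```python
# Estimate how much of the skill metadata budget a library consumes.
# Both constants are third-party research estimates, not documented limits.
METADATA_BUDGET_CHARS = 15_500   # approximate total metadata budget
PER_SKILL_OVERHEAD_CHARS = 109   # approximate fixed XML overhead per entry

def budget_usage(description_lengths):
    """Return (characters used, fraction of budget) for a skill library."""
    used = sum(PER_SKILL_OVERHEAD_CHARS + n for n in description_lengths)
    return used, used / METADATA_BUDGET_CHARS

# A 20-skill library with 400-character descriptions uses about two thirds
# of the estimated budget; skills past the ceiling become invisible.
used, fraction = budget_usage([400] * 20)
```

Under these assumptions, merging two such skills frees one 509-character entry (overhead plus description) before any description consolidation.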
Trigger precision cost: Every merge broadens the combined description to cover two tasks instead of one. A broader description matches more incoming prompts. Some of those additional matches are correct: you wanted the skill. Some are not: the skill fires when a simpler response would have served. Trigger precision decreases with each merge. Research on context length and model performance shows that performance degrades as context grows longer, even when relevant information is fully retrievable. Every additional token in the system prompt worsens the signal-to-noise problem (Nelson Liu et al., Stanford NLP Group, "Lost in the Middle," arXiv 2307.03172, 2023).
The right merge decision depends on which cost is currently limiting your library's performance. If you are hitting the discovery budget ceiling and your top two merge candidates share their activation context, merge them. If trigger precision is already causing false positives, do not merge further until you have fixed the existing descriptions. Controlled experiments across five frontier models showed that extending input length alone substantially degrades reasoning performance, even when the model can retrieve every relevant token with 100% exact match (Yao et al., "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," arXiv 2510.05381, 2025).
When Should You Merge Multiple Small Skills into One?
Merge skills when all four of these conditions are true: the skills always activate on the same prompts, they share more than 60% of their reference material, a merged description stays under 700 characters, and a single owner or team governs the combined skill. All four must hold; if any one is missing, the merge creates more problems than it solves.
- They always fire on the same prompts: if Skill A activates on every prompt that activates Skill B, you have one workflow in two files and merging loses no precision.
- They share more than 60% of their reference material: the combined skill loads that shared content once instead of twice, which is where merged skills recoup their context budget savings.
- The combined description stays under 700 characters: if covering both tasks requires more than 700 characters, treat the bloat as diagnostic; the tasks need separate activation contexts.
- Neither skill needs independent maintenance paths: if different teams own each skill, merging means both teams approve every change; separate skills preserve independent ownership.
If you analyze invocation logs and find that Skill A activates on every prompt that activates Skill B, the triggers are semantically equivalent. Merging loses no precision because separate activation was never happening. Conversely, if writing a merged description requires more than 700 characters without becoming vague, treat that as a hard stop: the tasks are not conceptually adjacent enough to share a trigger. Third-party benchmarks of Claude Code skill libraries found that keeping skills focused on a single trigger condition reduced unnecessary activations and cut token costs by up to 70% in real sessions, compared to broad-trigger merged skills that fired on adjacent prompts (MindStudio, "5 Claude Code Skills That Cut Token Costs by Up to 70%," 2025).
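The four conditions can be written as a single gate that blocks a merge when any one fails. A hypothetical sketch: the field names and helper simply restate the checklist above and are not part of any Claude Code tooling.

```python
from dataclasses import dataclass

@dataclass
class MergeCandidate:
    always_coactivate: bool         # A fires on every prompt that fires B
    shared_reference_ratio: float   # fraction of reference material in common
    merged_description_chars: int   # length of the candidate merged description
    single_owner: bool              # one owner/team governs the combined skill

def should_merge(c: MergeCandidate) -> bool:
    """All four conditions must hold; any one missing blocks the merge."""
    return (
        c.always_coactivate
        and c.shared_reference_ratio > 0.60
        and c.merged_description_chars < 700
        and c.single_owner
    )
```

For example, `should_merge(MergeCandidate(True, 0.75, 620, True))` passes the gate, while weakening any single field blocks it.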
A library of 40 small skills has a focus problem. A library of 5 mega-skills has a trigger problem. The right number is in the middle, and the merge decision is how you find it. Simon Willison, creator of Datasette and the llm CLI, identifies ambiguous instructions as the root cause of almost every production AI failure, arguing that the model is rarely the weak link (Simon Willison, 2024). Merged skill descriptions that try to cover two tasks in one trigger are a textbook case of that ambiguity.
"Developers don't adopt AI tools because they're impressive — they adopt them because they reduce friction on tasks they repeat every day." — Marc Bara, AI product consultant (2024)
A merged skill that activates on unintended prompts creates friction: Claude runs a 12-step process when the user wanted a 2-sentence answer. That is the specific way merge decisions go wrong.
When Should You Keep Skills Separate?
Keep skills separate when they fire at different workflow phases, serve different user roles, or carry independent maintenance histories. The 400-line combined instruction threshold is the hard ceiling: beyond that point, instruction adherence drops in production regardless of how clean the merge looks on paper.
- Different workflow phases: a skill for "drafting a PR description" and a skill for "reviewing a PR for approval readiness" should stay separate. They fire at different points in the workflow, for different reasons, sometimes for different users. Merging them creates a skill that runs a review and drafts a description at the same time, which is never what anyone wanted.
- Different users or audiences: if one skill is for senior engineers running architecture reviews and another is for junior engineers running code quality checks, merging them creates a skill whose instructions try to serve both audiences. It serves neither well.
- One needs to be archived independently: if your team decides to retire the PR description skill but keep the review skill, separate files make the decision a one-line archive operation. A merged skill requires refactoring to extract the retained portion.
- Instruction volume exceeds 400 lines combined: skills built past 400 lines in the SKILL.md body show declining instruction adherence in production environments compared to separate skills at half the length each. Two 200-line skills outperform one 400-line skill on instruction-following metrics in our commissioned builds. Chroma's context rot research (July 2025) tested 18 frontier models and found every one showed measurable performance degradation as input token count grew, with more complex tasks exhibiting more severe decline (Chroma Research, "Context Rot: How Increasing Input Tokens Impacts LLM Performance," 2025).
How Do You Evaluate a Merge Before Committing?
Run three steps before merging any two skills: write the merged description and verify it stays under 700 characters, test trigger overlap on a 20-prompt validation set to confirm both skills already share their activation context, and check that the combined instruction length stays under 400 lines. All three steps gate the decision.
- Write the candidate merged description: if you cannot describe both skills' trigger conditions in one description under 700 characters without it becoming vague, stop. The tasks are not mergeable without losing trigger precision. Note that Claude Code 2.1.86 introduced a 250-character soft cap for skill descriptions in the /skills listing; a merged description that exceeds this threshold may have its tail truncated during discovery (claude-code GitHub issue #40121, anthropics/claude-code, 2024).
- Test trigger overlap on a 20-prompt validation set: run 10 prompts that should activate Skill A but not B, and 10 prompts that should activate Skill B but not A. If both sets activate both skills in your current setup, the triggers already overlap and merging them makes the overlap explicit. If the sets activate cleanly separately, merging them will create new overlap.
- Check combined instruction length: if the merged SKILL.md body would exceed 400 lines, the tasks are not candidates for a single skill. Build a shared reference file instead and have both separate skills load that file when needed. Each skill entry in the system prompt carries approximately 109 characters of fixed XML overhead beyond its description length, so merging also eliminates one fixed overhead slot from the total budget (claude-code GitHub issue #13099, anthropics/claude-code, 2025).
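The 20-prompt overlap test can be scripted once you log which skills fire per prompt. A sketch under the assumption that you can record activations as a prompt-to-skill-set mapping; the function and skill names are illustrative:

```python
def classify_overlap(activations, a_only_prompts, b_only_prompts,
                     skill_a="skill-a", skill_b="skill-b"):
    """Count cross-activations on the validation set.

    activations: dict mapping prompt text -> set of skill names that fired.
    Returns (A-only prompts that also fired B, B-only prompts that also fired A).
    Counts near 10/10 mean the triggers already overlap and a merge makes that
    explicit; counts near 0/0 mean a merge would create new overlap.
    """
    a_cross = sum(1 for p in a_only_prompts if skill_b in activations.get(p, set()))
    b_cross = sum(1 for p in b_only_prompts if skill_a in activations.get(p, set()))
    return a_cross, b_cross
```

Feeding it the 10 A-only and 10 B-only prompts from the step above gives a concrete overlap count instead of an impression from scattered sessions.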
For the token economics of discovery budget management, see At What Skill Count Does Claude's Performance Actually Degrade? and How Does the 100-Token Per Skill Metadata Cost Shape Library Architecture at Scale?.
What Are the Warning Signs That a Merged Skill Needs to Be Split?
Three signals indicate a merged skill has outlived its usefulness: it fires on prompts the user did not intend, it accumulates two separate streams of unrelated change requests, or its instructions begin to contradict each other. Any one of these signals is sufficient reason to split. Waiting for all three compounds the maintenance debt.
- Activation on wrong-context prompts: if users report the skill fires when they did not want it, the merged trigger is too broad. The fix is either splitting the skill back into two with precise descriptions, or tightening the merged description to add negative exclusions.
- Two distinct maintenance change requests: if the skill gets two separate, unrelated change requests in the same quarter (one touching the PR drafting section and one touching the review section), the skill has two independent codebases inside one file. Split it before the next change cycle.
- The instructions contradict each other: a merged skill that covers two use cases sometimes produces instructions that conflict: "always request the full context before proceeding" in one section, "proceed immediately with available information" in another. Contradictory instructions cause inconsistent behavior. Split the skill.
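The second warning sign, two unrelated change streams, can be detected mechanically from a change log. A hypothetical sketch assuming each change request is tagged with the quarter it landed in and the SKILL.md section it touched:

```python
from collections import defaultdict

def split_signal(change_requests):
    """Flag a merged skill for splitting when different sections of its
    SKILL.md receive change requests in the same quarter.

    change_requests: list of (quarter, section) tuples,
    e.g. ("2026Q1", "review").
    """
    sections_per_quarter = defaultdict(set)
    for quarter, section in change_requests:
        sections_per_quarter[quarter].add(section)
    return any(len(sections) >= 2 for sections in sections_per_quarter.values())
```

A drafting-section change and a review-section change in the same quarter trips the flag; repeated changes to one section do not.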
This framework applies when you control the skill files. For community skills installed from external repositories, skip the merge decision entirely and manage discovery budget by archiving unused community skills rather than modifying their structure.
Frequently Asked Questions
Is it better to start with separate skills and merge later, or start merged and split if needed?
Start separate, and merge only after production logs confirm trigger overlap. Merging is a compression operation that works well when the overlap is already proven: it is easier to merge two clean, precise skills than to split a merged skill that has accumulated assumptions about shared context, and splitting later is more expensive than the merge ever saved. The merge direction preserves the quality you built into the originals.
Can merged skills share a single set of reference files?
Yes, and this is often the primary benefit. A merged skill that loads a shared coding standards reference once is more efficient than two separate skills each loading the same file. The reference file itself stays separate; the skill that loads it becomes one instead of two.
What is the maximum practical size for a merged SKILL.md file?
400 lines is the production bar for merged skills. Above that length, instruction adherence drops measurably and the skill becomes harder to maintain. If the merged content exceeds 400 lines, the tasks belong in separate skills that share a reference file.
Does merging skills affect the Claude A / Claude B testing process?
Yes. Merged skills are harder to test in isolation because testing Skill A's behavior requires ensuring Skill B's instructions do not interfere. Keep this in mind during the evaluation phase: a merged skill needs a test set that covers both task types, not just one.
Can I merge a user-level skill with a project-level skill?
Not cleanly. User-level and project-level skills have different installation paths and different update processes. Merging them requires choosing one location, which affects whether the merged skill is available globally or only in the current project. In most cases, keeping team-shared skills at project level and personal workflow skills at user level is cleaner than merging across the boundary.
What happens to the merged skill's word count for the discovery budget?
The combined description field length for a merged skill counts toward the approximately 15,500-character metadata budget as a single entry (Alexey Pelykh, skill-budget-research, GitHub Gist, Dec 2025). If two 400-character descriptions consolidate into a 600-character merged description (rather than a simple 800-character concatenation), the consolidation saves a further 200 characters on top of the per-entry cost that merging eliminates. Description consolidation compounds the budget benefit: each skill carries roughly 109 characters of fixed overhead in the system prompt, so the description length is only part of the cost.
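Under the same assumed ~109-character per-entry overhead (a third-party estimate, not a documented figure), the arithmetic in this answer works out as:

```python
PER_SKILL_OVERHEAD_CHARS = 109  # third-party estimate of fixed XML overhead

def merge_savings(desc_a_chars, desc_b_chars, merged_desc_chars):
    """Characters of metadata budget freed by merging two skills into one."""
    before = 2 * PER_SKILL_OVERHEAD_CHARS + desc_a_chars + desc_b_chars
    after = PER_SKILL_OVERHEAD_CHARS + merged_desc_chars
    return before - after

# Two 400-character descriptions consolidated into 600 characters free one
# 109-character overhead slot plus 200 characters of description budget.
saved = merge_savings(400, 400, 600)
```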
Last updated: 2026-05-04