Applied Intelligence
Module 11: Skills, Hooks, and Automation

# Skill Design Best Practices

## The art of minimal instruction

Claude already knows how to program. It understands testing, deployment, documentation, design patterns, the whole catalog. Skills should provide what Claude cannot infer: your team's specific conventions, your codebase's quirks, the procedures that exist only in tribal knowledge.

Challenge every line of a skill with one question: does Claude actually need this?

Bad skills explain general concepts:

```markdown
---
name: api-endpoint
description: Creates API endpoints
---

## Background

REST APIs use HTTP methods to perform operations.
GET retrieves data, POST creates resources, PUT updates them...
```

Claude knows all of this. Every token spent explaining REST wastes context that could hold something Claude actually needs.

Good skills skip to the specific:

```markdown
---
name: api-endpoint
description: Creates API endpoints following project conventions
---

## Conventions

- Controllers go in `src/controllers/` with `.controller.ts` suffix
- All endpoints require the `@Authenticated` decorator unless in `ALLOWED_PUBLIC_ROUTES`
- Error responses use `ApiError.from()`, never raw exceptions
- Integration tests in `__tests__/integration/` with `.test.ts` suffix
```

These are things Claude cannot figure out by reading code: where decorators go, how errors get wrapped, where tests live. This is the kind of instruction that actually changes output.

## The 500-line guideline

Keep SKILL.md under 500 lines. This is not an arbitrary style preference. When skills exceed this length, two things go wrong.

First, context economics gets ugly. A 2,000-line skill consumes roughly 8,000-10,000 tokens when it activates. For simple tasks matching the skill's description, most of that content sits unused while displacing conversation history or other relevant context.

Second, coherence breaks down. Long skills pile up instructions that Claude cannot hold together across a full reasoning chain. Line 1,500 might contradict line 200. The model's attention thins out as content expands, and instructions start competing instead of reinforcing.

When a skill starts creeping toward 500 lines, refactor. Move reference material to supporting files. Keep SKILL.md as the high-level procedure and link out to the details.
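A minimal sketch of what that refactor can look like, with hypothetical supporting file names:

```markdown
---
name: api-endpoint
description: Creates API endpoints following project conventions
---

## Procedure

1. Read conventions.md for naming, decorators, and error wrapping
2. Create the controller, then the integration test
3. For response and error shapes, see response-formats.md
```

SKILL.md stays short and coherent; the detail files carry the reference material and come into play only when the task needs them.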

## Freedom levels

Not all tasks need the same level of prescription. Match instruction specificity to how much the task can tolerate variation.

### High freedom: guidelines, not scripts

Some tasks have many valid approaches. Code review. Refactoring suggestions. Documentation improvements. For these, provide principles and let Claude navigate:

```markdown
---
name: code-review
description: Reviews code for quality, security, and maintainability
---

## Review priorities

1. Security vulnerabilities (injection, auth bypass, data exposure)
2. Logic errors and edge cases
3. Performance concerns for hot paths
4. Readability and maintainability
5. Test coverage gaps

## Style

Flag issues with severity (critical/major/minor).
Explain why something is problematic, not just that it is.
Suggest fixes but do not rewrite entire files.
```

High freedom instructions describe what to evaluate and how to communicate findings. The exact steps vary by context, so prescribing them would just get in the way.

### Medium freedom: patterns with parameters

Some tasks have a preferred approach but tolerate variation. Creating test files. Adding new components. Updating configuration. For these, provide the pattern and let Claude adapt it:

````markdown
---
name: component-create
description: Creates React components following project architecture
---

## Structure

New components follow this pattern:

```text
src/components/[ComponentName]/
├── [ComponentName].tsx       # Component implementation
├── [ComponentName].test.tsx  # Unit tests
├── [ComponentName].css       # Styles (CSS modules)
└── index.ts                  # Public exports
```

## Implementation

- Use functional components with hooks
- Props interface named `[ComponentName]Props`
- Default export the component, named export the props interface
- Include at least one happy-path test
````

Medium freedom instructions establish structure without dictating every line. Claude decides implementation details within the pattern.

### Low freedom: exact procedures

Some tasks break if you look at them wrong. Database migrations. Deployment procedures. Configuration changes with no margin for interpretation. For these, prescribe exact steps:

````markdown
---
name: db-migrate
description: Creates and applies database migrations
---

## Creating a migration

Run exactly:

```bash
npm run migrate:create -- --name $0
```

This generates a timestamped file in `migrations/`.

## Migration file requirements

Every migration MUST include:

- `up()` method for applying the change
- `down()` method for reverting it
- Both methods must be idempotent

Do not modify existing migrations. Create new ones to adjust schema.

## Applying migrations

Run exactly:

```bash
npm run migrate:up -- --verify
```

The `--verify` flag is mandatory. It validates migration checksums before applying.

Never use `--force` or `--skip-verify` without explicit user approval.
````


Low freedom instructions leave no room for interpretation.
Claude follows the procedure exactly because deviation causes failures that are annoying to debug and expensive to fix.

<Callout type="info">
High freedom means many routes reach the destination.
Low freedom means there's one path and falling off hurts.
Match your instructions to the terrain.
</Callout>

## Writing effective descriptions

The `description` field punches above its weight.
It determines when Claude activates the skill and whether users find it when browsing capabilities.

### Third person, present tense

Write descriptions as if documenting a tool for a catalog:

```yaml
# Good
description: Analyzes git diffs and generates conventional commit messages

# Bad
description: I can help you write commit messages

# Bad
description: Use this when you need to commit code
```

Third person keeps things consistent across a skill library. When Claude evaluates which skill fits a task, uniform style helps matching work better.

### Include activation triggers

Descriptions should mention when the skill applies, not just what it does:

```yaml
# Good
description: Creates API endpoints following project conventions. Use when adding new routes or when asked about endpoint patterns.

# Bad
description: Creates API endpoints following project conventions.
```

The activation trigger helps Claude decide when to load the skill unprompted. "Use when adding new routes" tells Claude: if someone asks about routes, this skill probably applies.

### Be specific about scope

Vague descriptions match everything, which means they match nothing well:

```yaml
# Too broad
description: Helps with code

# Too narrow
description: Creates Express.js middleware for JWT validation in the auth service

# Right scope
description: Creates middleware functions following project patterns. Use when adding request processing, authentication, logging, or error handling middleware.
```

The right scope captures a category of tasks without matching everything or only one narrow scenario.

## Controlling invocation

Some skills should never run on their own initiative. Deployment, commits, destructive operations: these need explicit user intent.

### The `disable-model-invocation` flag

```yaml
---
name: deploy-production
description: Deploys application to production environment
disable-model-invocation: true
---
```

With this flag, Claude never activates the skill on its own. Users must type `/deploy-production` explicitly. This prevents the unpleasant surprise of asking "how would I deploy this?" and watching Claude actually do it.

Use `disable-model-invocation: true` for:

- Deployments and releases
- Database modifications
- Commits and pushes
- Any operation with side effects that require explicit intent

### The `user-invocable` flag

```yaml
---
name: internal-validation
description: Validates internal data structures for consistency
user-invocable: false
---
```

Setting `user-invocable: false` hides the skill from the command menu. The skill exists only for other skills to call programmatically. Use this for helper skills that form building blocks of larger workflows but make no sense to invoke directly.

## Evaluation-driven development

Skills should not be guessed into existence. Writing instructions based on intuition and hoping they work is how skill libraries fill up with junk. The alternative: systematic evaluation against real tasks.

### The development cycle

  1. Identify the gap: Run Claude on representative tasks without the skill. Watch where it fails or produces inconsistent results. Those failures define what the skill must fix.

  2. Create test cases: Build 3-5 scenarios that exercise the skill's intended behavior. Each test case includes an input, expected behavior criteria, and context files if needed.

  3. Establish baseline: Record how Claude performs on test cases before the skill exists. This gives you a comparison point.

  4. Write minimal instructions: Create just enough to address documented failures. Resist the urge to add instructions for hypothetical scenarios that have not actually happened yet.

  5. Evaluate: Run test cases with the skill active. Compare against baseline. Did the skill improve results? Did it introduce regressions?

  6. Iterate: Refine instructions based on evaluation results. Add specificity where Claude still fails. Remove content where Claude succeeds without guidance.

### Test case structure

A practical test case looks like this:

```json
{
  "name": "create-crud-endpoint",
  "skill": "api-endpoint",
  "input": "Create a CRUD endpoint for user preferences",
  "context_files": ["src/controllers/users.controller.ts"],
  "expected_behavior": [
    "Creates file at src/controllers/preferences.controller.ts",
    "Includes @Authenticated decorator on all methods",
    "Uses ApiError.from() for error responses",
    "Creates integration test file"
  ]
}
```

Run the test, observe results, check each expected behavior. Failures point to skill gaps. Passes confirm the skill is doing its job.
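If test cases live on disk in that JSON shape, a small harness makes steps 3 and 5 of the development cycle repeatable. The sketch below is illustrative: `runAgent` stands in for however you invoke Claude on the input with the skill active, and the outcome check only handles file-existence expectations; anything else still needs a manual look.

```typescript
import { readFileSync, readdirSync, existsSync } from "node:fs";
import { join } from "node:path";

interface SkillTestCase {
  name: string;
  skill: string;
  input: string;
  context_files?: string[];
  expected_behavior: string[];
}

// Placeholder: run Claude on the input with the skill active and
// return the workspace directory the agent worked in.
async function runAgent(testCase: SkillTestCase): Promise<string> {
  throw new Error("wire this to your agent runner");
}

// Coarse outcome check: only verifies "Creates file at ..." expectations.
// Real checks would inspect file contents or run the project's tests.
function checkOutcome(workspace: string, expectation: string): boolean {
  const fileMatch = expectation.match(/Creates file at (\S+)/);
  if (fileMatch) return existsSync(join(workspace, fileMatch[1]));
  return false;
}

async function main() {
  const caseDir = "skill-tests";
  for (const file of readdirSync(caseDir).filter((f) => f.endsWith(".json"))) {
    const testCase: SkillTestCase = JSON.parse(readFileSync(join(caseDir, file), "utf8"));
    const workspace = await runAgent(testCase);
    for (const expectation of testCase.expected_behavior) {
      const passed = checkOutcome(workspace, expectation);
      console.log(`${passed ? "PASS " : "CHECK"} ${testCase.name}: ${expectation}`);
    }
  }
}

main().catch(console.error);
```

Run it once without the skill to record the baseline, then again with the skill active and compare the two sets of results.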

### Evaluate outcomes, not steps

Do not test whether Claude followed a specific path. Test whether Claude produced correct output.

```yaml
# Wrong approach - testing steps
expected:
  - Called the Bash tool with "npm run migrate:create"
  - Read the generated file
  - Edited the up() method first

# Right approach - testing outcomes
expected:
  - Migration file exists with correct timestamp
  - up() method creates the expected table
  - down() method drops the table
  - Migration applies successfully
```

Agents take different paths to correct outcomes. Testing paths creates brittle evaluations that break when Claude gets creative. Testing outcomes validates what you actually care about.

This takes discipline. The temptation is to write skills based on intuition and move on. Teams that build evaluation into their workflow produce skills that actually work. Teams that skip evaluation end up with skill libraries full of untested assumptions and wonder why the agent keeps getting things wrong.

## Common skill design mistakes

### Over-explaining fundamentals

Claude knows programming. Instructions like "functions should have descriptive names" or "use meaningful variable names" waste tokens on things Claude already does by default. Focus on what makes your project different.

### Conflicting requirements

Long skills accumulate instructions that end up contradicting each other. "Always use async/await" in one section and "use callbacks for legacy compatibility" in another will confuse Claude. Before publishing a skill, read through it looking for internal contradictions.

### Mixing concerns

A skill named `code-review` that also handles formatting, deployment suggestions, and documentation updates is three skills crammed into one. Each skill should do one thing. Compose skills for complex workflows.
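Splitting that catch-all into focused skills keeps each description narrow enough to match well; the names below are illustrative:

```yaml
# code-review
description: Reviews code for quality, security, and maintainability

# format-check
description: Checks formatting against project lint rules. Use before committing.

# docs-update
description: Updates documentation affected by a code change. Use after API changes merge.
```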

### Untested descriptions

Descriptions that sound good may not trigger correctly in practice. "Helps with development" matches everything. "Creates TypeScript interfaces" matches nothing when someone asks to "add types." Test descriptions by watching when Claude actually activates the skill across different phrasings of similar requests.

### Nested reference chains

SKILL.md should link to supporting files. Supporting files should not link to other supporting files. Deep reference chains create loading sequences that degrade coherence. Keep the reference hierarchy flat: one level from SKILL.md to detail files, no deeper.
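A flat layout, with hypothetical file names, looks like this:

```text
api-endpoint/
├── SKILL.md             # high-level procedure; points at the files below
├── conventions.md       # naming, decorators, error wrapping
├── examples.md          # one worked controller and its test
└── response-formats.md  # error and response shapes
```

SKILL.md may reference any of the detail files; the detail files reference none of each other.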

## Iteration and maintenance

Skills are code. They deserve the same treatment as production code: version control, review, testing, maintenance.

Review skills quarterly. Delete skills nobody uses. Update skills when procedures change. Refactor skills that have grown unwieldy.

The goal is a lean library where every skill pulls its weight: frequently used, reliably effective, maintained alongside the codebase it supports.
