Skills Are Engineered Systems, Not Prompts
How I designed a Claude Code skill that scaffolds production-grade Bruno API test collections — and why the design process matters more than the artifact.

A Claude Code skill is not a detailed prompt. It is a program. It has scope boundaries, a validation surface, failure modes, and rules that either hold under real conditions or don't. The question is not "what does it do" — it's "how do you know it works." This article covers the design of a skill that scaffolds production-grade Bruno API test collections, and the engineering process that took it from 0% to ship-ready across three external passes.
Key takeaways
- A skill is a program: scope boundaries, validation surface, and failure modes — not a longer prompt
- Internal review cannot close the gap between "looks correct" and "actually works" — only execution against real conditions can
- Author/reviewer context isolation is the load-bearing structural choice; a reviewer who knows the rationale is primed to defend the rule
The Artifact
The skill generates complete Bruno API test collections for any REST backend: directory structure, environments, request files with declarative assertions and Chai test scripts, a CI workflow, a pre-commit hook, and a maintenance rule injected into the project's CLAUDE.md. Final size: 2,790 lines across one SKILL.md and six reference files. Three external iterations against a real backend. The pass rate went from 0% to 60%, with the remaining gap entirely attributable to project-level test data state, not skill output.
That outcome required an engineering process. The process is what this article is about.
Why Bruno
Before designing anything, the tool had to be chosen. A comparative analysis against Postman produced a recommendation for engineering leadership. The conclusion was Bruno, but the reasoning that carried into the skill design was specific: Bruno is filesystem-first. Collections are plain files committed alongside code, versioned with Git, and readable by any standard tool.
That property is what makes AI-assisted authoring cheap. A skill that generates files into a project directory is trivially integrated with Claude Code. A skill that talks to a cloud API through an MCP server — authenticating, handling rate limits, modifying hosted artifacts — is not. The filesystem-first architecture of Bruno was the prerequisite for the skill's architecture to be simple.
A second decision followed from the same logic. Bruno 3.0.0 supports two collection formats: the legacy .bru markup and the new OpenCollection YAML. OpenCollection YAML was the explicit choice: it's the recommended format going forward, it's standard-compliant, and any standard YAML linter can validate it without Bruno-specific tooling. That last property matters for the pre-commit hook and the CI pipeline the skill also generates.
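That linter-friendliness is easy to picture. A minimal sketch of such a check using the pre-commit framework and yamllint, illustrative only: the pinned rev and the bruno/ file pattern are assumptions, and the skill ships its own hook template.

```yaml
# .pre-commit-config.yaml (sketch; not the skill's generated template)
repos:
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1                     # pin to whatever version you use
    hooks:
      - id: yamllint
        files: ^bruno/.*\.ya?ml$     # assumed collection location
```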
Both decisions constrained the rest of the design in productive ways.
Designing the Skill
Scope as a constraint, not a shortcut
The skill generates new collections. It does not update existing ones, migrate Postman collections, or handle GraphQL. Those exclusions were deliberate.
The update case was the most important one to cut. I wanted ongoing maintenance to happen during regular Claude Code sessions — an engineer changes an endpoint, Claude Code sees the API has changed and updates the collection as part of the same session. That behavior belongs in a CLAUDE.md maintenance rule injected by the skill, not in a separate update mode. The skill's job is to establish the collection and the contract; the contract's job is to keep the collection current. Conflating the two would have made both worse.
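For illustration, the injected contract might read something like this (hypothetical wording; the skill's actual rule text differs):

```markdown
## Bruno collection maintenance
When an API endpoint is added, removed, or changed, update the matching
request file under /bruno/ in the same session: request shape, assertions,
and docs. Do not leave the collection stale.
```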
Pre-flight checks enforce the scope: if /bruno/opencollection.yml already exists, the skill stops and says so. Hard stop, not a prompt to confirm. A skill that can accidentally overwrite a maintained collection is not a safe skill.
Progressive disclosure over front-loading
The SKILL.md is the entry point. At first-draft stage it was roughly 270 lines. Detail lives in six reference files: the OpenCollection YAML format spec, the scripting API, test patterns, CI templates, pre-commit hook templates, and the README.md scaffold. The skill reads them on demand — the scripting API reference is only loaded when auth is non-trivial; the CI templates only in the final generation step.
This is not a context-size optimization. It is a clarity choice. Front-loading everything into a single file produces a document that no one navigates well, including the model. Deferring detail to reference files keeps the main workflow readable and the references authoritative.
Opinionation as a forcing function
Early drafts contained "prefer A, but B is acceptable when..." constructs throughout. I collapsed them to a single prescribed behavior on every decision where an alternative existed.
The reason is reviewability. When a skill prescribes one approach, a reviewer can check "did the generated output follow the rule" as a binary. When it prescribes two, the reviewer has to read intent. Intent doesn't surface in a generated YAML file. Binary compliance does.
Opinionation also clarifies the rules themselves. "Prefer before-request scripts for missing-variable guards" is ambiguous. "Every chained request must have a before-request script that throws with the producer's actual file path in the error message" is a rule a reviewer can check in thirty seconds.
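A sketch of what the prescriptive version produces. The bru.getVar call is Bruno's real scripting API; the YAML field names (scripts, pre-request) and the file layout are assumptions about the OpenCollection format, not quotes from the skill:

```yaml
# 20-orders/get-order.yml (illustrative request file)
name: Get Order
http:
  method: GET
  url: "{{baseUrl}}/orders/{{orderId}}"
scripts:
  pre-request: |
    // Guard: fail loudly, naming the producer's actual file path
    if (!bru.getVar("orderId")) {
      throw new Error("orderId is missing: run 20-orders/create-order.yml first");
    }
```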
Rules with attached failure modes
The skill's critical rules section grew from roughly 10 rules to 25 across iterations. Every rule in the final version carries the failure mode it prevents. "Use OpenCollection short-form assertion operators (eq, neq, lt...) — never equals, lessThan, notEquals" is paired with "those silently fall back to string comparison and every assertion fails at runtime." The rule and its consequence are one entry.
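Illustrated with the operator rule (the operator spellings are from the rule itself; the surrounding assertion structure is an assumed sketch of the OpenCollection format):

```yaml
assertions:
  # Correct: short-form operator, evaluated as a typed comparison
  - expr: res.status
    operator: eq
    value: 200
  # Wrong: long-form names like `equals` or `lessThan` silently fall
  # back to string comparison, and every assertion fails at runtime
```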
Rules without consequences age poorly. They read as style preference and get deprioritized under time pressure. Rules with consequences read as constraints, because they are.
The Validation Loop
What internal review can and cannot catch
After the initial design phase, I ran three internal iterations: reviewing the skill's text against my own understanding of the format and the expected outputs. These caught consistency issues, missing documentation, and advisory language that needed to become prescriptive. They were necessary.
They were insufficient.
The Bruno CLI 3.2.x auth: block bug is a good example. Bruno's OpenCollection format includes a typed authentication block — auth: { type: bearer, token: "{{authToken}}" } — which the spec documents as the standard way to attach a bearer token to requests. The skill's initial design used it. In practice, the CLI silently ignores it: it logs a single stderr warning (toBrunoAuth failed: Unsupported auth type) and sends no Authorization header. Every authenticated request in the generated collection returned 401, and the warning was easy to miss in CI output. I had read the spec. I had written the rule. The rule was wrong. You cannot discover that a correctly-authored rule produces incorrect behavior by re-reading the rule.
Nor would re-reading have caught the folder ordering problem. Bruno's CLI runs request folders in lexicographic order, which means the collection's execution sequence is determined by folder names. The initial design used zzz-cleanup/ for teardown — last alphabetically, therefore last to run — which looked correct. It breaks as soon as any other folder is added whose name sorts after zzz, or when a new domain folder needs to slot between two existing ones. The dependency ordering between folders was structurally unreliable. The fix — numeric prefixes (00-auth/, 10-reference-data/, 20-{domain}/, 90-cleanup/) encoding explicit sequence — is obvious in hindsight and impossible to discover by reading a document.
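The layout those prefixes produce, reconstructed from the names above:

```
bruno/
├── opencollection.yml
├── 00-auth/
├── 10-reference-data/
├── 20-{domain}/
└── 90-cleanup/
```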
Both bugs required running generated code against a real backend.
Author, user, reviewer — three roles, not one
The validation loop ran with three distinct roles:
- Author instance: the Claude Code conversation that held the skill's source files and edited them across all iterations. A single conversation that retained the design rationale and rule history.
- User: me. I ran the skill against the target project, routed reports between conversations, made scope calls, and pushed back on AI suggestions when they were wrong. The author/reviewer separation only works if the user controls what moves between them.
- Reviewer instance: a separate Claude Code conversation, run against the generated /bruno/ collection output rather than the skill's source. It was reused across all three external passes, so it accumulated context from prior reviews and could produce status-delta tables, but it stayed structurally isolated from the author instance.
The author/reviewer separation is the load-bearing structural choice. A reviewer who knows why a rule was written is primed to justify it. A reviewer who sees only the output evaluates it on its merits — and has no incentive to protect any prior decision.
The reviewer for pass 2 isolated the auth: block bug by building a synthetic request against httpbin.org/headers — an endpoint that echoes back every header it receives — to confirm the Authorization header was absent from the request even when authentication was configured. That diagnostic work happened because the reviewer had no attachment to the original rule and no access to my rationale for writing it. Context isolation produced an honest investigation.
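The diagnostic itself is small. A sketch of such a synthetic request, under the same assumed field names:

```yaml
# diagnostic.yml: httpbin echoes back every header it receives
name: Echo Headers
http:
  method: GET
  url: https://httpbin.org/headers
auth:
  type: bearer             # the typed block under test
  token: "{{authToken}}"
# If the echoed "headers" object in the response has no "Authorization"
# key, the typed auth block never made it onto the wire.
```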
Three passes against a real messy backend
The target project was a Laravel codebase with Passport OAuth, multi-tenant auth, third-party CRM integrations, destructive registration endpoints, seeded reference data, and rate-limited API middleware.
| Pass | Requests passing | Main findings |
|---|---|---|
| 1 | 0 / 84 | Operator-name bug broke every assertion. Plus 12 other issues. |
| 2 | 19 / 48 (40%) | Two design reversals required: auth: block is a no-op; alphabetical folder ordering is structurally unreliable. |
| 3 | 26 / 43 (60%) | Six polish items. Reviewer verdict: "ship-ready." |
The remaining 40% gap at pass 3 was entirely project-level: the seeded test user lacked role, company, and permissions that authenticated requests required. No further iteration on the skill would have closed it. Recognising that inflection point was a scope call, not a performance failure.
Almost any live backend would have surfaced the operator-name bug. It took this messy one to surface the auth: block no-op, the 429 mid-suite rate limiting from the API's throttle middleware, and the empty-seed silent-capture failure — where a reference-data request was designed to capture an ID from a database record and store it for downstream requests, but the database seed was empty, and Bruno substituted an empty string for the variable rather than failing loudly.
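The producer-side counterpart to the consumer guards shown earlier closes that hole: validate before capturing. bru.setVar and res.body are Bruno's real scripting API; the YAML field names around them are assumptions:

```yaml
scripts:
  post-response: |
    // Capture a reference ID for downstream requests, but refuse to
    // store an empty value when the seed table has no rows
    const rows = res.body.data;
    if (!rows || rows.length === 0) {
      throw new Error("reference-data seed is empty: nothing to capture");
    }
    bru.setVar("referenceId", rows[0].id);
```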
Severity triage and status-delta tables
Every review used four severity tiers: Critical, High, Medium, Low. From pass 2 onwards, every review opened with a status-delta table showing what happened to each finding from the prior review.
Pass 2's two Criticals, the auth: block reversal and the folder-ordering reversal, were fixed before anything else was touched. Pass 3's six polish items had predictable scope. The iteration budget was legible going into each pass rather than discovered during it.
The delta table also made regression visible. If a pass-2 fix had been reintroduced in pass 3, it would have appeared as "re-introduced" in the first paragraph, not buried in narrative. None were.
Skill-level versus project-level in every review
Every review explicitly categorised findings into four buckets: skill-level must-fix, skill-level nice-to-have, project-specific, and out-of-scope but worth raising upstream.
The auth: block bug is a CLI issue in Bruno 3.2.x — it belongs upstream, filed against the Bruno project. The workaround (setting the Authorization header manually via a folder-level headers: entry instead of relying on the typed auth: block) belongs in the skill. Both belong, but in different places. Without explicit separation, iteration drifts toward patching every observed symptom rather than fixing the structural problem, and the skill accumulates rules that only work for one specific codebase.
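Side by side (the auth: block is quoted from the source; the folder-level headers shape is an assumption about the format):

```yaml
# Broken in Bruno CLI 3.2.x: parsed, warned about on stderr, then dropped
auth:
  type: bearer
  token: "{{authToken}}"

# The workaround the skill prescribes: set the header explicitly
headers:
  - name: Authorization
    value: "Bearer {{authToken}}"
```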
The pass-3 reviewer's "ship-ready" verdict summarised the collection as:
- Correct: assertion operators use the right syntax, auth flows through folder-level headers rather than the broken typed block, and folder execution order is deterministic.
- Loud: every chained request has a guard script that throws a clear error naming the upstream file's actual path if a required variable is missing.
- Self-documenting: every folder and request carries a docs: block explaining purpose, parameters, and dependencies, so the collection doubles as browsable API documentation.
- Coherently tagged: CI excludes write operations, endpoints requiring out-of-band input, and teardown by default; a contract tag distinguishes liveness checks on empty lists from real smoke tests whose data exists in CI.
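Sketched on single requests, under the same assumed field names (the tag semantics are from the review; the YAML shape is illustrative):

```yaml
# On a write request: excluded from CI runs by default
tags: [write]

# On a liveness check against a possibly-empty list
tags: [contract]

# Every folder and request also carries a docs: block
docs: |
  Lists customers. Depends on: 00-auth/login.yml (authToken).
```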
None of those four properties existed in the initial skill state. All four emerged from the iteration loop.
The Distance Between Looking Correct and Actually Working
The output of the design phase was a skill that looked complete and well-structured. The first external pass revealed it was broken on every single assertion.
That distance — between "plausible" and "correct" — is the thing that internal review cannot close. It requires execution against real conditions with a reviewer who has no stake in the prior design decisions.
The method that closed it:
- a filesystem-first artifact that could be run and measured
- author/reviewer context isolation
- execution against a real, messy backend
- severity triage with explicit status deltas
- skill-level versus project-level separation in every review
The method is reusable. The Bruno skill is just the first project it ran on.