Testing Strategy
Philosophy
The test suite must be compact, fast, and trustworthy. We optimize for five signals: speed (smoke under 30s), size (fewer LOC with the same coverage), quality (strong assertions, proper isolation), coverage (90%+ lines on runtime, 80%+ lines on skills), and reliability (zero flaky tests). More tests are not better: each test must earn its place by proving a specific architectural invariant, so that a failure points at a real regression.
Tests are organized by tier. Unit tests mirror the source tree 1:1 so you always know where to find and add tests. Integration and acceptance tests are organized by feature. Quality tooling (npm run test:quality) tracks all metrics automatically and shows trends over time.
Quick Reference
| I want to... | Command |
|---|---|
| Run unit tests (fast feedback) | npm run test:suite:smoke |
| Run unit tests with coverage | npm run test:coverage |
| Run unit + integration | npm run test:suite:regression |
| Run all tiers + build + acceptance | npm run test:suite:all |
| Run one test by name | npx vitest run --project unit -t "pattern" |
| Run one project | npx vitest run --project integration |
| Run acceptance tests | npm run test:acceptance |
| Re-record LLM fixtures | npm run test:acceptance:record |
| Real LLM acceptance | npm run test:acceptance:live |
| Full validation (real LLM + slow) | npm run test:acceptance:all |
| Check quality (history recorded automatically) | npm run test:quality |
| Check timing trends | npm run test:quality -- --tests |
| Deep per-file coverage analysis | npm run test:quality -- --coverage |
Goals
| Signal | Metric | Target |
|---|---|---|
| Speed | Per-tier wall-clock | See speed targets below |
| Size | Total test LOC | Minimize — fewer tests with same coverage is better |
| Quality | Assertion density, weak%, isolation%, source mapping% | density ≥1.5, weak=0%, isolation ≥98%, source mapping ≥90% |
| Coverage | Lines, branches, functions by tier | src ≥90%/70%/80%, skills ≥80%/60%/70%, external ≥50% |
| Reliability | Flaky test count | 0 |
Speed Targets
| Tier | Target | Notes |
|---|---|---|
| Unit (smoke) | <30s | Already fast, maintain |
| Integration | <120s | Library cache pre-warming keeps per-test overhead <200ms |
| Acceptance | <5min | Dominated by child process startup and hub git/npm operations |
| Full suite | <10min | Sum of all tiers + build |
Integration tests use library cache pre-warming (test/integration/helpers.ts) to avoid redundant module loading. The library's 170+ skill files are loaded once per vitest worker process and shared across all createRealRuntime() calls. This keeps per-test overhead under 200ms instead of ~1-2s.
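The load-once behavior can be pictured as a module-level memo. This is a minimal sketch, not the contents of test/integration/helpers.ts: the names loadLibrary and getLibrary, the Library type, and the placeholder loader are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for the parsed skill library; the real shape is not shown here.
type Library = Map<string, string>;

let loadCount = 0; // counts how often the expensive load actually runs
let cached: Library | undefined; // module scope: one copy per worker process

// Placeholder for the expensive step (reading and parsing the skill files).
function loadLibrary(): Library {
  loadCount += 1;
  return new Map([["pipeline/cwd", "skill source placeholder"]]);
}

// Every caller goes through this accessor, so only the first call pays the load cost.
function getLibrary(): Library {
  if (cached === undefined) {
    cached = loadLibrary();
  }
  return cached;
}

const first = getLibrary();
const second = getLibrary();
console.log(first === second, loadCount); // true 1
```

Because the cache lives at module scope, it would be shared only within one worker process, which matches the per-worker wording above.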
What Gets Tested
Tests cover src/ (runtime) and skills/ (built-in skills including extensions). Nothing else — .kiro/, scripts/, docs/, examples/ are not tested.
Test Organization
Unit tests mirror the source tree
Unit tests live in test/src/ and test/skills/, mirroring the source file they test. One test file per source file.
| Source | Test |
|---|---|
| src/cascade/merge.ts | test/src/cascade/merge.test.ts |
| skills/pipeline/cwd.skill.ts | test/skills/pipeline/cwd.test.ts |
| skills/extensions/auth/auth.skill.ts | test/skills/extensions/auth.test.ts |
When adding or updating a unit test, place it at the mirrored path. If no test file exists, create one there. If one exists, add to it.
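As a rough aid, the mirroring rule can be written as a path transform. The sketch below is inferred from the first two table rows only (prefix with test/, drop the .skill infix, end in .test.ts). The function name mirroredTestPath is hypothetical, and the third row, where the auth/ directory is collapsed, is deliberately not modeled.

```typescript
// Sketch of the mirroring rule: prefix with test/, drop the .skill infix, use .test.ts.
// Assumption: inferred from the table rows; the project may apply extra rules.
function mirroredTestPath(sourcePath: string): string {
  if (!sourcePath.startsWith("src/") && !sourcePath.startsWith("skills/")) {
    throw new Error(`not a tested source path: ${sourcePath}`);
  }
  const withoutExt = sourcePath.replace(/\.ts$/, "").replace(/\.skill$/, "");
  return `test/${withoutExt}.test.ts`;
}

console.log(mirroredTestPath("src/cascade/merge.ts"));
// test/src/cascade/merge.test.ts
console.log(mirroredTestPath("skills/pipeline/cwd.skill.ts"));
// test/skills/pipeline/cwd.test.ts
```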
npm run test:quality shows the full report by default — all files, reverse map, per-tier breakdown. Use --brief for abbreviated output.
Integration and acceptance tests are organized by feature
- Integration (test/integration/): named by subsystem, e.g. bootstrap.test.ts, auth.test.ts, external-services.test.ts
- Acceptance (test/acceptance/): named by feature, e.g. init.test.ts, agent.test.ts, compile.test.ts
Directory structure
test/
├── src/ # Unit tests mirroring src/
├── skills/ # Unit tests mirroring skills/
├── integration/ # Integration tests (by subsystem)
├── acceptance/ # Acceptance tests (by scenario, spawns CLI)
└── fixtures/ # Shared test fixtures
File header
Every test file has a metadata header:
/**
* @module-tag unit
* @invariants
* - mergeConfigs replaces by default, applies policy per $merge
* - $-prefixed keys are never merged
* @relationships
* - test/src/cascade/orchestrator.test.ts tests cascade-level merge
* @last-audit 2026-04-15
*/
Suites
| Suite | What it runs | When to use |
|---|---|---|
| smoke | Unit tests, bail after 3 | Quick feedback during development |
| coverage | Unit tests + V8 coverage report | When you need coverage numbers |
| regression | Unit + integration, bail after 3 | Before committing runtime changes |
| everything | All tiers + build + acceptance | Final validation |
Skip Flags
| Env var | Effect |
|---|---|
| AGENT_APPS_SKIP_EXTERNAL=1 | Skip tests requiring AWS credentials (Bedrock, MCP servers) |
| AGENT_APPS_INCLUDE_SLOW=1 | Include slow tests (distribution, hub, compile) |
| AGENT_APPS_PROXY_MODE=replay | Replay recorded LLM responses (no Bedrock calls) |
| AGENT_APPS_PROXY_MODE=record | Record LLM responses to fixtures for future replay |
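The first two flags amount to a small gating rule over test tags. The sketch below shows one way to express it, assuming tests carry the tags listed under Tags and that slow tests are skipped unless explicitly included. The function shouldRun and the Env type are hypothetical; where the real suite reads these variables is not shown in this document.

```typescript
type Tag = "unit" | "integration" | "acceptance" | "external" | "slow" | "network";
type Env = Record<string, string | undefined>;

// Sketch: decide whether a test with the given tags runs under the given environment.
function shouldRun(tags: Tag[], env: Env): boolean {
  // External tests are skipped only when the skip flag is set.
  if (tags.includes("external") && env.AGENT_APPS_SKIP_EXTERNAL === "1") return false;
  // Slow tests are opt-in: they run only when the include flag is set.
  if (tags.includes("slow") && env.AGENT_APPS_INCLUDE_SLOW !== "1") return false;
  return true;
}

console.log(shouldRun(["acceptance", "slow"], {})); // false
console.log(shouldRun(["acceptance", "slow"], { AGENT_APPS_INCLUDE_SLOW: "1" })); // true
console.log(shouldRun(["integration", "external"], { AGENT_APPS_SKIP_EXTERNAL: "1" })); // false
```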
Acceptance Test Modes
Acceptance tests use agent-proxy record/replay to avoid hitting the LLM on every run. The default npm run test:acceptance replays recorded fixtures — no AWS credentials needed, runs in ~3 minutes instead of ~30.
| Script | Mode | Slow tests | Use case |
|---|---|---|---|
| test:acceptance | replay | skipped | Fast default for development |
| test:acceptance:record | record | skipped | Re-record LLM fixtures after prompt/behavior changes |
| test:acceptance:live | passthrough | skipped | Real LLM validation (requires AWS credentials) |
| test:acceptance:all | passthrough | included | Full end-to-end validation (real LLM + distribution) |
How record/replay works
The agent-proxy skill (skills/extensions/agent/proxy/agent-proxy.skill.ts) intercepts the provider interface between agent.skill.ts and the real LLM provider (agent-strands). It records the LLM's decisions (tool calls, messages) while letting tool execution happen for real.
Recording: The proxy delegates to the real provider, captures all hook events and tool call args/results, writes them to a JSON fixture file.
Replay: The proxy reads the fixture, fires recorded events to the hook callback, and executes recorded tool calls against the real pipeline. Only the LLM's decisions are canned — side effects (file writes, shell commands) happen for real.
Nesting: Nested agent calls (sub-agents, recursive chains) go through the proxy too. Start-order indexing ensures each nested call gets the correct recording slot regardless of completion order.
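Start-order indexing can be illustrated with a toy recorder that reserves a slot when a call begins and fills it when the call ends. This is a sketch of the idea only: SlotRecorder, SlotReplayer, and the string payloads are hypothetical, and the real fixture format of agent-proxy is not shown in this document.

```typescript
// Hypothetical recorder: a slot is reserved when a call starts and filled when it ends.
class SlotRecorder<T> {
  private nextSlot = 0;
  private readonly results = new Map<number, T>();

  // Reserve the next slot at call start and return its index.
  begin(): number {
    return this.nextSlot++;
  }

  // Store the result in the slot reserved at start, whenever the call finishes.
  complete(slot: number, result: T): void {
    this.results.set(slot, result);
  }

  // Emit recordings in start order, regardless of the order they completed in.
  toFixture(): T[] {
    const fixture: T[] = [];
    for (let slot = 0; slot < this.nextSlot; slot++) {
      const result = this.results.get(slot);
      if (result === undefined) throw new Error(`call in slot ${slot} never completed`);
      fixture.push(result);
    }
    return fixture;
  }
}

// Hypothetical replayer: hands recordings back in the same start order.
class SlotReplayer<T> {
  private nextSlot = 0;
  private readonly fixture: readonly T[];

  constructor(fixture: readonly T[]) {
    this.fixture = fixture;
  }

  next(): T {
    if (this.nextSlot >= this.fixture.length) {
      throw new Error(`no recording for slot ${this.nextSlot}`);
    }
    return this.fixture[this.nextSlot++];
  }
}

const recorder = new SlotRecorder<string>();
const outer = recorder.begin(); // outer agent call starts first
const inner = recorder.begin(); // nested call starts second
recorder.complete(inner, "inner decisions"); // nested call finishes first
recorder.complete(outer, "outer decisions");
const fixture = recorder.toFixture(); // slot 0 holds the outer call, slot 1 the nested call
console.log(fixture);
```

On replay, the outer call again starts before the nested one, so a replayer that hands out slots in start order gives each call its own recording.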
Writing LLM-dependent acceptance tests
All tests that invoke the LLM must be acceptance tests and support record/replay:
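import { beforeAll } from "vitest";
import { useRecordings, runCli, suiteDir } from "./helpers.ts";
// Must be at module top level, before beforeAll: this enables proxy injection
useRecordings("my-test-name");
// suiteDir gives a stable temp directory, so recorded paths match on replay
let dir: string;
beforeAll(async () => { dir = await suiteDir("my-test-name"); });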
Key rules:
- Call useRecordings("name") at module scope; this enables proxy injection
- Use suiteDir("name") for temp directories; it gives stable paths for record/replay
- Use runCli() or runCliRetry() for CLI invocations; proxy args are injected automatically
- Use getServerProxyArgs() for long-lived server processes (MCP, web servers)
- After changing prompts or skill behavior, re-record with npm run test:acceptance:record
- Recordings live in test/fixtures/recordings/<name>/
Coverage Tiers
| Tier | Scope | Target |
|---|---|---|
| Runtime | src/**/*.ts | 90% lines, 70% branches, 80% functions |
| Skills | skills/**/*.ts (minus exempt) | 80% lines, 60% branches, 70% functions |
| Exempt | Infrastructure-dependent code | 25% lines, 10% branches, 10% functions |
Exempt files require real infrastructure (AWS credentials, LLM access, interactive terminals, git/npm operations) and cannot be meaningfully unit tested. They are covered by integration and acceptance tests instead. The exempt list is maintained in scripts/coverage-summary.mjs.
These are the authoritative targets. The coverage-summary script and vitest.config.ts thresholds should match these. Thresholds are for visibility — they do not block commits.
Coverage output: npm run test:coverage writes V8 coverage to coverage/ (JSON, HTML, text). Read the summary with npm run test:coverage:summary.
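Since the thresholds exist for visibility rather than gating, a check against the tier targets naturally reports shortfalls instead of failing. The sketch below copies the numbers from the Coverage Tiers table. The names TARGETS and shortfalls are hypothetical and do not describe scripts/coverage-summary.mjs.

```typescript
type Coverage = { lines: number; branches: number; functions: number };
type Tier = "runtime" | "skills" | "exempt";

// Targets copied from the Coverage Tiers table (percentages).
const TARGETS: Record<Tier, Coverage> = {
  runtime: { lines: 90, branches: 70, functions: 80 },
  skills: { lines: 80, branches: 60, functions: 70 },
  exempt: { lines: 25, branches: 10, functions: 10 },
};

// Return the metrics that fall below the tier's target; an empty list means the tier passes.
function shortfalls(tier: Tier, measured: Coverage): (keyof Coverage)[] {
  const target = TARGETS[tier];
  const metrics: (keyof Coverage)[] = ["lines", "branches", "functions"];
  return metrics.filter((metric) => measured[metric] < target[metric]);
}

// Reports "branches" only: 65 is below the 70% runtime target.
console.log(shortfalls("runtime", { lines: 92, branches: 65, functions: 85 }));
```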
Quality Metrics
npm run test:quality reports assertion density, weak%, isolation%, source mapping%, code metrics, per-tier breakdown, cross-subtree import hints, and test timing trends. Full detail by default — use --brief for abbreviated output. With --coverage, also runs deep per-file analysis. History is recorded automatically to test/.history/quality.jsonl — use --no-history to suppress.
Timing history is recorded to test/.history/timings.jsonl every time tests run through scripts/test-run.sh (all npm run test:* scripts except bare test and test:coverage). The quality report reads this history and shows: per-tier duration summary (unit, integration, acceptance), trend vs previous run and 5-run rolling average, tests whose runtime changed >2x vs their historical median, and the slowest tests. Outlier detection requires 3+ historical runs.
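The outlier rule can be sketched as a comparison against the historical median. This assumes "changed >2x" means more than twice as slow or more than twice as fast; the function names and the millisecond unit are hypothetical, and the quality script may compute this differently.

```typescript
// Median of a non-empty list of durations (milliseconds).
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flag a test whose current duration differs from its historical median by more than 2x
// in either direction. Fewer than 3 historical runs gives no verdict (returns false).
function isTimingOutlier(history: number[], current: number): boolean {
  if (history.length < 3) return false;
  const m = median(history);
  if (m <= 0) return false; // avoid dividing by zero on degenerate history
  const ratio = current / m;
  return ratio > 2 || ratio < 0.5;
}

console.log(isTimingOutlier([100, 110, 90], 250)); // true: 2.5x the median of 100
console.log(isTimingOutlier([100, 110, 90], 150)); // false: 1.5x
console.log(isTimingOutlier([100, 110], 500)); // false: only 2 historical runs
```

Using the median rather than the mean keeps one unusually slow historical run from shifting the baseline.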
Source mapping uses two detection methods:
- Import scanning: static from "..." and dynamic import("...") patterns in test files.
- Location heuristic: if test/<path>.test.ts exists for <path>.ts, it counts as mapped (catches child-process tests that don't import the source directly).
Cross-subtree imports are flagged as hints, not violations. A test importing source files from the same subsystem directory (e.g., channel.test.ts importing gateway-send.skill.ts) is normal. Only cross-subsystem imports (e.g., a cascade test importing a pipeline skill) are flagged.
Test File Headers
Every test file — all tiers — must have a metadata header. See the tests-audit skill for the full vocabulary.
/**
* @module-tag unit | integration | acceptance
* @invariants
* - <architectural guarantee>
* @relationships
* - <related test file>
* @last-audit YYYY-MM-DD
*/
@last-audit enables incremental audits — only files with stale dates are re-reviewed.
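An incremental audit only needs the date from each header and a cutoff to compare it against. The sketch below assumes the YYYY-MM-DD format shown above. The names parseLastAudit and needsAudit and the cutoff parameter are hypothetical; what counts as stale is defined by the tests-audit skill, not here.

```typescript
// Extract the @last-audit date (YYYY-MM-DD) from a test file's header, if present.
function parseLastAudit(fileText: string): string | undefined {
  const match = /@last-audit\s+(\d{4}-\d{2}-\d{2})/.exec(fileText);
  return match ? match[1] : undefined;
}

// A file needs re-review if it has no audit date or the date is before the cutoff.
// ISO dates compare correctly as plain strings.
function needsAudit(fileText: string, cutoff: string): boolean {
  const lastAudit = parseLastAudit(fileText);
  return lastAudit === undefined || lastAudit < cutoff;
}

const header = `/**
 * @module-tag unit
 * @last-audit 2026-04-15
 */`;
console.log(needsAudit(header, "2026-05-01")); // true: audited before the cutoff
console.log(needsAudit(header, "2026-04-01")); // false: audited after the cutoff
```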
Writing Tests
See the tests-update skill.
Test Auditing
See the tests-audit skill.
Tags
| Tag | Meaning | Timeout |
|---|---|---|
| unit | Pure logic, mocked deps | 10s |
| integration | Real bootstrap + disk I/O | 30s |
| acceptance | CLI acceptance tests against dist/ | 120s |
| external | Requires AWS credentials | 60s |
| slow | >30s per test (hub, compile) | 180s |
| network | Uses network ports | 30s |
Test Context
- Normal skills: mockContext(), where ctx.target === ctx
- Middleware skills: mockMiddlewareContext(), where ctx.target is the served context
- Integration: Real Runtime with bootstrap and disk I/O
- Acceptance: runCli(), which spawns dist/src/cli.js
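The difference between the two mock contexts comes down to what target points at. The sketch below uses a hypothetical two-field context and deliberately different function names, since the real mockContext() and mockMiddlewareContext() signatures and the real context type are not shown in this document.

```typescript
// Hypothetical minimal context shape; the real context type has more fields.
interface Ctx {
  name: string;
  target: Ctx;
}

// Normal skills: the context targets itself, so ctx.target === ctx.
function makeSelfTargetingContext(name = "skill"): Ctx {
  const ctx = { name } as Ctx;
  ctx.target = ctx;
  return ctx;
}

// Middleware skills: the context targets the context it serves.
function makeMiddlewareContext(served: Ctx, name = "middleware"): Ctx {
  return { name, target: served };
}

const served = makeSelfTargetingContext("served-skill");
const middleware = makeMiddlewareContext(served);
console.log(served.target === served); // true
console.log(middleware.target === served); // true
```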
npm Scripts
| Script | What it runs |
|---|---|
| test:suite:smoke | Unit tests, bail after 3 |
| test:suite:regression | Unit + integration, bail after 3 |
| test:suite:all | All tiers + build + acceptance |
| test:unit | Unit tests only |
| test:coverage | Unit tests with coverage report |
| test:coverage:summary | Coverage by tier from last run |
| test:integration | Integration only |
| test:acceptance | Acceptance: replay mode, skip slow (fast default) |
| test:acceptance:record | Acceptance: record LLM fixtures |
| test:acceptance:live | Acceptance: real LLM, skip slow |
| test:acceptance:all | Acceptance: real LLM + slow tests (full validation) |
| test:quality | Quality report (tests + code) |
| test:mutation | Stryker mutation testing |
| check | tsc + biome + knip + eslint + madge |
| check:all | check + clone detection + npm audit |