Testing Strategy

Philosophy

The test suite must be compact, fast, and trustworthy. We optimize for five signals: speed (smoke under 30s), size (fewer LOC with same coverage), quality (strong assertions, proper isolation), coverage (90%+ on runtime and skills), and reliability (zero flaky tests). More tests is not better — each test must earn its place by proving a specific architectural invariant that would catch a real regression.

Tests are organized by tier. Unit tests mirror the source tree 1:1 so you always know where to find and add tests. Integration and acceptance tests are organized by feature. Quality tooling (npm run test:quality) tracks all metrics automatically and shows trends over time.

Quick Reference

| I want to... | Command |
| --- | --- |
| Run unit tests (fast feedback) | npm run test:suite:smoke |
| Run unit tests with coverage | npm run test:coverage |
| Run unit + integration | npm run test:suite:regression |
| Run all tiers + build + acceptance | npm run test:suite:all |
| Run one test by name | npx vitest run --project unit -t "pattern" |
| Run one project | npx vitest run --project integration |
| Run acceptance tests | npm run test:acceptance |
| Re-record LLM fixtures | npm run test:acceptance:record |
| Real LLM acceptance | npm run test:acceptance:live |
| Full validation (real LLM + slow) | npm run test:acceptance:all |
| Check quality (history recorded automatically) | npm run test:quality |
| Check timing trends | npm run test:quality -- --tests |
| Deep per-file coverage analysis | npm run test:quality -- --coverage |

Goals

| Signal | Metric | Target |
| --- | --- | --- |
| Speed | Per-tier wall-clock | See speed targets below |
| Size | Total test LOC | Minimize — fewer tests with same coverage is better |
| Quality | Assertion density, weak%, isolation% | density ≥1.5, weak = 0%, isolation ≥98%, source mapping ≥90% |
| Coverage | Lines, branches, functions by tier | src ≥90%/70%/80%, skills ≥80%/60%/70%, external ≥50% |
| Reliability | Flaky test count | 0 |

Speed Targets

| Tier | Target | Notes |
| --- | --- | --- |
| Unit (smoke) | <30s | Already fast, maintain |
| Integration | <120s | Library cache pre-warming keeps per-test overhead <200ms |
| Acceptance | <5min | Dominated by child process startup and hub git/npm operations |
| Full suite | <10min | Sum of all tiers + build |

Integration tests use library cache pre-warming (test/integration/helpers.ts) to avoid redundant module loading. The library's 170+ skill files are loaded once per vitest worker process and shared across all createRealRuntime() calls. This keeps per-test overhead under 200ms instead of ~1-2s.
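
The pre-warming pattern can be sketched as follows. This is a minimal illustration, not the real test/integration/helpers.ts code: `loadLibrary`, `createRuntimeSketch`, and the synchronous loading are assumptions for brevity (the real loader is async and reads 170+ files).

```typescript
// Module-level state lives for the whole vitest worker process, so the
// expensive skill load runs once and every later runtime creation reuses it.
let libraryCache: Map<string, string> | null = null;
let loads = 0;

function loadLibrary(): Map<string, string> {
  if (libraryCache !== null) return libraryCache; // warm path, ~0ms
  loads++;                                        // cold path, paid once per worker
  libraryCache = new Map([["pipeline/cwd", "/* skill source */"]]); // stand-in for the skill files
  return libraryCache;
}

function createRuntimeSketch() {
  return { skills: loadLibrary() }; // every runtime shares the warmed cache
}
```

The key design point is that the cache keys off worker-process lifetime, not test lifetime, which is why per-test overhead stays under the 200ms budget.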

What Gets Tested

Tests cover src/ (runtime) and skills/ (built-in skills including extensions). Nothing else — .kiro/, scripts/, docs/, examples/ are not tested.

Test Organization

Unit tests mirror the source tree

Unit tests live in test/src/ and test/skills/, mirroring the source file they test. One test file per source file.

| Source | Test |
| --- | --- |
| src/cascade/merge.ts | test/src/cascade/merge.test.ts |
| skills/pipeline/cwd.skill.ts | test/skills/pipeline/cwd.test.ts |
| skills/extensions/auth/auth.skill.ts | test/skills/extensions/auth.test.ts |

When adding or updating a unit test, place it at the mirrored path. If no test file exists, create one there. If one exists, add to it.
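
The mirroring convention can be expressed as a path mapping. This sketch is illustrative only (no such script exists in the repo); note that the extension-skill case collapses one directory level, as the auth example above shows.

```typescript
// Derive the mirrored test path for a source file, following the table above.
function mirrorTestPath(sourcePath: string): string {
  // Extension skills drop the per-skill directory:
  // skills/extensions/auth/auth.skill.ts -> test/skills/extensions/auth.test.ts
  const ext = sourcePath.match(/^skills\/extensions\/([^/]+)\/\1\.skill\.ts$/);
  if (ext) return `test/skills/extensions/${ext[1]}.test.ts`;
  // Everything else mirrors 1:1 under test/, with .skill.ts or .ts -> .test.ts
  return "test/" + sourcePath.replace(/(\.skill)?\.ts$/, ".test.ts");
}
```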

npm run test:quality shows the full report by default — all files, reverse map, per-tier breakdown. Use --brief for abbreviated output.

Integration and acceptance tests are organized by feature

  • Integration (test/integration/): Named by subsystem — bootstrap.test.ts, auth.test.ts, external-services.test.ts
  • Acceptance (test/acceptance/): Named by feature — init.test.ts, agent.test.ts, compile.test.ts

Directory structure

```
test/
├── src/                    # Unit tests mirroring src/
├── skills/                 # Unit tests mirroring skills/
├── integration/            # Integration tests (by subsystem)
├── acceptance/             # Acceptance tests (by scenario, spawns CLI)
└── fixtures/               # Shared test fixtures
```

File header

Every test file has a metadata header:

```typescript
/**
 * @module-tag unit
 * @invariants
 * - mergeConfigs replaces by default, applies policy per $merge
 * - $-prefixed keys are never merged
 * @relationships
 * - test/src/cascade/orchestrator.test.ts tests cascade-level merge
 * @last-audit 2026-04-15
 */
```

Suites

| Suite | What it runs | When to use |
| --- | --- | --- |
| smoke | Unit tests, bail after 3 | Quick feedback during development |
| coverage | Unit tests + V8 coverage report | When you need coverage numbers |
| regression | Unit + integration, bail after 3 | Before committing runtime changes |
| everything | All tiers + build + acceptance | Final validation |

Skip Flags

| Env var | Effect |
| --- | --- |
| AGENT_APPS_SKIP_EXTERNAL=1 | Skip tests requiring AWS credentials (Bedrock, MCP servers) |
| AGENT_APPS_INCLUDE_SLOW=1 | Include slow tests (distribution, hub, compile) |
| AGENT_APPS_PROXY_MODE=replay | Replay recorded LLM responses (no Bedrock calls) |
| AGENT_APPS_PROXY_MODE=record | Record LLM responses to fixtures for future replay |

Acceptance Test Modes

Acceptance tests use agent-proxy record/replay to avoid hitting the LLM on every run. The default npm run test:acceptance replays recorded fixtures — no AWS credentials needed, runs in ~3 minutes instead of ~30.

| Script | Mode | Slow tests | Use case |
| --- | --- | --- | --- |
| test:acceptance | replay | skipped | Fast default for development |
| test:acceptance:record | record | skipped | Re-record LLM fixtures after prompt/behavior changes |
| test:acceptance:live | passthrough | skipped | Real LLM validation (requires AWS credentials) |
| test:acceptance:all | passthrough | included | Full end-to-end validation (real LLM + distribution) |

How record/replay works

The agent-proxy skill (skills/extensions/agent/proxy/agent-proxy.skill.ts) intercepts the provider interface between agent.skill.ts and the real LLM provider (agent-strands). It records the LLM's decisions (tool calls, messages) while letting tool execution happen for real.

Recording: The proxy delegates to the real provider, captures all hook events and tool call args/results, writes them to a JSON fixture file.

Replay: The proxy reads the fixture, fires recorded events to the hook callback, and executes recorded tool calls against the real pipeline. Only the LLM's decisions are canned — side effects (file writes, shell commands) happen for real.

Nesting: Nested agent calls (sub-agents, recursive chains) go through the proxy too. Start-order indexing ensures each nested call gets the correct recording slot regardless of completion order.
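
Start-order indexing can be sketched like this. The names (`makeReplayer`, `startCall`) are hypothetical; the real proxy code differs, but the invariant is the same: the slot is fixed at start time, so completion order cannot shuffle recordings.

```typescript
// Each agent invocation claims the next recording slot when it STARTS,
// so nested calls replay correctly even if an inner call finishes first.
type Recording = { name: string };

function makeReplayer(recordings: Recording[]) {
  let nextSlot = 0;
  return function startCall(): Recording {
    const slot = nextSlot++;   // slot assigned at start time
    return recordings[slot];   // completion order is irrelevant
  };
}
```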

Writing LLM-dependent acceptance tests

All tests that invoke the LLM must be acceptance tests and support record/replay:

```typescript
import { useRecordings, runCli, suiteDir } from "./helpers.ts";

useRecordings("my-test-name");  // Must be at module top level, before beforeAll

let dir: string;
beforeAll(async () => { dir = await suiteDir("my-test-name"); });
```

Key rules:

  • Call useRecordings("name") at module scope — this enables proxy injection
  • Use suiteDir("name") for temp directories — gives stable paths for record/replay
  • Use runCli() or runCliRetry() for CLI invocations — proxy args are injected automatically
  • Use getServerProxyArgs() for long-lived server processes (MCP, web servers)
  • After changing prompts or skill behavior, re-record with npm run test:acceptance:record
  • Recordings live in test/fixtures/recordings/<name>/

Coverage Tiers

| Tier | Scope | Target |
| --- | --- | --- |
| Runtime | `src/**/*.ts` | 90% lines, 70% branches, 80% functions |
| Skills | `skills/**/*.ts` (minus exempt) | 80% lines, 60% branches, 70% functions |
| Exempt | Infrastructure-dependent code | 25% lines, 10% branches, 10% functions |

Exempt files require real infrastructure (AWS credentials, LLM access, interactive terminals, git/npm operations) and cannot be meaningfully unit tested. They are covered by integration and acceptance tests instead. The exempt list is maintained in scripts/coverage-summary.mjs.

These are the authoritative targets. The coverage-summary script and vitest.config.ts thresholds should match these. Thresholds are for visibility — they do not block commits.
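
The runtime and skills targets might appear in the vitest config roughly as below. This is a sketch, not the repo's actual vitest.config.ts; the per-glob threshold keys are an assumption about how the file is organized.

```typescript
// Sketch only: coverage thresholds mirroring the targets above.
// Vitest reports threshold violations; this project treats them as
// visibility, not a commit gate.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      thresholds: {
        "src/**/*.ts":    { lines: 90, branches: 70, functions: 80 },
        "skills/**/*.ts": { lines: 80, branches: 60, functions: 70 },
      },
    },
  },
});
```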

Coverage output: npm run test:coverage writes V8 coverage to coverage/ (JSON, HTML, text). Read the summary with npm run test:coverage:summary.

Quality Metrics

npm run test:quality reports assertion density, weak%, isolation%, source mapping%, code metrics, per-tier breakdown, cross-subtree import hints, and test timing trends. Full detail by default — use --brief for abbreviated output. With --coverage, also runs deep per-file analysis. History is recorded automatically to test/.history/quality.jsonl — use --no-history to suppress.

Timing history is recorded to test/.history/timings.jsonl every time tests run through scripts/test-run.sh (all npm run test:* scripts except bare test and test:coverage). The quality report reads this history and shows:

  • Per-tier duration summary (unit, integration, acceptance)
  • Trend vs the previous run and the 5-run rolling average
  • Tests whose runtime changed >2x vs their historical median (outlier detection requires 3+ historical runs)
  • The slowest tests
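
The outlier rule can be sketched as follows. This is illustrative, not the quality report's real code; in particular, flagging both speedups and slowdowns as "changed >2x" is an assumption.

```typescript
// Flag a test whose latest runtime changed >2x vs its historical median.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function isTimingOutlier(historyMs: number[], latestMs: number): boolean {
  if (historyMs.length < 3) return false; // need 3+ runs for a baseline
  const m = median(historyMs);
  return m > 0 && (latestMs > 2 * m || latestMs < m / 2);
}
```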

Source mapping uses two detection methods:

  1. Import scanning — static from "..." and dynamic import("...") patterns in test files.
  2. Location heuristic — if test/<path>.test.ts exists for <path>.ts, it counts as mapped (catches child-process tests that don't import the source directly).
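
The import-scanning method amounts to extracting module specifiers from both import forms. The regexes below are illustrative, not the exact patterns the quality tooling uses.

```typescript
// Collect module specifiers from static and dynamic imports in a test file.
function scanImports(testSource: string): string[] {
  const specs: string[] = [];
  const staticRe = /from\s+["']([^"']+)["']/g;           // import { x } from "..."
  const dynamicRe = /import\(\s*["']([^"']+)["']\s*\)/g; // await import("...")
  for (const re of [staticRe, dynamicRe]) {
    for (const m of testSource.matchAll(re)) specs.push(m[1]);
  }
  return specs;
}
```

The location heuristic exists precisely because this scan misses tests (like acceptance tests) that exercise a source file through a child process rather than importing it.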

Cross-subtree imports are flagged as hints, not violations. A test importing source files from the same subsystem directory (e.g., channel.test.ts importing gateway-send.skill.ts) is normal. Only cross-subsystem imports (e.g., a cascade test importing a pipeline skill) are flagged.

Test File Headers

Every test file — all tiers — must have a metadata header. See the tests-audit skill for the full vocabulary.

```typescript
/**
 * @module-tag unit | integration | acceptance
 * @invariants
 * - <architectural guarantee>
 * @relationships
 * - <related test file>
 * @last-audit YYYY-MM-DD
 */
```

@last-audit enables incremental audits — only files with stale dates are re-reviewed.

Writing Tests

See the tests-update skill.

Test Auditing

See the tests-audit skill.

Tags

| Tag | Meaning | Timeout |
| --- | --- | --- |
| unit | Pure logic, mocked deps | 10s |
| integration | Real bootstrap + disk I/O | 30s |
| acceptance | CLI acceptance tests against dist/ | 120s |
| external | Requires AWS credentials | 60s |
| slow | >30s per test (hub, compile) | 180s |
| network | Uses network ports | 30s |
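
In vitest, per-tier timeouts like the ones above are typically expressed as projects. This is a sketch under that assumption, not the repo's actual config (the include globs and project shape may differ).

```typescript
// Sketch only: one vitest project per tier, each with its own timeout.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    projects: [
      { test: { name: "unit",        include: ["test/src/**", "test/skills/**"], testTimeout: 10_000 } },
      { test: { name: "integration", include: ["test/integration/**"],           testTimeout: 30_000 } },
      { test: { name: "acceptance",  include: ["test/acceptance/**"],            testTimeout: 120_000 } },
    ],
  },
});
```

A project-per-tier layout is also what makes `npx vitest run --project unit` style invocations work.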

Test Context

  • Normal skills: mockContext() — ctx.target === ctx
  • Middleware skills: mockMiddlewareContext() — ctx.target is the served context
  • Integration: Real Runtime with bootstrap and disk I/O
  • Acceptance: runCli() — spawns dist/src/cli.js
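
The two mock-context shapes can be sketched as below. These are hypothetical stand-ins (`mockContextSketch`, `mockMiddlewareContextSketch`), not the real test helpers; only the target invariants come from the list above.

```typescript
interface Ctx { target: Ctx; [key: string]: unknown }

// Normal skill context: the context serves itself.
function mockContextSketch(): Ctx {
  const ctx = {} as Ctx;
  ctx.target = ctx;
  return ctx;
}

// Middleware context: target is the context being served.
function mockMiddlewareContextSketch(served: Ctx): Ctx {
  const ctx = {} as Ctx;
  ctx.target = served;
  return ctx;
}
```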

npm Scripts

| Script | What it runs |
| --- | --- |
| test:suite:smoke | Unit tests, bail after 3 |
| test:suite:regression | Unit + integration, bail after 3 |
| test:suite:all | All tiers + build + acceptance |
| test:unit | Unit tests only |
| test:coverage | Unit tests with coverage report |
| test:coverage:summary | Coverage by tier from last run |
| test:integration | Integration only |
| test:acceptance | Acceptance — replay mode, skip slow (fast default) |
| test:acceptance:record | Acceptance — record LLM fixtures |
| test:acceptance:live | Acceptance — real LLM, skip slow |
| test:acceptance:all | Acceptance — real LLM + slow tests (full validation) |
| test:quality | Quality report (tests + code) |
| test:mutation | Stryker mutation testing |
| check | tsc + biome + knip + eslint + madge |
| check:all | check + clone detection + npm audit |
