Testing Strategy

Philosophy

The test suite must be compact, fast, and trustworthy. We optimize for five signals: speed (smoke under 30s), size (fewer LOC with same coverage), quality (strong assertions, proper isolation), coverage (90%+ on runtime and skills), and reliability (zero flaky tests). More tests is not better — each test must earn its place by proving a specific architectural invariant that would catch a real regression.

Tests are organized by tier. Unit tests mirror the source tree 1:1 so you always know where to find and add tests. Integration and acceptance tests are organized by feature. Quality tooling (npm run test:quality) tracks all metrics automatically and shows trends over time.

Quick Reference

| I want to... | Command |
| --- | --- |
| Run unit tests (fast feedback) | npm run test:suite:smoke |
| Run unit tests with coverage | npm run test:coverage |
| Run unit + integration | npm run test:suite:regression |
| Run all tiers + build + acceptance | npm run test:suite:all |
| Run one test by name | npx vitest run --project unit -t "pattern" |
| Run one project | npx vitest run --project integration |
| Run acceptance tests | npm run test:acceptance |
| Re-record LLM fixtures | npm run test:acceptance:record |
| Real LLM acceptance | npm run test:acceptance:live |
| Full validation (real LLM + slow) | npm run test:acceptance:all |
| Check quality (history recorded automatically) | npm run test:quality |
| Check timing trends | npm run test:quality -- --tests |
| Deep per-file coverage analysis | npm run test:quality -- --coverage |

Goals

| Signal | Metric | Target |
| --- | --- | --- |
| Speed | Per-tier wall-clock | See speed targets below |
| Size | Total test LOC | Minimize — fewer tests with same coverage is better |
| Quality | Assertion density, weak%, isolation% | density ≥1.5, weak = 0%, isolation ≥98%, source mapping ≥90% |
| Coverage | Lines, branches, functions by tier | src ≥90%/70%/80%, skills ≥80%/60%/70%, external ≥50% |
| Reliability | Flaky test count | 0 |

Speed Targets

| Tier | Target | Notes |
| --- | --- | --- |
| Unit (smoke) | <30s | Already fast, maintain |
| Integration | <120s | Library cache pre-warming keeps per-test overhead <200ms |
| Acceptance | <5min | Dominated by child process startup and hub git/npm operations |
| Full suite | <10min | Sum of all tiers + build |

Integration tests use library cache pre-warming (test/integration/helpers.ts) to avoid redundant module loading. The library's 170+ skill files are loaded once per vitest worker process and shared across all createRealRuntime() calls. This keeps per-test overhead under 200ms instead of ~1-2s.
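
The pre-warming pattern can be sketched as follows. This is a minimal illustration, not the real test/integration/helpers.ts code: `loadLibrary`, `createRuntimeSketch`, and the synchronous loading are assumptions for brevity (the real loader is async and reads 170+ files).

```typescript
// Module-level state lives for the whole vitest worker process, so the
// expensive skill load runs once and every later runtime creation reuses it.
let libraryCache: Map<string, string> | null = null;
let loads = 0;

function loadLibrary(): Map<string, string> {
  if (libraryCache !== null) return libraryCache; // warm path, ~0ms
  loads++;                                        // cold path, paid once per worker
  libraryCache = new Map([["pipeline/cwd", "/* skill source */"]]); // stand-in for the skill files
  return libraryCache;
}

function createRuntimeSketch() {
  return { skills: loadLibrary() }; // every runtime shares the warmed cache
}
```

The key design point is that the cache keys off worker-process lifetime, not test lifetime, which is why per-test overhead stays under the 200ms budget.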

What Gets Tested

Tests cover src/ (runtime) and skills/ (built-in skills including extensions). Nothing else — .kiro/, scripts/, docs/, examples/ are not tested.

Test Organization

Unit tests mirror the source tree

Unit tests live in test/src/ and test/skills/, mirroring the source file they test. One test file per source file.

| Source | Test |
| --- | --- |
| src/cascade/merge.ts | test/src/cascade/merge.test.ts |
| skills/pipeline/cwd.skill.ts | test/skills/pipeline/cwd.test.ts |
| skills/extensions/auth/auth.skill.ts | test/skills/extensions/auth.test.ts |

When adding or updating a unit test, place it at the mirrored path. If no test file exists, create one there. If one exists, add to it.
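
The mirroring convention can be expressed as a path mapping. This sketch is illustrative only (no such script exists in the repo); note that the extension-skill case collapses one directory level, as the auth example above shows.

```typescript
// Derive the mirrored test path for a source file, following the table above.
function mirrorTestPath(sourcePath: string): string {
  // Extension skills drop the per-skill directory:
  // skills/extensions/auth/auth.skill.ts -> test/skills/extensions/auth.test.ts
  const ext = sourcePath.match(/^skills\/extensions\/([^/]+)\/\1\.skill\.ts$/);
  if (ext) return `test/skills/extensions/${ext[1]}.test.ts`;
  // Everything else mirrors 1:1 under test/, with .skill.ts or .ts -> .test.ts
  return "test/" + sourcePath.replace(/(\.skill)?\.ts$/, ".test.ts");
}
```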

npm run test:quality shows the full report by default — all files, reverse map, per-tier breakdown. Use --brief for abbreviated output.

Integration and acceptance tests are organized by feature

  • Integration (test/integration/): Named by subsystem — bootstrap.test.ts, auth.test.ts, external-services.test.ts
  • Acceptance (test/acceptance/): Named by feature — init.test.ts, agent.test.ts, compile.test.ts

Directory structure

```
test/
├── src/                    # Unit tests mirroring src/
├── skills/                 # Unit tests mirroring skills/
├── integration/            # Integration tests (by subsystem)
├── acceptance/             # Acceptance tests (by scenario, spawns CLI)
└── fixtures/               # Shared test fixtures
```

File header

Every test file has a metadata header:

```typescript
/**
 * @module-tag unit
 * @invariants
 * - mergeConfigs replaces by default, applies policy per $merge
 * - $-prefixed keys are never merged
 * @relationships
 * - test/src/cascade/orchestrator.test.ts tests cascade-level merge
 * @last-audit 2026-04-15
 */
```

Suites

| Suite | What it runs | When to use |
| --- | --- | --- |
| smoke | Unit tests, bail after 3 | Quick feedback during development |
| coverage | Unit tests + V8 coverage report | When you need coverage numbers |
| regression | Unit + integration, bail after 3 | Before committing runtime changes |
| everything | All tiers + build + acceptance | Final validation |

Skip Flags

| Env var | Effect |
| --- | --- |
| AGENT_APPS_SKIP_EXTERNAL=1 | Skip tests requiring AWS credentials (Bedrock, MCP servers) |
| AGENT_APPS_INCLUDE_SLOW=1 | Include slow tests (distribution, hub, compile) |
| AGENT_APPS_PROXY_MODE=replay | Replay recorded LLM responses (no Bedrock calls) |
| AGENT_APPS_PROXY_MODE=record | Record LLM responses to fixtures for future replay |

Acceptance Test Modes

Acceptance tests use agent-proxy record/replay to avoid hitting the LLM on every run. The default npm run test:acceptance replays recorded fixtures — no AWS credentials needed, runs in ~3 minutes instead of ~30.

| Script | Mode | Slow tests | Use case |
| --- | --- | --- | --- |
| test:acceptance | replay | skipped | Fast default for development |
| test:acceptance:record | record | skipped | Re-record LLM fixtures after prompt/behavior changes |
| test:acceptance:live | passthrough | skipped | Real LLM validation (requires AWS credentials) |
| test:acceptance:all | passthrough | included | Full end-to-end validation (real LLM + distribution) |

How record/replay works

The agent-proxy skill (skills/extensions/agent/proxy/agent-proxy.skill.ts) intercepts the provider interface between agent.skill.ts and the real LLM provider (agent-strands). It records the LLM's decisions (tool calls, messages) while letting tool execution happen for real.

Recording: The proxy delegates to the real provider, captures all hook events and tool call args/results, writes them to a JSON fixture file.

Replay: The proxy reads the fixture, fires recorded events to the hook callback, and executes recorded tool calls against the real pipeline. Only the LLM's decisions are canned — side effects (file writes, shell commands) happen for real.

Nesting: Nested agent calls (sub-agents, recursive chains) go through the proxy too. Start-order indexing ensures each nested call gets the correct recording slot regardless of completion order.
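
Start-order indexing can be sketched like this. The names (`makeReplayer`, `startCall`) are hypothetical; the real proxy code differs, but the invariant is the same: the slot is fixed at start time, so completion order cannot shuffle recordings.

```typescript
// Each agent invocation claims the next recording slot when it STARTS,
// so nested calls replay correctly even if an inner call finishes first.
type Recording = { name: string };

function makeReplayer(recordings: Recording[]) {
  let nextSlot = 0;
  return function startCall(): Recording {
    const slot = nextSlot++;   // slot assigned at start time
    return recordings[slot];   // completion order is irrelevant
  };
}
```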

Writing LLM-dependent acceptance tests

All tests that invoke the LLM must be acceptance tests and support record/replay:

```typescript
import { useRecordings, runCli, suiteDir } from "./helpers.ts";

useRecordings("my-test-name");  // Must be at module top level, before beforeAll

let dir: string;
beforeAll(async () => { dir = await suiteDir("my-test-name"); });
```

Key rules:

  • Call useRecordings("name") at module scope — this enables proxy injection
  • Use suiteDir("name") for temp directories — gives stable paths for record/replay
  • Use runCli() or runCliRetry() for CLI invocations — proxy args are injected automatically
  • Use getServerProxyArgs() for long-lived server processes (MCP, web servers)
  • After changing prompts or skill behavior, re-record with npm run test:acceptance:record
  • Recordings live in test/fixtures/recordings/<name>/

Coverage Tiers

| Tier | Scope | Target |
| --- | --- | --- |
| Runtime | `src/**/*.ts` | 90% lines, 70% branches, 80% functions |
| Skills | `skills/**/*.ts` (minus exempt) | 80% lines, 60% branches, 70% functions |
| Exempt | Infrastructure-dependent code | 25% lines, 10% branches, 10% functions |

Exempt files require real infrastructure (AWS credentials, LLM access, interactive terminals, git/npm operations) and cannot be meaningfully unit tested. They are covered by integration and acceptance tests instead. The exempt list is maintained in scripts/coverage-summary.mjs.

These are the authoritative targets. The coverage-summary script and vitest.config.ts thresholds should match these. Thresholds are for visibility — they do not block commits.
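
The runtime and skills targets might appear in the vitest config roughly as below. This is a sketch, not the repo's actual vitest.config.ts; the per-glob threshold keys are an assumption about how the file is organized.

```typescript
// Sketch only: coverage thresholds mirroring the targets above.
// Vitest reports threshold violations; this project treats them as
// visibility, not a commit gate.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      thresholds: {
        "src/**/*.ts":    { lines: 90, branches: 70, functions: 80 },
        "skills/**/*.ts": { lines: 80, branches: 60, functions: 70 },
      },
    },
  },
});
```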

Coverage output: npm run test:coverage writes V8 coverage to coverage/ (JSON, HTML, text). Read the summary with npm run test:coverage:summary.

Quality Metrics

npm run test:quality reports assertion density, weak%, isolation%, source mapping%, code metrics, per-tier breakdown, cross-subtree import hints, and test timing trends. Full detail by default — use --brief for abbreviated output. With --coverage, also runs deep per-file analysis. History is recorded automatically to test/.history/quality.jsonl — use --no-history to suppress.

Timing history is recorded to test/.history/timings.jsonl every time tests run through scripts/test-run.sh (all npm run test:* scripts except bare test and test:coverage). The quality report reads this history and shows:

  • Per-tier duration summary (unit, integration, acceptance)
  • Trend vs the previous run and the 5-run rolling average
  • Tests whose runtime changed >2x vs their historical median (outlier detection requires 3+ historical runs)
  • The slowest tests
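
The outlier rule can be sketched as follows. This is illustrative, not the quality report's real code; in particular, flagging both speedups and slowdowns as "changed >2x" is an assumption.

```typescript
// Flag a test whose latest runtime changed >2x vs its historical median.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function isTimingOutlier(historyMs: number[], latestMs: number): boolean {
  if (historyMs.length < 3) return false; // need 3+ runs for a baseline
  const m = median(historyMs);
  return m > 0 && (latestMs > 2 * m || latestMs < m / 2);
}
```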

Source mapping uses two detection methods:

  1. Import scanning — static from "..." and dynamic import("...") patterns in test files.
  2. Location heuristic — if test/<path>.test.ts exists for <path>.ts, it counts as mapped (catches child-process tests that don't import the source directly).
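
The import-scanning method amounts to extracting module specifiers from both import forms. The regexes below are illustrative, not the exact patterns the quality tooling uses.

```typescript
// Collect module specifiers from static and dynamic imports in a test file.
function scanImports(testSource: string): string[] {
  const specs: string[] = [];
  const staticRe = /from\s+["']([^"']+)["']/g;           // import { x } from "..."
  const dynamicRe = /import\(\s*["']([^"']+)["']\s*\)/g; // await import("...")
  for (const re of [staticRe, dynamicRe]) {
    for (const m of testSource.matchAll(re)) specs.push(m[1]);
  }
  return specs;
}
```

The location heuristic exists precisely because this scan misses tests (like acceptance tests) that exercise a source file through a child process rather than importing it.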

Cross-subtree imports are flagged as hints, not violations. A test importing source files from the same subsystem directory (e.g., channel.test.ts importing gateway-send.skill.ts) is normal. Only cross-subsystem imports (e.g., a cascade test importing a pipeline skill) are flagged.

Test File Headers

Every test file — all tiers — must have a metadata header. See the tests-audit skill for the full vocabulary.

```typescript
/**
 * @module-tag unit | integration | acceptance
 * @invariants
 * - <architectural guarantee>
 * @relationships
 * - <related test file>
 * @last-audit YYYY-MM-DD
 */
```

@last-audit enables incremental audits — only files with stale dates are re-reviewed.

Writing Tests

See the tests-update skill.

Test Auditing

See the tests-audit skill.

Tags

| Tag | Meaning | Timeout |
| --- | --- | --- |
| unit | Pure logic, mocked deps | 10s |
| integration | Real bootstrap + disk I/O | 30s |
| acceptance | CLI acceptance tests against dist/ | 120s |
| external | Requires AWS credentials | 60s |
| slow | >30s per test (hub, compile) | 180s |
| network | Uses network ports | 30s |
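
In vitest, per-tier timeouts like the ones above are typically expressed as projects. This is a sketch under that assumption, not the repo's actual config (the include globs and project shape may differ).

```typescript
// Sketch only: one vitest project per tier, each with its own timeout.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    projects: [
      { test: { name: "unit",        include: ["test/src/**", "test/skills/**"], testTimeout: 10_000 } },
      { test: { name: "integration", include: ["test/integration/**"],           testTimeout: 30_000 } },
      { test: { name: "acceptance",  include: ["test/acceptance/**"],            testTimeout: 120_000 } },
    ],
  },
});
```

A project-per-tier layout is also what makes `npx vitest run --project unit` style invocations work.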

Test Context

  • Normal skills: mockContext() — ctx.target === ctx
  • Middleware skills: mockMiddlewareContext() — ctx.target is the served context
  • Integration: Real Runtime with bootstrap and disk I/O
  • Acceptance: runCli() — spawns dist/src/cli.js
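
The two mock-context shapes can be sketched as below. These are hypothetical stand-ins (`mockContextSketch`, `mockMiddlewareContextSketch`), not the real test helpers; only the target invariants come from the list above.

```typescript
interface Ctx { target: Ctx; [key: string]: unknown }

// Normal skill context: the context serves itself.
function mockContextSketch(): Ctx {
  const ctx = {} as Ctx;
  ctx.target = ctx;
  return ctx;
}

// Middleware context: target is the context being served.
function mockMiddlewareContextSketch(served: Ctx): Ctx {
  const ctx = {} as Ctx;
  ctx.target = served;
  return ctx;
}
```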

npm Scripts

| Script | What it runs |
| --- | --- |
| test:suite:smoke | Unit tests, bail after 3 |
| test:suite:regression | Unit + integration, bail after 3 |
| test:suite:all | All tiers + build + acceptance |
| test:unit | Unit tests only |
| test:coverage | Unit tests with coverage report |
| test:coverage:summary | Coverage by tier from last run |
| test:integration | Integration only |
| test:acceptance | Acceptance — replay mode, skip slow (fast default) |
| test:acceptance:record | Acceptance — record LLM fixtures |
| test:acceptance:live | Acceptance — real LLM, skip slow |
| test:acceptance:all | Acceptance — real LLM + slow tests (full validation) |
| test:quality | Quality report (tests + code) |
| test:mutation | Stryker mutation testing |
| check | tsc + biome + knip + eslint + madge |
| check:all | check + clone detection + npm audit |
