Agent Eval Harness

The Agent Eval Harness adds deterministic contract checks on top of structural validation.

It verifies behavior-critical metadata and instruction markers for:

  • Planner read-only posture
  • Orchestrator plan-first and loop protocol markers
  • Review verification-gate markers
  • Command routing/frontmatter/argument consistency
  • Permission skill/task invariants
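
As a rough illustration of what a deterministic contract check of this kind could look like, the sketch below parses an agent definition into frontmatter and body, then applies two checks for the planner's read-only posture. Everything specific here is an assumption for illustration, not the harness's actual API: the `AgentDoc` and `ContractCheck` shapes, the `mode` frontmatter key, the `Do not modify files` marker text, and the check IDs are all hypothetical.

```typescript
// Hypothetical shapes; the real harness may model documents and checks differently.
interface AgentDoc {
  frontmatter: Record<string, string>;
  body: string;
}

interface ContractCheck {
  id: string;
  description: string;
  // Returns true when the agent definition satisfies the contract.
  passes: (doc: AgentDoc) => boolean;
}

// Split a markdown source into flat `key: value` frontmatter and the body.
// Only simple single-line keys are handled; this is not a YAML parser.
function parseAgentDoc(source: string): AgentDoc {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) return { frontmatter: {}, body: source };
  const frontmatter: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    const sep = line.indexOf(":");
    if (sep === -1) continue; // skip lines that are not key/value pairs
    frontmatter[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return { frontmatter, body: match[2] };
}

// Assumed markers: the `mode` key and the marker sentence are illustrative only.
const plannerChecks: ContractCheck[] = [
  {
    id: "planner-read-only-frontmatter",
    description: "Planner declares a read-only posture in frontmatter",
    passes: (doc) => doc.frontmatter["mode"] === "read-only",
  },
  {
    id: "planner-read-only-marker",
    description: "Planner instructions contain the read-only marker",
    passes: (doc) => doc.body.includes("Do not modify files"),
  },
];

// Run every check and return the IDs of the ones that failed, in declaration order.
function runChecks(source: string, checks: ContractCheck[]): string[] {
  const doc = parseAgentDoc(source);
  return checks.filter((check) => !check.passes(doc)).map((check) => check.id);
}
```

Because the checks are pure functions over file contents, the same input always yields the same list of failures, which is what makes a gate like this usable in CI without model calls.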

Commands

npm run eval:agents
npm run eval:agents:json
npm run eval:agents:trend

Output

  • Human-readable summary in the terminal
  • Optional JSON report at evals/reports/latest.json
  • Trend snapshot markdown at evals/reports/trend-summary.md

In CI (validate-agent-evals), the JSON report and trend snapshot are uploaded as workflow artifacts.

Design Notes

  • Static and deterministic (no model calls)
  • No external dependencies
  • Designed as a lightweight contract gate, not a benchmark framework

Fixtures

Regression fixtures for harness and validator tests live under scripts/fixtures/. This keeps golden inputs explicit and reusable across tests.
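
A golden-fixture regression test can be sketched roughly as follows: each fixture pairs an input with the failures a validator is expected to report, and the runner flags any fixture whose actual result drifts from that expectation. This is a minimal sketch under stated assumptions: the `Fixture` shape, the `validateAgentSource` stand-in, and its rule IDs are hypothetical, and the fixtures are inlined as strings here rather than loaded from scripts/fixtures/ so that the example stays self-contained.

```typescript
// Hypothetical fixture shape: an input plus the failure IDs it should produce.
interface Fixture {
  label: string;
  input: string;
  expectedFailures: string[];
}

// Stand-in validator for illustration only; the real validators are not shown in this document.
function validateAgentSource(input: string): string[] {
  const failures: string[] = [];
  if (!input.startsWith("---\n")) failures.push("missing-frontmatter");
  if (!/^description:\s*\S/m.test(input)) failures.push("missing-description");
  return failures;
}

// Return the labels of fixtures whose actual failures differ from the golden expectation.
function findFixtureMismatches(
  fixtures: Fixture[],
  validator: (input: string) => string[],
): string[] {
  return fixtures
    .filter((fixture) => {
      const actual = validator(fixture.input);
      // Order-sensitive comparison: the validator is expected to be deterministic.
      return JSON.stringify(actual) !== JSON.stringify(fixture.expectedFailures);
    })
    .map((fixture) => fixture.label);
}

// Inline golden inputs (assumed content, for illustration).
const sampleFixtures: Fixture[] = [
  {
    label: "valid-agent",
    input: "---\ndescription: Plans work without editing files\n---\nBody\n",
    expectedFailures: [],
  },
  {
    label: "no-frontmatter",
    input: "Body only\n",
    expectedFailures: ["missing-frontmatter", "missing-description"],
  },
];
```

Keeping the expected failures next to each input makes a regression visible as a named fixture mismatch rather than as a change in aggregate counts.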


Copyright © 2025-2026 Shehab Elhadidy. Licensed under the MIT License.
