`soup eval design` — derive evals from your training data

Before v0.55 you wrote your eval suite by hand. Now Soup drafts one from your data.

`soup eval design`

```bash
soup eval design data.jsonl --goal "polite customer support chat" --num-dimensions 5
```

How it works:

1. TF-IDF salience selects up to num_dimensions (default 5) salient terms across the dataset, computed on a DoS-capped subsample of at most 10,000 rows.

2. Goal-keyword dispatch maps each dimension to a scorer:

- json / code / math → rlvr (verifiable reward)
- classify → exact_match
- extract → regex
- default → judge

3. Output: a frozen EvalDesign (JSON) with one EvalDimension per row.

Scorer allowlist: {exact_match, regex, judge, rlvr}.

`soup eval discover` — canaries

```bash
soup eval discover data.jsonl --num-clusters 5 --per-cluster 3
```

Three sets:

  • Held-out canaries — greedy farthest-first Jaccard-distance clustering (_CLUSTER_SUBSAMPLE = 10_000).
  • Adjacent-skill probes — neighbours that fall just outside training distribution.
  • Memorization probes — 25%-prefix truncation. If the trained model can reproduce the rest of a training row from that prefix, it has memorized the row.

Per-group cap: _MAX_CANARIES_PER_GROUP = 1024.
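Greedy farthest-first selection under Jaccard distance can be sketched like this (a minimal, assumption-laden version: token sets per row, seeding on the first row, no subsampling):

```python
# Minimal sketch of farthest-first Jaccard selection; Soup's
# internals (seeding, subsampling, tokenization) may differ.

def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def farthest_first(rows: list, k: int) -> list:
    """Pick k row indices, each maximally distant from those already chosen."""
    chosen = [0]  # seed with the first row
    while len(chosen) < min(k, len(rows)):
        best = max(
            (i for i in range(len(rows)) if i not in chosen),
            key=lambda i: min(jaccard_distance(rows[i], rows[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen

rows = [{"a", "b"}, {"a", "b"}, {"x", "y"}, {"a", "x"}]
print(farthest_first(rows, 2))  # → [0, 2]: the seed plus the most distant row
```

The "min distance to any chosen point" criterion is what spreads canaries across the dataset instead of clumping them in one dense region.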

`soup eval lock` — pin the suite

```bash
soup eval lock my-design.json
```

Locks the design as a SHA-256-checksummed eval_suite artifact via canonicalise_design_bytes (canonical-JSON for stable hashes across runs). The frozen LockedSuite (path / sha256 / dimension_count) is registered in the v0.26 registry alongside the new canaries artifact kind.
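The canonicalisation step is what makes the checksum stable. A sketch of the assumed behaviour (sorted keys, fixed separators — the real canonicalise_design_bytes may do more):

```python
# Sketch of canonical-JSON hashing for a stable suite checksum.
# Assumed behaviour of canonicalise_design_bytes, not Soup's actual code.
import hashlib
import json

def canonicalise_design_bytes(design: dict) -> bytes:
    # Sorted keys + fixed separators make the byte output independent
    # of dict insertion order and pretty-printing, so the SHA-256
    # digest is identical across runs.
    return json.dumps(design, sort_keys=True, separators=(",", ":")).encode()

design = {"dimensions": [{"name": "politeness", "scorer": "judge"}]}
digest = hashlib.sha256(canonicalise_design_bytes(design)).hexdigest()
```

Two designs that differ only in key order hash identically, which is exactly what a frozen LockedSuite needs.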

`soup eval coverage` — gap analysis

```bash
soup eval coverage my-design.json --task reasoning
```

Checks the locked design against the v0.54.0 TASK_CATEGORIES taxonomy and the _RECOMMENDED_SCORERS allowlist (e.g. reasoning → (rlvr, judge), format_conversion → (regex, rlvr)). Returns a CoverageReport with concrete gap recommendations.
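The core of the check is a set difference against the recommendation table. A hypothetical sketch (the table below copies the two examples above; `coverage_gaps` is an invented helper, not Soup's API):

```python
# Hypothetical shape of the coverage gap check.
# _RECOMMENDED_SCORERS here holds only the two documented examples.
_RECOMMENDED_SCORERS = {
    "reasoning": ("rlvr", "judge"),
    "format_conversion": ("regex", "rlvr"),
}

def coverage_gaps(design_scorers: set, task: str) -> list:
    """Return recommended scorers the locked design is missing for this task."""
    return [s for s in _RECOMMENDED_SCORERS.get(task, ()) if s not in design_scorers]

print(coverage_gaps({"judge"}, "reasoning"))  # → ['rlvr']
```

A non-empty result would surface in the CoverageReport as a concrete "add this scorer" recommendation.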

`soup eval gate-install` — git regression gate

```bash
soup eval gate-install --baseline run-id-7f3a
```

Writes a .git/hooks/pre-push hook (written atomically, POSIX mode 0o755) that:

1. Runs your locked eval suite on the current head.

2. Compares each GateThresholds metric (task_accuracy / refusal_rate / format_validity / p95_latency_ms) against the baseline via paired_bootstrap_ci(baseline, candidate, n_samples, ci_level, seed).

- n_samples ∈ [100, 100_000]

- ci_level ∈ (0, 1)

3. decide_regression uses direction-aware metric handling via _METRIC_DIRECTION — higher-is-better for accuracy, lower-is-better for latency.

4. Refuses the push on a RegressionVerdict of REGRESSED.
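Steps 2–4 can be sketched together. This is a simplified stand-in (percentile bootstrap over paired deltas, two metrics only, string verdicts) under the parameter bounds stated above, not Soup's implementation:

```python
# Sketch of a paired bootstrap CI over per-example deltas, plus a
# direction-aware regression verdict. Simplified; not Soup's code.
import random

_METRIC_DIRECTION = {"task_accuracy": "higher", "p95_latency_ms": "lower"}

def paired_bootstrap_ci(baseline, candidate, n_samples=1000, ci_level=0.95, seed=0):
    assert 100 <= n_samples <= 100_000 and 0 < ci_level < 1
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    # Resample the paired deltas and collect the resampled means.
    means = sorted(
        sum(rng.choice(deltas) for _ in deltas) / len(deltas)
        for _ in range(n_samples)
    )
    lo = means[int((1 - ci_level) / 2 * n_samples)]
    hi = means[int((1 + ci_level) / 2 * n_samples) - 1]
    return lo, hi

def decide_regression(metric, ci):
    lo, hi = ci
    if _METRIC_DIRECTION[metric] == "higher":
        return "REGRESSED" if hi < 0 else "OK"  # CI entirely below zero
    return "REGRESSED" if lo > 0 else "OK"      # lower-is-better: CI above zero
```

Pairing matters: bootstrapping per-example deltas (rather than the two runs independently) cancels example-level difficulty and tightens the interval.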

The hook script is rendered via render_pre_push_hook with shlex.quote for every interpolated path — no shell injection.
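The quoting discipline looks roughly like this. The template and the `soup eval run` invocation inside it are invented for illustration; only the shlex.quote-everything rule is taken from the text above:

```python
# Sketch of safe hook rendering: every interpolated value passes through
# shlex.quote, so spaces and shell metacharacters stay inert.
# Template and inner command are hypothetical, not render_pre_push_hook's real output.
import shlex

def render_pre_push_hook(suite_path: str, baseline_id: str) -> str:
    return "\n".join([
        "#!/bin/sh",
        f"exec soup eval run {shlex.quote(suite_path)} --baseline {shlex.quote(baseline_id)}",
    ])

print(render_pre_push_hook("my suite.json", "run-id-7f3a"))
```

A path like `my suite.json; rm -rf ~` is rendered as one single-quoted token, which is the whole point of routing every interpolation through shlex.quote.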

See also

  • [Quant-check](/docs/quant-check) — same idea, but for quant-induced regression
  • [Eval-gated training](/docs/eval-gate) — halt training when quality drops
  • [Registry](/docs/registry) — where eval_suite and canaries artifacts live