# Eval-Gated Training
v0.26.0 adds declarative eval suites that run at epoch boundaries. If a task score falls below its threshold — or regresses against a baseline — training halts before you waste another epoch.
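Conceptually, the gate wraps the training loop: after every epoch, each task in the suite is scored, and training halts as soon as any task misses its threshold or regresses past the baseline. The sketch below is illustrative only (the `gated_training`, `train_epoch`, and task-tuple shapes are hypothetical, not soup internals):

```python
# Illustrative sketch of eval-gated training; not soup's implementation.
def gated_training(train_epoch, eval_tasks, baseline=None, regression=0.05):
    """Run epochs until done, halting early if the eval gate fails.

    eval_tasks: list of (name, score_fn, threshold) tuples (hypothetical shape).
    baseline:   optional {task_name: score} map from a previous run.
    """
    history = []
    while True:
        done = train_epoch()                      # one epoch of training
        scores = {name: fn() for name, fn, _ in eval_tasks}
        history.append(scores)
        for name, _, threshold in eval_tasks:
            score, base = scores[name], (baseline or {}).get(name)
            if score < threshold or (base is not None and score < base - regression):
                return "FAIL", history            # halt before the next epoch
        if done:
            return "PASS", history
```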
## The gate file
Every entry in `tasks:` is one of three types: `custom`, `judge`, or `benchmark`. Each task has a `name` (used as the baseline key) and a numeric `threshold`.
```yaml
# evals/gate.yaml
suite: chat-quality
tasks:
  - type: custom
    name: tool_calls
    tasks: evals/tool_calls.jsonl
    scorer: exact                    # exact | contains | regex | semantic
    threshold: 0.70
  - type: judge
    name: chat_judge
    prompts: evals/judge_prompts.jsonl
    judge_model: ollama://llama3.1   # or https://... or http://localhost
    threshold: 7.0
  - type: benchmark
    name: mmlu
    benchmark: mini_mmlu
    threshold: 0.60
```

## Enable it
Either inline in `soup.yaml`:
```yaml
training:
  eval_gate: ./evals/gate.yaml
```

…or on the command line:
```shell
soup train --config soup.yaml --gate ./evals/gate.yaml
```

## Post-hoc verdict
Run the gate standalone against any model:
```shell
soup eval gate --suite ./evals/gate.yaml
# ✓ mmlu: 0.648 (baseline 0.643, +0.005) PASS
# ✗ chat_judge: 7.1 (baseline 8.2, -1.1) REGRESSION
# → verdict: FAIL
```

A task fails if its score is below its threshold *or* if it drops more than the configured regression threshold (default 0.05) below the supplied baseline.
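That per-task rule can be sketched as a small function (illustrative, not soup's code; both `FAIL` and `REGRESSION` fail the gate):

```python
# Per-task verdict: fail on absolute threshold, or on regression past
# the baseline by more than the regression threshold (default 0.05).
def task_verdict(score, threshold, baseline=None, regression=0.05):
    if score < threshold:
        return "FAIL"          # below the absolute threshold
    if baseline is not None and score < baseline - regression:
        return "REGRESSION"    # dropped too far below the baseline
    return "PASS"
```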
## Baselines
- `registry://<id>` — pulls eval results for the referenced [registry](/docs/registry) entry.
- `./baseline.json` — a JSON map of `{task_name: score}`.
- omitted — tasks are judged only against their `threshold`.
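For the `./baseline.json` form, the file is just a flat map from task name to score; for the gate file above it might look like this (the `tool_calls` value is illustrative, the others match the sample verdict output):

```json
{
  "tool_calls": 0.74,
  "chat_judge": 8.2,
  "mmlu": 0.643
}
```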
## Judge URL allowlist
`judge_model` must use one of: `ollama://`, `https://`, or `http://localhost` / `http://127.0.0.1`. Any other scheme is rejected at load time — an SSRF guard on the eval path.
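A check like this can be sketched with the standard library (a hypothetical helper for illustration, not soup's implementation):

```python
from urllib.parse import urlparse

# Hosts permitted under the plain-http scheme, per the allowlist above.
ALLOWED_HTTP_HOSTS = {"localhost", "127.0.0.1"}

def judge_url_allowed(url: str) -> bool:
    """Return True iff the judge_model URL passes the scheme allowlist."""
    parsed = urlparse(url)
    if parsed.scheme in ("ollama", "https"):
        return True
    if parsed.scheme == "http":
        return parsed.hostname in ALLOWED_HTTP_HOSTS
    return False   # every other scheme is rejected
```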
## See also
- [Registry](/docs/registry) — the typical baseline source
- [Evaluation](/docs/experiments) — the broader eval platform