`soup diagnose` — six failure-mode probes, one verdict
After you finish training, the question is no longer "did the loss go down" — it's "did I quietly break something." soup diagnose (v0.56.0) is a post-training model report card that runs 6 pure-function failure-mode probes:
| Probe | What it catches |
|---|---|
| `forgetting` | Per-task Δ accuracy with a tolerance band (extends v0.25 eval/forgetting) |
| `refusal` | advbench / xstest delta over caller-supplied generators (`_MAX_REFUSAL_SCAN = 8192`) |
| `format` | JSON / regex / tool-call validity over the RLVR verifier set, with an explicit ReDoS probe (`compiled.search("a" * 128)`) |
| `mode_collapse` | Pairwise n-gram Jaccard distance over K completions (`k ∈ [2, 32]`, `ngram_n ∈ [1, 8]`) |
| `memorization` | Training-prefix echo via partial-prompt continuation (`_MAX_SCAN_ROWS = 1000`) |
| `contamination` | n-gram overlap with public benchmarks (a combined-complexity cap rejects the run when `\|training_rows\| × \|benchmark_corpus\| > 1e9`) |
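
To make the `mode_collapse` row concrete, here is a minimal sketch of a mean pairwise n-gram Jaccard distance over K completions. The function names, the whitespace tokenization, and the choice to average over pairs are assumptions made for illustration; only the `k` and `ngram_n` bounds come from the table, and the real probe may aggregate differently.

```python
from itertools import combinations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text` (assumed tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets count as identical (distance 0)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(completions: list[str], ngram_n: int = 2) -> float:
    """Average Jaccard distance over every pair of the K completions."""
    k = len(completions)
    # Bounds taken from the table: k in [2, 32], ngram_n in [1, 8].
    if not 2 <= k <= 32:
        raise ValueError("k must be in [2, 32]")
    if not 1 <= ngram_n <= 8:
        raise ValueError("ngram_n must be in [1, 8]")
    grams = [ngrams(c, ngram_n) for c in completions]
    pairs = list(combinations(grams, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

A value near 0 means the completions share almost all of their n-grams, which is the signature of a collapsed model; a value near 1 means they are largely distinct.
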
All probes are pure functions over caller-supplied generators — soup diagnose doesn't run the model itself. You wire in a GeneratorFn callable that wraps your serve backend.
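
As a rough illustration of the wiring, the sketch below assumes `GeneratorFn` is a plain prompt-to-completion callable. That signature, and the helper names `make_generator`, `sample_k`, and `fake_backend`, are hypothetical; check the actual type alias in your installed version before relying on it.

```python
from typing import Callable

# Assumed shape: prompt in, completion out. The real alias may differ.
GeneratorFn = Callable[[str], str]


def make_generator(complete: Callable[[str], str]) -> GeneratorFn:
    """Adapt a backend call (HTTP client, local pipeline, ...) to GeneratorFn."""
    def generate(prompt: str) -> str:
        # Normalize surrounding whitespace so probes compare clean text.
        return complete(prompt).strip()
    return generate


def sample_k(generate: GeneratorFn, prompt: str, k: int) -> list[str]:
    """Pure helper: it only calls the generator it is handed."""
    return [generate(prompt) for _ in range(k)]


# Stub backend so the sketch runs without a model server.
def fake_backend(prompt: str) -> str:
    return f"  echo: {prompt}  "


generate = make_generator(fake_backend)
completions = sample_k(generate, "hello", 3)
```

Because the probes never touch the model directly, the same stub pattern makes them easy to unit-test without a GPU.
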
Usage
soup diagnose <run-id> \
--evidence ./evidence.json \
--output ./diagnose.json \
--badge ./badge.svg \
--attach-to-registry registry-id

Verdict thresholds (composed in compose_report):
- ≥ 0.85 → OK
- ≥ 0.60 → MINOR
- < 0.60 → MAJOR
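
The thresholds read as a simple cascade. The sketch below assumes they apply to a single score in [0, 1]; the function name `verdict` is hypothetical, and how `compose_report` aggregates the per-probe scores into that number is not shown here.

```python
def verdict(score: float) -> str:
    """Map a score in [0, 1] to a verdict using the documented cut-offs."""
    if score >= 0.85:
        return "OK"
    if score >= 0.60:
        return "MINOR"
    return "MAJOR"
```

Both boundaries are inclusive on the upper band, so a score of exactly 0.60 is MINOR rather than MAJOR.
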
Output: a frozen FailureReport (fields: run_id, base, adapter, scores, overall, soup_version, extras) plus an embeddable 6-cell SVG badge from render_badge_svg (HTML-escaped, so it is safe to embed in HF model cards). Any mode missing from the evidence is filled in via neutral_score.
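
To show why the escaping matters, here is a minimal sketch of a multi-cell SVG badge that escapes every label before embedding it. This is not the implementation of render_badge_svg; the function names, cell geometry, and colors are invented for illustration, and only the use of HTML escaping reflects the description above.

```python
from html import escape


def render_cell(label: str, x: int, fill: str = "#4c1") -> str:
    """Render one 60x20 cell; the label is escaped so markup cannot leak in."""
    safe = escape(label, quote=True)
    return (
        f'<rect x="{x}" width="60" height="20" fill="{fill}"/>'
        f'<text x="{x + 30}" y="14" text-anchor="middle" font-size="10">{safe}</text>'
    )


def render_badge(labels: list[str]) -> str:
    """Lay the cells out left to right inside a single <svg> element."""
    cells = "".join(render_cell(label, i * 60) for i, label in enumerate(labels))
    width = 60 * len(labels)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="20">'
        f"{cells}</svg>"
    )
```

A label such as `<script>` comes out as `&lt;script&gt;`, so a hostile run name or mode label cannot inject markup into the page that hosts the badge.
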
`soup train --diagnose-gate`
soup train --diagnose-gate ./evidence.json

Refuses the final checkpoint save on a MAJOR regression (typer.Exit(code=2)). The new diagnose_report artifact kind is registered in the v0.26 registry alongside eval_suite and canaries from v0.55.
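
The gate's behavior can be pictured roughly as follows. This is a sketch under assumptions, not the trainer's code: `gate_checkpoint_save` and the `save` callback are hypothetical names, and `SystemExit(2)` stands in for `typer.Exit(code=2)` so the example runs with the standard library alone.

```python
from typing import Callable


def gate_checkpoint_save(verdict: str, save: Callable[[], None]) -> None:
    """Run `save` unless the verdict is MAJOR, in which case exit with code 2."""
    if verdict == "MAJOR":
        # Stand-in for typer.Exit(code=2): the final checkpoint is never written.
        raise SystemExit(2)
    save()
```

In CI, the non-zero exit code is what lets a pipeline stop before a regressed adapter is published.
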
See also
- [Eval-gated training](/docs/eval-gate) — pre-training gate at epoch boundaries
- [Quant-check](/docs/quant-check) — quant-specific regression check
- [Eval design](/docs/eval-design) — design the evidence your gate scores