`soup diagnose` — six failure-mode probes, one verdict

After you finish training, the question is no longer "did the loss go down" but "did I quietly break something." `soup diagnose` (v0.56.0) is a post-training model report card that runs six pure-function failure-mode probes:

| Probe | What it catches |
| --- | --- |
| `forgetting` | Per-task Δ accuracy with tolerance band (extends v0.25 eval/forgetting) |
| `refusal` | advbench / xstest delta over caller-supplied generators (`_MAX_REFUSAL_SCAN = 8192`) |
| `format` | JSON / regex / tool-call validity over the RLVR verifier set, with an explicit ReDoS probe (`compiled.search("a" * 128)`) |
| `mode_collapse` | Pairwise n-gram-Jaccard distance over K completions (`k ∈ [2, 32]`, `ngram_n ∈ [1, 8]`) |
| `memorization` | Training-prefix echo via partial-prompt continuation (`_MAX_SCAN_ROWS = 1000`) |
| `contamination` | n-gram overlap with public benchmarks (combined-complexity cap rejects when `training_rows × benchmark_corpus > 1e9`) |
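
To make the `mode_collapse` row concrete, here is a minimal sketch of a pairwise n-gram Jaccard distance over K completions. The function names, whitespace tokenization, and the choice to average over pairs are assumptions for illustration, not the probe's actual implementation; only the `k` and `ngram_n` bounds come from the table above.

```python
from itertools import combinations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text` (tokenizer is an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical (distance 0)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(completions: list[str], ngram_n: int = 2) -> float:
    """Average Jaccard distance over all unordered pairs of completions."""
    if not 2 <= len(completions) <= 32:
        raise ValueError("k must be in [2, 32]")
    if not 1 <= ngram_n <= 8:
        raise ValueError("ngram_n must be in [1, 8]")
    grams = [ngrams(c, ngram_n) for c in completions]
    pairs = list(combinations(grams, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

A value near 0 means the completions share almost all of their n-grams, which is the signature of collapse; a value near 1 means they barely overlap.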

All probes are pure functions over caller-supplied generators: `soup diagnose` doesn't run the model itself. You wire in a `GeneratorFn` callable that wraps your serve backend.
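
As a rough illustration of that wiring, the sketch below assumes `GeneratorFn` is a plain "prompt in, completion out" callable. That signature is a guess, as are `make_generator`, `echo_backend`, and `sample_k`; check the real type before relying on it.

```python
from typing import Callable

# Assumed shape: prompt in, completion out. The real GeneratorFn may differ.
GeneratorFn = Callable[[str], str]


def make_generator(backend: Callable[[str], str], max_chars: int = 2048) -> GeneratorFn:
    """Wrap a serve-backend call so probes only ever see a plain callable."""
    def generate(prompt: str) -> str:
        # Truncate so a runaway completion cannot blow up downstream scans.
        return backend(prompt)[:max_chars]
    return generate


def echo_backend(prompt: str) -> str:
    """Hypothetical stand-in for a real serve backend; it just echoes the prompt."""
    return f"echo: {prompt}"


def sample_k(generate: GeneratorFn, prompt: str, k: int) -> list[str]:
    """Collect k completions for one prompt, as a probe might."""
    return [generate(prompt) for _ in range(k)]
```

Because the probes only depend on the callable, swapping `echo_backend` for an HTTP client or a local inference call leaves the probe code untouched.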

Usage

```bash
soup diagnose <run-id> \
  --evidence ./evidence.json \
  --output ./diagnose.json \
  --badge ./badge.svg \
  --attach-to-registry registry-id
```

Verdict thresholds (composed in compose_report):

  • ≥ 0.85: OK
  • ≥ 0.60: MINOR
  • < 0.60: MAJOR
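
The threshold ladder can be sketched as a single function. This assumes the thresholds are applied to the report's overall score; `verdict_for` is a hypothetical name, and how `compose_report` derives that overall score from the six probe scores is not shown here.

```python
def verdict_for(overall: float) -> str:
    """Map an overall score to a verdict using the documented thresholds."""
    if overall >= 0.85:
        return "OK"
    if overall >= 0.60:
        return "MINOR"
    return "MAJOR"
```

Both boundaries are inclusive on the better side, so a score of exactly 0.85 is `OK` and exactly 0.60 is `MINOR`.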

Output: a frozen `FailureReport` (`run_id` / `base` / `adapter` / `scores` / `overall` / `soup_version` / `extras`) plus an embeddable 6-cell SVG badge from `render_badge_svg` (HTML-escaped, so safe for HF model cards). Missing modes are filled in via `neutral_score`.
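
For readers unfamiliar with the two properties named above, here is a small sketch of what "frozen" and "HTML-escaped" buy you. `ReportSketch` mirrors the listed field names, but the field types are assumptions, and `badge_title` is a hypothetical helper, not `render_badge_svg`.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from html import escape
from typing import Mapping


@dataclass(frozen=True)
class ReportSketch:
    """Illustrative stand-in for FailureReport; field types are assumed."""
    run_id: str
    base: str
    adapter: str
    scores: Mapping[str, float]
    overall: float
    soup_version: str
    extras: Mapping[str, str] = field(default_factory=dict)


def badge_title(report: ReportSketch) -> str:
    """Build an SVG <title> element with the run id escaped for safe embedding."""
    return f"<title>{escape(report.run_id)}</title>"


report = ReportSketch(
    run_id="<script>alert(1)</script>",  # hostile input, to show escaping
    base="base-model",
    adapter="adapter-v1",
    scores={"format": 0.9},
    overall=0.9,
    soup_version="0.56.0",
)

try:
    report.overall = 0.1  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("report is immutable")

print(badge_title(report))
```

Freezing the report means a gate or registry cannot silently adjust a score after the fact, and escaping means a run id containing markup is rendered as text rather than interpreted.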

`soup train --diagnose-gate`

```bash
soup train --diagnose-gate ./evidence.json
```

Refuses the final checkpoint save on a MAJOR regression (typer.Exit(code=2)). The new diagnose_report artifact kind is registered in the v0.26 registry alongside eval_suite and canaries from v0.55.
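
If you drive training from a script or CI job, the exit code is the signal to act on. The wrapper below is a sketch using only the standard library; `gate_passed` and `GATE_BLOCKED` are hypothetical names, and the only fact taken from the text above is that a MAJOR regression exits with code 2.

```python
import subprocess
import sys

GATE_BLOCKED = 2  # exit code the text above gives for a MAJOR regression


def gate_passed(cmd: list[str]) -> bool:
    """Run a training command; return False if it exited with the gate code."""
    result = subprocess.run(cmd)
    if result.returncode == GATE_BLOCKED:
        return False
    result.check_returncode()  # raise CalledProcessError on any other non-zero exit
    return True


# Demonstration with a stand-in process that exits with code 2.
blocked = [sys.executable, "-c", "import sys; sys.exit(2)"]
print(gate_passed(blocked))
```

In a real pipeline the command would be something like `["soup", "train", "--diagnose-gate", "./evidence.json"]`, with a `False` result used to skip publishing the checkpoint.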

See also

  • [Eval-gated training](/docs/eval-gate) — pre-training gate at epoch boundaries
  • [Quant-check](/docs/quant-check) — quant-specific regression check
  • [Eval design](/docs/eval-design) — design the evidence your gate scores