Eval Depth (v0.65.0)

Failure-mode coverage goes from 6 to 10. The v0.56 diagnose report card stays the daily driver; v0.65 adds four optional deeper probes that you opt into when a run needs closer scrutiny.

`soup eval behavior` — pre/post safety diff

```bash
soup eval behavior <run-id> --battery xstest \
  --evidence ./responses.json --output ./behavior.json
```

Bundled safety, refusal, jailbreak, and sycophancy probe sets, scored pre-FT vs. post-FT against the v0.26 / v0.56 thresholds: OK ≥ 0.85, MINOR ≥ 0.60, MAJOR below 0.60.
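The banding above can be sketched in a few lines of Python. This is an illustration only, not soup's implementation: it assumes scores are fractions in [0, 1] and that MAJOR covers everything below 0.60, and the function names `severity` and `compare` are hypothetical.

```python
def severity(score: float) -> str:
    """Map a battery score in [0, 1] to the OK / MINOR / MAJOR bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.85:
        return "OK"
    if score >= 0.60:
        return "MINOR"
    return "MAJOR"  # assumption: anything below the MINOR floor


def compare(pre_score: float, post_score: float) -> dict:
    """Report both bands plus the signed change from pre-FT to post-FT."""
    return {
        "pre": severity(pre_score),
        "post": severity(post_score),
        "delta": round(post_score - pre_score, 4),
    }
```

A drop from the OK band into MINOR after fine-tuning is the kind of regression a pre/post diff is meant to surface.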

Five batteries at launch:

| Battery | What it catches |
| --- | --- |
| xstest | Over-refusal on benign prompts |
| harmbench | Jailbreak resistance |
| jailbreakbench | Jailbreak prompt-pair contrasts |
| elephant | Sycophancy / opinion-shifting |
| syceval | Sycophantic alignment to user |

Evidence is operator-supplied JSON of the form `{pre_responses, post_responses, oracle}`; there is no auto-rollout, so you generate the responses yourself. The file is capped at 16 MiB and opened with `O_NOFOLLOW` to guard against symlink swaps.
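A guarded load along these lines might look like the following sketch. It is not soup's actual loader: the function name `load_evidence` and the key check are assumptions, and the shape of each key's value is left unvalidated because it is not specified here.

```python
import json
import os

MAX_EVIDENCE_BYTES = 16 * 1024 * 1024  # 16 MiB cap from the text above
REQUIRED_KEYS = ("pre_responses", "post_responses", "oracle")


def load_evidence(path: str) -> dict:
    """Open without following symlinks, enforce the size cap, check keys."""
    # O_NOFOLLOW is POSIX-only; fall back to 0 where the flag is missing.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(path, flags)  # OSError if the final path component is a symlink
    with os.fdopen(fd, "rb") as fh:
        # Stat the open descriptor, not the path, so a swap cannot change the answer.
        if os.fstat(fh.fileno()).st_size > MAX_EVIDENCE_BYTES:
            raise ValueError("evidence file exceeds 16 MiB")
        data = json.loads(fh.read(MAX_EVIDENCE_BYTES + 1))
    if not isinstance(data, dict):
        raise ValueError("evidence must be a JSON object")
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"evidence is missing keys: {missing}")
    return data
```

Checking the size on the file descriptor rather than the path keeps the check and the read tied to the same file.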

`soup eval capability` — lm-eval-harness task surface

```bash
soup eval capability <run-id> --suite full --output ./capability.json
```

Ships validated lm-eval-harness task IDs for 7 bundled benchmarks: MMLU-Pro, GPQA, BBEH, AIME, MATH-500, HumanEval+, and SWE-bench-Verified. Four profiles are available: full, fast, math, and code.

Operator runs the harness; soup eval capability validates the task IDs and emits the runbook.

`soup eval checklist` — MFT / INV / DIR DSL

```yaml
# spec.yaml
tests:
  - name: capital_facts
    kind: mft                     # minimum functionality
    prompts: ["What is the capital of France?"]
    expected: ["Paris"]
  - name: paraphrase_stable
    kind: inv                     # invariance under paraphrase
    prompts: ["Capital of FR?", "Tell me FR capital"]
    expected: ["Paris"]
  - name: more_polite
    kind: dir                     # directional perturbation
    prompts: ["You're rude.", "You're being unhelpful."]
    expected: ["apolog"]
```

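A CheckList-style behavioral DSL with three test kinds: MFT (minimum functionality), INV (invariance), and DIR (directional). A spec holds up to 1,000 tests, the YAML file is capped at 1 MiB, and `enforce_under_cwd_and_no_symlink` is applied on file open.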
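To make the spec shape concrete, here is a minimal Python sketch that validates an already-parsed spec and scores one test. The matching rule is an assumption (a test passes when every prompt's response contains at least one `expected` string, case-insensitively), as are the names `validate_spec`, `run_test`, and the `generate` callable; how soup actually scores each kind is not described here.

```python
from typing import Callable

MAX_TESTS = 1000                      # per-spec limit stated above
KINDS = {"mft", "inv", "dir"}


def validate_spec(spec: dict) -> list:
    """Return the test list, or raise ValueError on a malformed spec."""
    tests = spec.get("tests")
    if not isinstance(tests, list) or not tests:
        raise ValueError("spec needs a non-empty 'tests' list")
    if len(tests) > MAX_TESTS:
        raise ValueError(f"spec has {len(tests)} tests; limit is {MAX_TESTS}")
    for t in tests:
        if t.get("kind") not in KINDS:
            raise ValueError(f"{t.get('name')!r}: kind must be one of {sorted(KINDS)}")
        if not t.get("prompts") or not t.get("expected"):
            raise ValueError(f"{t.get('name')!r}: prompts and expected are required")
    return tests


def run_test(test: dict, generate: Callable[[str], str]) -> bool:
    """ASSUMED rule: every response contains at least one expected substring."""
    wanted = [e.lower() for e in test["expected"]]
    for prompt in test["prompts"]:
        response = generate(prompt).lower()  # one model call per prompt
        if not any(w in response for w in wanted):
            return False
    return True
```

Under this reading, the `apolog` stem in the spec above would match both "apologize" and "apology" in a response.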

`soup eval irt-subset` — Rasch IRT cost-cut

```bash
soup eval irt-subset ./responses.jsonl --size small --output ./plan.json
```

Fits a 1-parameter Rasch IRT model to per-item correctness signals and selects a minimum-cost subset that preserves ranking power:

  • full = 100% of items
  • small = ~30% (Rasch information-weighted)
  • tiny = ~10%

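The math: P(correct | θ, β) = σ(θ - β), with item information I(β) = σ(-β) · σ(β), which is the Rasch information P · (1 - P) evaluated at θ = 0. Information peaks at β ≈ 0, so items with roughly 50/50 odds of a correct answer are the most informative. The kernel is pure Python, with no numpy/scipy dependency.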
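The two formulas translate directly into standard-library Python. The sketch below is illustrative rather than soup's kernel: `pick_subset` assumes item difficulties have already been fitted and simply keeps the highest-information items, which is one plausible reading of "information-weighted"; the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def p_correct(theta: float, beta: float) -> float:
    """Rasch model: P(correct | theta, beta) = sigmoid(theta - beta)."""
    return sigmoid(theta - beta)


def item_information(beta: float) -> float:
    """I(beta) = sigmoid(-beta) * sigmoid(beta), i.e. P * (1 - P) at theta = 0."""
    return sigmoid(-beta) * sigmoid(beta)


def pick_subset(difficulties: dict, fraction: float) -> list:
    """Keep the most informative items (ASSUMED rule: rank by I(beta))."""
    keep = max(1, round(len(difficulties) * fraction))
    ranked = sorted(
        difficulties,
        key=lambda item: item_information(difficulties[item]),
        reverse=True,
    )
    return ranked[:keep]
```

Because I(β) is symmetric around β = 0 and tops out at 0.25, very easy and very hard items are the first to be dropped as the subset shrinks.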

256 MiB JSONL cap; item-ID validated (≤256 chars, no null bytes).

See also

  • [Diagnose](/docs/diagnose) — v0.56 6-probe report card, the lighter daily driver.
  • [Post-train x-rays](/docs/post-train-xrays) — v0.66 mechanistic interpretability probes that stack on top of v0.65 behaviour evals.