Observability & Dev UX (v0.34.0)

Tools that explain *why* a run misbehaved instead of dumping a stack trace.

`soup why`

Heuristic explainer — reads the most recent (or named) run and surfaces plain-English diagnoses with concrete next steps.

bash
soup why                 # most recent run
soup why run_2026_abc    # specific run id (or prefix)

Detects: NaN/Inf loss, plateau (≥30 steps with <0.5% change), divergence (loss > 3× initial), persistent high gradient norm, learning rate outside the typical [1e-6, 5e-3] band. Pure rule-based — no model calls.

`soup tui`

Full-screen Textual dashboard. Two-pane: run list (left) + selected-run detail (right). r refreshes, q quits.

bash
pip install 'soup-cli[tui]'
soup tui --refresh 1.0 --limit 50

`soup train --profile`

Records a torch.profiler Chrome-trace over an early-steps window (default wait=1, warmup=1, active=5, repeat=1).

bash
soup train --config soup.yaml --profile
# → ./output/profiles/<run_id>.trace.json

Open in chrome://tracing or Perfetto.

Crash bundles — `.crash` files

When training fails, Soup auto-writes a self-contained .crash JSON to ./.soup-crashes/crash_<utc>_<hex>.crash containing:

  • Redacted error trace
  • Classified failure kind (oom / nan / cuda / dataloader / nccl / other)
  • GPU state at crash time
  • Env summary
  • Last-50 metric rows
  • Recursively-redacted config (hf_* / sk-* / Bearer ... tokens become <redacted>)

The output_dir is reduced to os.path.basename so $HOME doesn't leak. Bundle truncated to 1 MB.

Per-run cost

Every completed run stores an estimated cost ($ per run) computed from the captured GPU device name and duration.

bash
soup runs show <run_id>
# Cost: $4.21  Duration: 1h 38m  GPU: RTX 4090

CPU / MPS / unknown GPUs render (no fabricated zeros).

bash
soup cost --config soup.yaml             # estimate before training
soup cost --config soup.yaml --gpu H100  # specific GPU

`soup runs replay`

bash
soup runs replay <run_id>

Replay summary + downsampled loss curve from history (no live training restart).

Global `--log-level`

bash
soup --log-level verbose train --config soup.yaml
soup --log-level debug runs show <id>

Tiers: quiet | normal | verbose | debug. Wires a Rich-formatted logger on the soup namespace; debug enables timestamps + module paths.