Observability & Dev UX (v0.34.0)
Tools that explain *why* a run misbehaved instead of dumping a stack trace.
`soup why`
Heuristic explainer — reads the most recent (or named) run and surfaces plain-English diagnoses with concrete next steps.
soup why # most recent run
soup why run_2026_abc # specific run id (or prefix)
Detects: NaN/Inf loss, plateau (≥30 steps with <0.5% change), divergence (loss > 3× initial), persistent high gradient norm, and a learning rate outside the typical [1e-6, 5e-3] band. Purely rule-based; no model calls.
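The checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not Soup's actual implementation: the function name `diagnose`, its parameters, and the returned kind labels are hypothetical, the plateau rule is read as "relative spread over the last 30 steps under 0.5%", and the gradient-norm check is left out because the source gives no threshold for it.

```python
import math

def diagnose(losses, lrs=(), plateau_steps=30, plateau_tol=0.005,
             divergence_factor=3.0, lr_band=(1e-6, 5e-3)):
    """Return (kind, advice) pairs for one run's metric history (hypothetical helper)."""
    findings = []
    if not losses:
        return findings

    # Rule 1: any NaN or Inf loss value.
    if any(not math.isfinite(x) for x in losses):
        findings.append(("nan", "Loss became NaN/Inf; try a lower learning rate."))

    finite = [x for x in losses if math.isfinite(x)]

    # Rule 2: divergence, read here as any finite loss above 3x the initial loss.
    if finite and finite[0] > 0 and max(finite) > divergence_factor * finite[0]:
        findings.append(("divergence", "Loss exceeded 3x its initial value."))

    # Rule 3: plateau, read here as <0.5% relative spread over the last 30 steps.
    window = finite[-plateau_steps:]
    if len(window) >= plateau_steps:
        lo, hi = min(window), max(window)
        if hi > 0 and (hi - lo) / hi < plateau_tol:
            findings.append(("plateau", "Loss is flat; check data or raise the LR."))

    # Rule 4: learning rate outside the typical band.
    if any(not lr_band[0] <= lr <= lr_band[1] for lr in lrs):
        findings.append(("lr_out_of_band", "Learning rate is outside [1e-6, 5e-3]."))

    return findings

# Example: a run that stays flat for 40 steps.
kinds = [kind for kind, _ in diagnose([2.0] * 40)]
```

Because every rule is a plain comparison over recorded metrics, a diagnosis of this kind is deterministic and needs no network access.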
`soup tui`
Full-screen Textual dashboard. Two-pane: run list (left) + selected-run detail (right). r refreshes, q quits.
pip install 'soup-cli[tui]'
soup tui --refresh 1.0 --limit 50
`soup train --profile`
Records a torch.profiler Chrome-trace over an early-steps window (default wait=1, warmup=1, active=5, repeat=1).
soup train --config soup.yaml --profile
# → ./output/profiles/<run_id>.trace.json
Open in chrome://tracing or Perfetto.
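To see which training steps the default window covers, the sketch below maps a step index to its phase. It follows the documented meaning of the `wait`, `warmup`, `active`, and `repeat` arguments of `torch.profiler.schedule`, assuming zero-based steps and no skipped initial steps. It is a pure-Python illustration (the name `profiler_phase` is hypothetical) and does not call PyTorch or Soup.

```python
def profiler_phase(step, wait=1, warmup=1, active=5, repeat=1):
    """Phase of a zero-based step under a wait/warmup/active/repeat schedule."""
    cycle = wait + warmup + active
    # repeat=0 means the cycle continues indefinitely; otherwise it stops after `repeat` cycles.
    if repeat > 0 and step >= cycle * repeat:
        return "done"
    pos = step % cycle
    if pos < wait:
        return "wait"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # tracing on, results discarded
    return "active"        # steps that end up in the trace

phases = [profiler_phase(step) for step in range(8)]
```

With the defaults, step 0 is idle, step 1 warms up, steps 2 through 6 are recorded, and profiling is finished from step 7 onward.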
Crash bundles — `.crash` files
When training fails, Soup auto-writes a self-contained .crash JSON to ./.soup-crashes/crash_<utc>_<hex>.crash containing:
- Redacted error trace
- Classified failure kind (`oom/nan/cuda/dataloader/nccl/other`)
- GPU state at crash time
- Env summary
- Last-50 metric rows
- Recursively-redacted config (`hf_*/sk-*/Bearer ...` tokens become `<redacted>`)
The output_dir is reduced to its os.path.basename so $HOME doesn't leak. The bundle is truncated to 1 MB.
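The recursive redaction and the `output_dir` reduction could look roughly like the sketch below. The regular expressions, the function names `redact` and `scrub_config`, and the sample config keys are assumptions made for illustration; the source only states which token shapes are replaced and that `os.path.basename` is applied.

```python
import os
import re

# Assumed token shapes; the real patterns used by Soup are not given in the source.
SECRET_PATTERNS = [
    re.compile(r"\bhf_[A-Za-z0-9]+"),
    re.compile(r"\bsk-[A-Za-z0-9_-]+"),
    re.compile(r"\bBearer\s+\S+"),
]

def redact(value):
    """Walk nested dicts/lists and replace secret-looking substrings in strings."""
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("<redacted>", value)
    return value

def scrub_config(config):
    """Return a redacted copy of config with output_dir cut down to its last component."""
    cleaned = redact(config)
    if isinstance(cleaned.get("output_dir"), str):
        cleaned["output_dir"] = os.path.basename(cleaned["output_dir"].rstrip("/"))
    return cleaned

# Hypothetical config used only to exercise the sketch.
sample = {
    "output_dir": "/home/alice/runs/exp1",
    "hub": {"token": "hf_abcDEF123", "headers": ["Authorization: Bearer xyz.123"]},
    "api_key": "sk-test_123",
    "lr": 2e-4,
}
scrubbed = scrub_config(sample)
```

Building a new structure rather than mutating the input keeps the live training config intact while the crash bundle is assembled.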
Per-run cost
Every completed run stores an estimated cost ($ per run) computed from the captured GPU device name and duration.
soup runs show <run_id>
# Cost: $4.21 Duration: 1h 38m GPU: RTX 4090
CPU, MPS, and unknown GPUs render `—` instead of a cost (no fabricated zeros).
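A cost estimate of this kind reduces to an hourly rate multiplied by the run duration, with unknown devices yielding no value rather than zero. The sketch below assumes a single GPU and uses a made-up rate table; the rates, the names `HOURLY_USD`, `estimate_cost`, and `render_cost` are all hypothetical and do not reflect Soup's pricing data.

```python
# Hypothetical USD-per-hour rates, for illustration only.
HOURLY_USD = {"H100": 3.00, "RTX 4090": 0.50}

def estimate_cost(gpu_name, duration_seconds, rates=HOURLY_USD):
    """Return the estimated cost in USD, or None when the device has no known rate."""
    rate = rates.get(gpu_name)
    if rate is None:
        return None  # CPU, MPS, or unknown GPU: do not invent a number
    return round(rate * duration_seconds / 3600, 2)

def render_cost(cost):
    """Format a cost for display, using a dash placeholder when it is unknown."""
    return "\u2014" if cost is None else f"${cost:.2f}"

two_hours_h100 = render_cost(estimate_cost("H100", 2 * 3600))
unknown_device = render_cost(estimate_cost("cpu", 2 * 3600))
```

Returning `None` for unknown devices keeps "no estimate available" distinct from a genuinely free run.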
soup cost --config soup.yaml # estimate before training
soup cost --config soup.yaml --gpu H100 # specific GPU
`soup runs replay`
soup runs replay <run_id>
Replay summary + downsampled loss curve from history (no live training restart).
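Downsampling a stored loss history for display can be done by averaging consecutive buckets. The sketch below shows one common approach under that assumption; the source does not say which method Soup uses, and the name `downsample` and the `max_points` default are hypothetical.

```python
def downsample(values, max_points=50):
    """Average consecutive buckets so that at most max_points values remain."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    n = len(values)
    if n <= max_points:
        return list(values)  # already short enough, keep every point
    result = []
    for i in range(max_points):
        # Integer bucket boundaries cover the whole series without gaps or overlap.
        start = i * n // max_points
        end = (i + 1) * n // max_points
        bucket = values[start:end]
        result.append(sum(bucket) / len(bucket))
    return result

curve = downsample([float(x) for x in range(10)], max_points=3)
```

Averaging smooths step-to-step noise; taking every k-th point would be an equally valid choice if exact recorded values matter more than the overall trend.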
Global `--log-level`
soup --log-level verbose train --config soup.yaml
soup --log-level debug runs show <id>
Tiers: quiet | normal | verbose | debug. Wires a Rich-formatted logger on the soup namespace; debug enables timestamps + module paths.
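The tier-to-logger wiring can be pictured with the standard library alone. In the sketch below, the mapping from tiers to `logging` levels is an assumption (the source does not state it), a plain `StreamHandler` stands in for the Rich handler, and `configure_logging` is a hypothetical name.

```python
import logging

# Assumed mapping from CLI tiers to stdlib levels; not confirmed by the source.
LEVELS = {
    "quiet": logging.ERROR,
    "normal": logging.WARNING,
    "verbose": logging.INFO,
    "debug": logging.DEBUG,
}

def configure_logging(tier="normal"):
    """Configure the 'soup' logger namespace for the given tier and return it."""
    if tier not in LEVELS:
        raise ValueError(f"unknown log level: {tier!r}")
    logger = logging.getLogger("soup")
    logger.setLevel(LEVELS[tier])
    # Only the debug tier includes timestamps and the logger (module) path.
    if tier == "debug":
        fmt = "%(asctime)s %(name)s %(levelname)s %(message)s"
    else:
        fmt = "%(levelname)s %(message)s"
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(fmt))
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger

log = configure_logging("debug")
logging.getLogger("soup.train").debug("child loggers inherit the tier")
```

Configuring the namespace root means modules that log under names such as `soup.train` pick up the tier without any per-module setup.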