Training Intelligence

v0.25.0 adds two training-time subsystems that no other CLI ships: catastrophic forgetting detection and checkpoint intelligence.

Catastrophic forgetting detection

Runs a mini benchmark on the base model *before* training, then repeats it on the current checkpoint every `forgetting_eval_steps` steps. If general-knowledge accuracy drops by more than your threshold, the Rich dashboard turns yellow; cross the severe line and training can auto-stop.
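The baseline-then-recheck loop can be sketched in a few lines. This is a hypothetical illustration, not soup's actual internals: `accuracy`, `check_forgetting`, and the yellow-at-half-threshold rule are all assumptions.

```python
# Hypothetical sketch of forgetting detection; names and the
# yellow/red split are illustrative, not soup's real internals.

def accuracy(model, benchmark):
    """Fraction of (question, answer) pairs the model gets right."""
    return sum(model(q) == a for q, a in benchmark) / len(benchmark)

def check_forgetting(model, benchmark, baseline, threshold=0.10):
    """Compare current accuracy against the pre-training baseline."""
    current = accuracy(model, benchmark)
    delta = baseline - current            # positive = accuracy lost
    if delta > threshold:
        return "red", delta               # severe: may auto-stop
    if delta > threshold / 2:
        return "yellow", delta            # warn on the dashboard
    return "green", delta
```

The baseline is measured once before step 0; every recheck compares against that fixed number, not the previous eval, so slow drift is caught as reliably as a sudden drop.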

Config

```yaml
training:
  forgetting_detection: true
  forgetting_eval_steps: 100
  forgetting_threshold: 0.10       # warn if accuracy drops >10%
  forgetting_benchmark: mini_mmlu  # mini_mmlu | mini_common_sense | mini_instruction
  forgetting_stop: false           # auto-stop on severe forgetting
```

Built-in benchmarks

Each is a 100-question set embedded in the source (no external downloads):

  • mini_mmlu — diverse MMLU coverage (STEM / humanities / social sciences)
  • mini_common_sense — common-sense reasoning
  • mini_instruction — instruction-following quality
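"Embedded in the source" can be as simple as a module-level constant of question records plus a scoring loop. The shape below is a guess at what such a set looks like; the sample entry and the `score` helper are hypothetical:

```python
# Hypothetical shape of an embedded benchmark: a module-level constant,
# so evaluation needs no downloads or file I/O at runtime.
MINI_MMLU = [
    {"question": "What is the SI unit of force?",
     "choices": ["joule", "newton", "watt", "pascal"],
     "answer": 1},
    # ...the real set holds 100 entries
]

def score(predict, benchmark):
    """predict(question, choices) -> chosen index; returns accuracy."""
    correct = sum(predict(item["question"], item["choices"]) == item["answer"]
                  for item in benchmark)
    return correct / len(benchmark)
```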

Checkpoint intelligence

HF Trainer's "best checkpoint" is the one with the lowest loss. But lower loss ≠ better model — an overfitted checkpoint can hit a low loss while real-world output quality degrades. Checkpoint intelligence runs a quality eval *during* training and tags best_quality separately from best_loss.
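Tracking the two bests separately is a small amount of bookkeeping. A minimal sketch (the class and its field layout are assumptions, not soup's actual code):

```python
# Hypothetical tracker that tags best_loss and best_quality
# independently, so an overfit low-loss step can't shadow the
# checkpoint that actually generates best.
class CheckpointTracker:
    def __init__(self):
        self.best_loss = (float("inf"), None)      # (loss, step)
        self.best_quality = (float("-inf"), None)  # (score, step)

    def record(self, step, loss, quality):
        if loss < self.best_loss[0]:
            self.best_loss = (loss, step)
        if quality > self.best_quality[0]:
            self.best_quality = (quality, step)
```

With the numbers from the dashboard example below, step 450 wins on loss while step 300 keeps the quality crown — exactly the divergence this subsystem exists to surface.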

Config

```yaml
training:
  checkpoint_intelligence: true
  checkpoint_eval_steps: 200
  checkpoint_eval_metric: composite   # judge | mmlu | custom | composite
  checkpoint_eval_tasks: eval.jsonl   # optional custom eval
  checkpoint_keep_top: 3              # delete the rest
  early_stop_on_regression: true
  early_stop_patience: 2
```
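The regression early stop can be sketched as a patience counter over successive quality-eval scores. This assumes "regression" means failing to beat the best score seen so far; the function name is hypothetical:

```python
# Hypothetical early-stop rule: stop after `patience` consecutive
# quality evals that fail to beat the best score seen so far.
def should_stop(scores, patience=2):
    best = float("-inf")
    bad_streak = 0
    for s in scores:
        if s > best:
            best, bad_streak = s, 0
        else:
            bad_streak += 1
        if bad_streak >= patience:
            return True
    return False
```

With `early_stop_patience: 2`, two eval rounds in a row without a new quality high-water mark end the run.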

Dashboard

```
Epoch 2/3 ████████ loss: 0.89  step 450/720
  → Loss best:     step-450 (loss 0.89)
  → Quality best:  step-300 (judge 8.2/10) ⭐
  → Gen knowledge: 91.2% (baseline 95.0%, -3.8% ⚠)
  → Last eval:     +0.4 judge points (improving)
```

After training, the best_quality checkpoint is linked at ./output/best_quality/.

Storage

Both subsystems extend the SQLite experiment tracker (~/.soup/experiments.db) with two new tables:

  • checkpoint_quality(run_id, step, metric, score, is_best, created_at)
  • forgetting_eval(run_id, step, benchmark, accuracy, baseline, delta, warning_level)

Inspect them with `soup runs show <run_id>` or query the database directly.
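Because the tracker is plain SQLite, direct queries need nothing beyond the standard library. A hypothetical helper using the checkpoint_quality columns listed above (the run_id value in the usage note is illustrative):

```python
import sqlite3

# Hypothetical query helper; column names match the checkpoint_quality
# schema listed above.
def best_quality_steps(db_path, run_id):
    """Return (step, metric, score) rows flagged as best for a run."""
    with sqlite3.connect(db_path) as db:
        return db.execute(
            "SELECT step, metric, score FROM checkpoint_quality "
            "WHERE run_id = ? AND is_best = 1 ORDER BY step",
            (run_id,),
        ).fetchall()
```

Pass `os.path.expanduser("~/.soup/experiments.db")` as `db_path`; the same pattern works against forgetting_eval.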

Safety

  • Benchmark data is embedded in code — no external file loading at runtime.
  • Eval intervals are bounded (10 ≤ steps ≤ 10,000) to prevent runaway eval overhead.
  • Pruning only deletes files inside the run's output_dir and never follows symlinks outside.
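The pruning containment rule amounts to: resolve the candidate path (following any symlinks) and refuse unless the result still sits strictly inside the run's output_dir. A minimal sketch of that check, assuming this is how the guard works:

```python
from pathlib import Path

# Hypothetical containment guard mirroring the pruning rule above:
# resolve() follows symlinks, so a link pointing outside output_dir
# resolves to an external path and is rejected.
def safe_to_delete(candidate: str, output_dir: str) -> bool:
    root = Path(output_dir).resolve()
    target = Path(candidate).resolve()
    # Strictly inside: root itself is never a deletion target.
    return root in target.parents
```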

Defaults

Autopilot turns both subsystems on by default. If you hand-write soup.yaml, flip the flags above.