Loop Hardening (v0.70.0)
Six surfaces that protect the training loop from failure modes that can cost real GPU-hours. The release is schema-only today: every live callback and every math kernel that needs a Trainer hook validates its inputs and then raises `NotImplementedError` with an explicit v0.70.1 marker. The math kernels that do not need a Trainer hook (cluster-separation, RM-ensemble divergence, echo-trap n-gram repetition) are all LIVE.
`--reward-hack-detector` — InfoRM + RM-ensemble divergence
soup train --task grpo --base-model meta-llama/Llama-3.1-8B \
--reward-model registry://rm-v1 \
--reward-hack-detector info_rm \
--reward-hack-halt

Two detectors:
- `info_rm`: InfoRM Cluster-Separation Index (Wang et al. 2024, [arXiv 2402.09345](https://arxiv.org/abs/2402.09345)). Drops when the policy collapses onto a degenerate reward-maximising subspace.
- `rm_ensemble`: mean pairwise variance across an RM ensemble (cap 32). When ensemble members disagree, the policy is exploiting one of them.
The math kernels `compute_cluster_separation`, `compute_rm_ensemble_divergence`, and `classify_hack_signal` are LIVE, with OK / WARN / HACK bands at 0.10 / 0.30 relative drop. `--reward-hack-halt` auto-stops on HACK (exit 2). Cross-validator: `task` must be in {grpo, ppo}, `halt=True` requires a detector, and the mlx backend is rejected. Composes with v0.34 `soup why` for anomaly explanation. The live HF Trainer callback ships in v0.70.1.
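To make the banding concrete, here is a minimal sketch of how a relative drop could be mapped onto OK / WARN / HACK, and how ensemble disagreement could be scored. The function names (`relative_drop`, `classify_drop`, `ensemble_divergence`) are hypothetical and are not Soup's kernel signatures. The inclusive thresholds and the reading of "mean pairwise variance" as a mean squared pairwise difference are assumptions.

```python
from itertools import combinations
from statistics import mean

WARN_DROP = 0.10  # relative drop at which the band becomes WARN (from the text)
HACK_DROP = 0.30  # relative drop at which the band becomes HACK (from the text)


def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop of the separation index relative to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # An index that rose above the baseline counts as no drop at all.
    return max(0.0, (baseline - current) / baseline)


def classify_drop(baseline: float, current: float) -> str:
    """Map a relative drop onto OK / WARN / HACK (inclusive bounds assumed)."""
    drop = relative_drop(baseline, current)
    if drop >= HACK_DROP:
        return "HACK"
    if drop >= WARN_DROP:
        return "WARN"
    return "OK"


def ensemble_divergence(scores: list[float]) -> float:
    """Mean squared pairwise difference between RM scores for one sample.

    Assumption: this is one plausible reading of "mean pairwise variance".
    """
    if len(scores) < 2:
        raise ValueError("need at least two reward models")
    return mean((a - b) ** 2 for a, b in combinations(scores, 2))
```

For example, a separation index falling from 1.0 to 0.8 is a 20% relative drop and lands in the WARN band under these assumptions, while a fall to 0.5 lands in HACK.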
`--uld-strategy` — cross-tokenizer Universal Logit Distillation
# soup.yaml
task: distill
training:
  uld_strategy: wasserstein   # or: topk_align
  uld_top_k: 32               # required for topk_align

Boizard et al. 2024 ([arXiv 2402.12030](https://arxiv.org/abs/2402.12030)). Llama → Mistral, Llama → Qwen: no shared vocabulary required.
- `wasserstein`: 1-D Wasserstein distance over sorted teacher / student logits, no alignment (cheap, robust default)
- `topk_align`: top-K teacher logits matched via BPE-overlap heuristic alignment (use when you have a good vocab-overlap heuristic and want sharper signal)
- `_MAX_VOCAB_SIZE=262144` covers multilingual SentencePiece + GPT-OSS 200K vocabularies
- Gated to `task='distill'` and rejects mlx backend
- Live projection module ships in v0.70.1
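The `wasserstein` strategy can be illustrated with a small pure-Python sketch for a single token position. It assumes the distance is taken over softmax probabilities sorted in descending order, with the smaller vocabulary zero-padded, which is how the ULD paper describes the loss. The function names are hypothetical, and this is not Soup's projection module (which is not live yet).

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sorted_wasserstein(teacher_logits: list[float],
                       student_logits: list[float]) -> float:
    """L1 gap between descending-sorted teacher and student probabilities."""
    teacher = sorted(softmax(teacher_logits), reverse=True)
    student = sorted(softmax(student_logits), reverse=True)
    # Zero-pad the smaller vocabulary so both vectors have equal length.
    size = max(len(teacher), len(student))
    teacher += [0.0] * (size - len(teacher))
    student += [0.0] * (size - len(student))
    # Sorting removes any dependence on token identity, so no
    # vocabulary alignment is needed.
    return sum(abs(t - s) for t, s in zip(teacher, student))
```

Because both distributions are sorted before comparison, permuting the vocabulary of either model leaves the value unchanged. That is what makes the strategy usable across tokenizers, at the cost of ignoring which token carries the mass.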
`--minillm-enabled` — reverse-KL with 3 stability tricks bundled
task: distill
training:
  minillm_enabled: true
  minillm_teacher_mix_ratio: 0.3
  minillm_length_normalize: true
  minillm_pretrain_anchor_weight: 0.1
  minillm_pretrain_anchor_path: ./pretrain.jsonl

Gu et al. 2024 ([arXiv 2306.08543](https://arxiv.org/abs/2306.08543)). All three §3 stability tricks are bundled: teacher-mixed sampling (mix teacher samples into the on-policy rollout), length normalisation (per-token KL averaged), and a pretrain-loss anchor (regularise toward an anchor distribution at weight α).
Cross-validators reject silent no-ops:
- `anchor_weight=0` with `anchor_path` set → error
- `anchor_weight > 0` with `path = None` → error
Gated to task='distill'. Live callback ships in v0.70.1.
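The two cross-validation rules above can be sketched as a frozen dataclass that refuses inconsistent anchor settings at construction time. `MiniLLMConfig` and its field names are hypothetical, chosen to mirror the `minillm_*` keys. Soup's actual schema class may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniLLMConfig:
    """Hypothetical container mirroring the minillm_* config keys."""

    enabled: bool = False
    teacher_mix_ratio: float = 0.0
    length_normalize: bool = True
    pretrain_anchor_weight: float = 0.0
    pretrain_anchor_path: Optional[str] = None

    def __post_init__(self) -> None:
        has_path = self.pretrain_anchor_path is not None
        # A path with zero weight would be loaded and then ignored.
        if self.pretrain_anchor_weight == 0 and has_path:
            raise ValueError(
                "pretrain_anchor_path is set but pretrain_anchor_weight is 0"
            )
        # A positive weight with no data has nothing to anchor to.
        if self.pretrain_anchor_weight > 0 and not has_path:
            raise ValueError(
                "pretrain_anchor_weight > 0 requires pretrain_anchor_path"
            )
```

Failing at construction time means a misconfigured run stops before any GPU time is spent, instead of silently training without the anchor.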
`--rl-checkpoint-save-every-steps` — mid-epoch PPO/GRPO ckpt
soup train --task ppo --base-model ... \
--rl-checkpoint-save-every-steps 200 \
--rl-checkpoint-keep-last 4 \
--rl-checkpoint-include-optimizer \
--rl-checkpoint-include-ref-model \
--rl-checkpoint-include-rollout-buffer

TorchTune explicitly punts on mid-epoch checkpointing. Soup ships the schema today with bounds `save_every_steps` ∈ [1, 10M] and `keep_last` ∈ [1, 100] (oldest pruned).
Composes with v0.32 spike recovery + v0.40 reference-model regen — recovery now hops to the most recent mid-epoch ckpt instead of restarting the epoch on a PPO crash. Live save_state / load_state ships in v0.70.1.
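The save cadence and the keep-last pruning rule can be sketched as two small helpers. `should_save` and `prune_oldest` are hypothetical names, and the code only models the bookkeeping over step numbers. It does not write or delete any checkpoint files.

```python
MAX_SAVE_EVERY = 10_000_000  # upper bound on save_every_steps (from the text)
MAX_KEEP_LAST = 100          # upper bound on keep_last (from the text)


def should_save(step: int, save_every_steps: int) -> bool:
    """True when a mid-epoch checkpoint is due at this optimizer step."""
    if not 1 <= save_every_steps <= MAX_SAVE_EVERY:
        raise ValueError("save_every_steps must be in [1, 10M]")
    return step > 0 and step % save_every_steps == 0


def prune_oldest(saved_steps: list[int],
                 keep_last: int) -> tuple[list[int], list[int]]:
    """Split saved checkpoint steps into (kept, pruned), oldest pruned first."""
    if not 1 <= keep_last <= MAX_KEEP_LAST:
        raise ValueError("keep_last must be in [1, 100]")
    ordered = sorted(saved_steps)
    return ordered[-keep_last:], ordered[:-keep_last]
```

With `--rl-checkpoint-save-every-steps 200` and `--rl-checkpoint-keep-last 4`, the fifth checkpoint (step 1000) would evict the one from step 200 under this sketch.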
`soup iterative-dpo` — sample → score → re-pair → retrain driver
soup iterative-dpo --base-model registry://policy-v3 \
--reward-model registry://rm-v1 \
--prompts ./prompts.jsonl \
--output-dir ./iter-dpo \
--rounds 4 --pairs-per-round 4000

Frozen `IterativeDPOPlan` with a consecutive-`round_index` invariant and canonical per-round artifacts:
./iter-dpo/round-01/pairs.jsonl
./iter-dpo/round-01/adapter/
./iter-dpo/round-02/pairs.jsonl
./iter-dpo/round-02/adapter/
...

This layout lets a crashed run resume cleanly. `--plan-only` renders the validated plan and exits 0; the live runner (a subprocess `soup train --task dpo --resume` between rounds) ships in v0.70.1.
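As an illustration of the plan structure, the sketch below builds per-round artifact paths in the layout shown above and checks the consecutive-index invariant. `RoundPlan`, `build_plan`, and `check_consecutive` are hypothetical names. The real `IterativeDPOPlan` may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundPlan:
    """One round of the plan; frozen so it cannot drift after validation."""

    round_index: int
    pairs_path: str
    adapter_dir: str


def build_plan(output_dir: str, rounds: int) -> tuple[RoundPlan, ...]:
    """Build 1-based, zero-padded round entries under output_dir."""
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    root = output_dir.rstrip("/")
    plan = []
    for index in range(1, rounds + 1):
        round_dir = f"{root}/round-{index:02d}"
        plan.append(
            RoundPlan(index, f"{round_dir}/pairs.jsonl", f"{round_dir}/adapter/")
        )
    return tuple(plan)


def check_consecutive(plan: tuple[RoundPlan, ...]) -> None:
    """Raise if round indices are not exactly 1, 2, 3, ... in order."""
    for expected, entry in enumerate(plan, start=1):
        if entry.round_index != expected:
            raise ValueError(
                f"expected round_index {expected}, got {entry.round_index}"
            )
```

Deterministic paths are what make resumption simple: a runner only needs to find the first round whose artifacts are missing and continue from there.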
`--echo-trap-enabled` — RAGEN multi-turn n-gram repetition detector
soup train --task grpo ... \
--echo-trap-enabled \
--echo-trap-threshold 0.6 \
--echo-trap-halt

Zhu et al. 2025 ([arXiv 2504.14437](https://arxiv.org/abs/2504.14437)). Pure-Python n-gram repetition rate per trajectory, plus a batch mean. When an agent's rollout collapses into "echoing itself" (the same n-gram pattern appearing repeatedly within and across turns), this catches it before the reward model rewards the degenerate policy.
OK / WARN / TRAP bands at 0.30 / 0.60. DoS caps: `_MAX_NGRAM_N=32`, `_MAX_TRAJECTORY_TOKENS=1M`, `_MAX_BATCH_TRAJECTORIES=100k`. Gated to `task` in {grpo, ppo} on non-mlx backends. Composes with v0.53.11 `GRPOStabilityCallback`. The live callback ships in v0.70.1; the math kernel is LIVE.
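A minimal sketch of an n-gram repetition rate and its banding follows. It assumes the rate is the fraction of n-grams in a trajectory that duplicate an earlier one, that the default is n = 3, and that the thresholds are inclusive. None of these are confirmed by the text, and the function names are hypothetical, not Soup's kernel API.

```python
from collections import Counter
from typing import Sequence

MAX_NGRAM_N = 32   # mirrors the _MAX_NGRAM_N cap quoted above
WARN_RATE = 0.30   # WARN band boundary (from the text)
TRAP_RATE = 0.60   # TRAP band boundary (from the text)


def ngram_repetition_rate(tokens: Sequence[str], n: int = 3) -> float:
    """Fraction of n-grams that repeat an earlier n-gram in the trajectory."""
    if not 1 <= n <= MAX_NGRAM_N:
        raise ValueError("n must be in [1, 32]")
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # too short to contain a single n-gram
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return (total - len(counts)) / total


def classify_rate(rate: float) -> str:
    """Map a repetition rate onto OK / WARN / TRAP (inclusive bounds assumed)."""
    if rate >= TRAP_RATE:
        return "TRAP"
    if rate >= WARN_RATE:
        return "WARN"
    return "OK"


def batch_mean_rate(trajectories: Sequence[Sequence[str]], n: int = 3) -> float:
    """Mean repetition rate over a batch of trajectories."""
    if not trajectories:
        raise ValueError("batch must not be empty")
    return sum(ngram_repetition_rate(t, n) for t in trajectories) / len(trajectories)
```

A trajectory that alternates two tokens, such as `["a", "b"] * 4` scored with bigrams, has 7 bigrams of which only 2 are distinct, so its rate is 5/7 and it falls in the TRAP band under these assumptions.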
Numbers
+337 tests in v0.70.0 (11,487 → 11,824) across 6 new test files plus follow-up boundary tests. Combined v0.68 → v0.70 net: +803 tests across 17 new test files (251 → 268).
See also
- [Adapter lifecycle (v0.67)](/docs/adapter-lifecycle): `soup adapters bisect` finds which mid-epoch ckpt regressed.
- [Anti-trend insurance (v0.68)](/docs/anti-trend-insurance): `soup distill-prompt` + ULD pair up to bridge tokeniser gaps.
- [Soup Loop (v0.58)](/docs/soup-loop): iterative-DPO runs inside a `soup loop` iteration.