Multi-Trainer Speed & Memory (v0.35.0)
v0.35 completes the v0.28 → v0.33 → v0.35 expansion: the four big speed-and-memory features are now wired into every transformer-backend trainer.
What's now multi-trainer
| Feature | v0.28 | v0.33 | v0.35 |
|---|---|---|---|
| use_cut_ce | SFT | SFT, DPO, Pretrain | All 11 |
| quantization_aware: "fp8" | SFT | SFT, DPO, Pretrain | All 11 |
| kernel_auto_compose | SFT | SFT, DPO, Pretrain | All 11 |
| activation_offloading | SFT | SFT, DPO, Pretrain | All 11 |
"All 11" = SFT, DPO, GRPO, KTO, ORPO, SimPO, IPO, PPO, Reward-Model, Embedding, Pretrain. MLX backend and unknown tasks each emit distinct error messages so users get the right fix.
fp8 / int8 QAT guard
Six trainer wrappers (GRPO / KTO / ORPO / SimPO / IPO / PPO) used a bare if tcfg.quantization_aware: check, which routed the string "fp8" into the legacy int8 prepare_model_for_qat path. Each guard now also requires quantization_aware != "fp8", matching the DPO / Pretrain pattern.
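A minimal sketch of the corrected guard shape, assuming illustrative wrapper internals; only the comparison pattern and the prepare_model_for_qat name come from the change itself.

```python
# Sketch of the per-wrapper guard. `tcfg` and the surrounding function are illustrative;
# prepare_model_for_qat is the library-internal int8 helper named above.
def maybe_prepare_qat(model, tcfg):
    qa = getattr(tcfg, "quantization_aware", None)
    if qa and qa != "fp8":
        # Legacy int8 QAT: only truthy, non-"fp8" settings reach prepare_model_for_qat.
        model = prepare_model_for_qat(model)
    # qa == "fp8" is handled by the fp8 machinery elsewhere and must never hit the int8 path.
    return model
```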
Activation offloading hooks installed everywhere
The shared helper in utils/v028_features.py now enforces is_under_cwd containment for activation_offloading="disk" and reports only os.path.basename of the offending path in the ValueError message, so absolute $HOME paths don't leak. All 11 trainer wrappers now wrap trainer.train() in this context manager, closing a v0.33 oversight where DPO / Pretrain accepted the flag but never installed the offload hooks.
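A rough sketch of the containment check and the context-manager wrapping, assuming hypothetical names (_is_under_cwd is modeled on the is_under_cwd helper above; activation_offloading_ctx and install_offload_hooks are stand-ins); the real code in utils/v028_features.py will differ in detail.

```python
import os
from contextlib import contextmanager

def _is_under_cwd(path: str) -> bool:
    # Resolve symlinks and relative segments before comparing against the working directory.
    real = os.path.realpath(path)
    cwd = os.path.realpath(os.getcwd())
    return real == cwd or real.startswith(cwd + os.sep)

@contextmanager
def activation_offloading_ctx(model, mode, offload_dir="./activation_offload"):
    # "disk" offloading refuses directories outside the CWD; only the basename is echoed
    # back so absolute $HOME paths never end up in the error message.
    if mode == "disk" and not _is_under_cwd(offload_dir):
        raise ValueError(
            "activation_offloading='disk' requires a directory under the CWD, got: "
            + os.path.basename(offload_dir)
        )
    handles = install_offload_hooks(model, mode, offload_dir)  # hypothetical hook installer
    try:
        yield
    finally:
        for handle in handles:
            handle.remove()
```

Each trainer wrapper then calls trainer.train() inside this context manager, so the offload hooks exist for exactly the duration of training.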
Kernel benchmarking forward-only
benchmark_kernel_combos runs forward-only under torch.no_grad(), eliminating the corruption hazard of gradients accumulating onto the live training model. Caller-supplied bounds are clamped (batch size ≤ 32, sequence length ≤ 512, steps ≤ 50, vocab ≤ 200_000) so a misconfigured caller cannot OOM the CI runner. The function returns a new list; the input candidates are never mutated.
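The shape of that function might look like the sketch below; the clamp values come from the notes above, everything else (including the elided step that applies each kernel combo) is illustrative.

```python
import time
import torch

# Upper bounds from the notes above; the rest of this sketch is illustrative, not the
# exact body of benchmark_kernel_combos.
MAX_BS, MAX_SL, MAX_STEPS, MAX_VOCAB = 32, 512, 50, 200_000

def benchmark_kernel_combos(model, candidates, bs=8, sl=256, steps=10, vocab=32_000):
    bs, sl = min(bs, MAX_BS), min(sl, MAX_SL)
    steps, vocab = min(steps, MAX_STEPS), min(vocab, MAX_VOCAB)
    device = next(model.parameters()).device
    dummy_ids = torch.randint(0, vocab, (bs, sl), device=device)
    results = []                           # fresh list: `candidates` is never mutated
    for combo in candidates:
        # (Applying `combo` to the model is elided in this sketch.)
        with torch.no_grad():              # forward-only: no gradients touch the live model
            start = time.perf_counter()
            for _ in range(steps):
                model(dummy_ids)
            elapsed = time.perf_counter() - start
        results.append({"combo": combo, "seconds": elapsed})
    return results
```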
Quant menu kwarg allowlist
The vLLM quantization kwarg is now an explicit named parameter on create_vllm_engine (not a **kwargs splat) and is validated against the closed allowlist {awq, gptq, fp8}. The earlier kwarg-splat path could have leaked arbitrary AsyncEngineArgs fields to the engine constructor; the named-parameter refactor eliminates that.
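A hedged sketch of the named-parameter surface, assuming vLLM's AsyncEngineArgs / AsyncLLMEngine API and an invented tensor_parallel_size passthrough; the real wrapper likely accepts more arguments.

```python
from vllm import AsyncEngineArgs, AsyncLLMEngine

ALLOWED_QUANT = {"awq", "gptq", "fp8"}

def create_vllm_engine(model_path: str, quantization: str | None = None,
                       tensor_parallel_size: int = 1):
    # quantization is an explicit named parameter, so callers can no longer smuggle
    # arbitrary AsyncEngineArgs fields through a **kwargs splat.
    if quantization is not None and quantization not in ALLOWED_QUANT:
        raise ValueError(
            f"quantization must be one of {sorted(ALLOWED_QUANT)}, got {quantization!r}"
        )
    args = AsyncEngineArgs(
        model=model_path,
        quantization=quantization,
        tensor_parallel_size=tensor_parallel_size,
    )
    return AsyncLLMEngine.from_engine_args(args)
```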
Error redaction on model load fallback
When every candidate fails to load, the RuntimeError message includes only type(last_error).__name__ + ": " + str(last_error) instead of repr(last_error) — repr of a FileNotFoundError would embed $HOME-prefixed checkpoint paths.
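A small sketch of the fallback loop with the redacted message; load_first_available and loader are hypothetical names, only the redaction pattern comes from the change.

```python
def load_first_available(candidate_paths, loader):
    # Illustrative fallback loop; `loader` is whatever actually loads a checkpoint.
    last_error = None
    for path in candidate_paths:
        try:
            return loader(path)
        except Exception as err:
            last_error = err
    # Surface only the class name and str() of the message, not repr(last_error), so
    # checkpoint paths embedded in the repr are not echoed back verbatim.
    detail = (
        f"{type(last_error).__name__}: {last_error}" if last_error else "no candidates given"
    )
    raise RuntimeError(f"All model load candidates failed: {detail}")
```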
See also
- [Training speed & memory](/docs/training-speed-memory) — original v0.28 doc
- [Quant menu](/docs/quant-menu) — original 9-format menu · [Quant Menu II](/docs/quant-menu-ii) — UD GGUF, NVFP4, BitNet, KV cache (v0.53)