# Training Speed & Memory
Soup v0.28.0 ships six production-grade throughput + memory features. All six are SFT-only in v0.28 — non-SFT trainers are rejected at config-load time, so there are no silent no-ops.
## Cut Cross-Entropy

```yaml
training:
  use_cut_ce: true
```

Cut CE avoids materialising the full `[seq_len × vocab_size]` logits tensor during the cross-entropy backward pass. The biggest win is on models with vocab ≥ 128k (Llama 3, Qwen 3). Architecture detection matches on the last path component (`model_name.rsplit("/", 1)[-1]`), so org prefixes like `deepseek-ai/...-phi-...` can't trigger the wrong patcher.
Install the optional dependency:

```shell
pip install 'soup-cli[cce]'
```

## FP8 training (Hopper+)

```yaml
training:
  quantization_aware: fp8
```

Configures float8 linear layers via torchao. The validator requires CUDA, a Hopper+ GPU (SM capability check), and the transformers backend. Pydantic accepts only `true` / `false` / `"fp8"`; unknown strings such as `"fp16"` are rejected at config load.
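The accepted-value rule can be sketched as a standalone coercion function. This is an assumption-level sketch of the behaviour described above, not the actual Pydantic model:

```python
# Sketch of the quantization_aware value check (illustrative only;
# the real validator is a Pydantic field on the training config).
def parse_quantization_aware(value):
    if isinstance(value, bool):
        return value
    if value == "fp8":
        return "fp8"
    raise ValueError(
        f"quantization_aware must be true, false, or 'fp8'; got {value!r}"
    )
```

The point of the strict whitelist is that a typo like `"fp16"` fails at config load rather than silently training in a different precision.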
## Tiered gradient checkpointing

```yaml
training:
  gradient_checkpointing: auto  # or: selective | medium | full
```

| Tier | Memory saved | Throughput cost |
|---|---|---|
| `selective` | Small | ~5% |
| `medium` | Medium | ~15% |
| `full` | Large | ~30% |
| `auto` | Auto-picked from detected VRAM | — |
`resolve_gradient_checkpointing` returns only HF-supported keys. Granularity is exposed via a separate helper so private markers never leak into `TrainingArguments`.
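The `auto` tier resolution can be sketched like this. The VRAM thresholds and the helper name `resolve_tier` are assumptions for illustration; the real heuristic may pick differently:

```python
# Sketch of auto-tier selection from detected VRAM.
# Thresholds are assumed, not soup-cli's actual cutoffs.
def resolve_tier(tier: str, vram_gb: float) -> str:
    if tier != "auto":
        return tier
    # More free VRAM lets us afford a cheaper (less aggressive) tier.
    if vram_gb >= 80:
        return "selective"
    if vram_gb >= 40:
        return "medium"
    return "full"
```

Keeping tier resolution in its own helper is what lets the public config hand `TrainingArguments` only keys HF understands, while the granularity marker stays internal.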
## Kernel auto-composition

```yaml
training:
  kernel_auto_compose: true
```

Enumerates Liger Kernel + FlashAttention combos, micro-benchmarks each, and picks the fastest. `pick_best_kernel` raises `ValueError` when every candidate is missing a finite `time_ms`, preventing silent promotion of an untimed combo when the benchmarking infrastructure fails. `NaN` / `None` times are treated as +inf.
## Cross-document attention masking

```yaml
training:
  packing: true
  packing_cross_doc_attn_mask: true
```

When sample packing is on, a block-diagonal causal mask prevents cross-document attention leakage without losing the packer's throughput win. The mask is built via a numpy `np.tril` slice, so it scales to the `DataConfig.max_length` bound of 1M tokens. A `@model_validator` errors out if the flag is enabled without `packing: true`, so there is no silent no-op.
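A minimal sketch of the block-diagonal causal mask, assuming a dense boolean mask built from per-document lengths (the real packer's representation and `np.tril` slicing may differ):

```python
import numpy as np

# Sketch: causal (lower-triangular) attention within each packed
# document, no attention across document boundaries.
def block_diag_causal_mask(doc_lens: list[int]) -> np.ndarray:
    total = sum(doc_lens)
    mask = np.zeros((total, total), dtype=bool)
    tril = np.tril(np.ones((max(doc_lens), max(doc_lens)), dtype=bool))
    start = 0
    for n in doc_lens:
        # Slice one tril block per document onto the diagonal.
        mask[start:start + n, start:start + n] = tril[:n, :n]
        start += n
    return mask
```

For `doc_lens=[2, 3]`, token 2 (the first token of the second document) cannot attend to tokens 0 or 1, which is exactly the leakage the flag closes.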
## Activation offloading

```yaml
training:
  activation_offloading: cpu  # or: disk
```

Saved-tensor hooks offload activations to pinned RAM (`cpu`) or an on-disk scratch dir (`disk`). The disk variant keeps the `mkstemp` file descriptor open until `torch.save` flushes, closing the TOCTOU window between `os.close(fd)` and `torch.save(path)`. `torch.load(weights_only=True)` prevents arbitrary Python deserialization on reload. Scratch dirs are containment-checked against the current working directory. The Unsloth and MLX backends are rejected.
## SFT-only guard (v0.28)

`SoupConfig` rejects `use_cut_ce`, `quantization_aware="fp8"`, `kernel_auto_compose`, and `activation_offloading` when `task != "sft"`. Multi-trainer wiring is planned for v0.28.1+.
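The guard can be sketched as a standalone check over a plain dict. Field names come from the docs above; the function shape is an assumption (the real guard lives on the `SoupConfig` Pydantic model):

```python
# Sketch of the SFT-only guard described above (illustrative; the
# real check is a SoupConfig validator, not a free function).
SFT_ONLY_FLAGS = ("use_cut_ce", "kernel_auto_compose", "activation_offloading")

def check_sft_only(task: str, cfg: dict) -> None:
    if task == "sft":
        return
    enabled = [f for f in SFT_ONLY_FLAGS if cfg.get(f)]
    if cfg.get("quantization_aware") == "fp8":
        enabled.append('quantization_aware="fp8"')
    if enabled:
        raise ValueError(
            f"{', '.join(enabled)} require task='sft'; got task={task!r}"
        )
```

Failing at config load is what guarantees the "no silent no-ops" promise: a DPO config with `use_cut_ce: true` errors immediately instead of quietly ignoring the flag.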
## See also
- [Multi-GPU mastery](/docs/multi-gpu) — ZeRO++ / FSDP2 / pipeline
- [Backends](/docs/backends) — transformers / unsloth / MLX