# Training Speed & Memory

Soup v0.28.0 ships six production-grade throughput and memory features. All six are SFT-only in v0.28: non-SFT trainers are rejected at config-load time, so there are no silent no-ops.

## Cut Cross-Entropy

```yaml
training:
  use_cut_ce: true
```

Cut CE avoids materialising the full `[seq_len × vocab_size]` logits tensor during the cross-entropy backward pass. The biggest win is on models with vocab ≥ 128k (Llama 3, Qwen 3). Architecture detection matches on the last path component (`model_name.rsplit("/", 1)[-1]`), so org prefixes like `deepseek-ai/...-phi-...` can't trigger the wrong patcher.

Install the optional dependency:

```bash
pip install 'soup-cli[cce]'
```
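The last-path-component rule can be sketched in a few lines. This is an illustrative assumption, not Soup's actual code: the `detect_architecture` name and the marker table are hypothetical, but the `rsplit("/", 1)[-1]` matching is as described above.

```python
def detect_architecture(model_name):
    """Match architecture markers against only the last path component,
    so an org prefix can never trigger the wrong kernel patcher.
    (Hypothetical sketch; the marker list is illustrative.)"""
    markers = {"llama": "llama", "qwen": "qwen", "phi": "phi"}
    base = model_name.rsplit("/", 1)[-1].lower()
    for marker, arch in markers.items():
        if marker in base:
            return arch
    return None

detect_architecture("meta-llama/Llama-3-8B")   # matches on "Llama-3-8B"
detect_architecture("some-org/mistral-7b")     # no marker: returns None
```

Because only the final component is inspected, whatever appears in the organization segment of the repo id is irrelevant to the match.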

## FP8 training (Hopper+)

```yaml
training:
  quantization_aware: fp8
```

Configures float8 linear layers via torchao. The validator requires CUDA, a Hopper-or-newer GPU (SM capability check), and the transformers backend. Pydantic accepts only `true` / `false` / `"fp8"`; unknown strings like `"fp16"` are rejected at config load.
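The accept-list behaviour can be shown as a plain function rather than Soup's actual Pydantic validator; `validate_quantization_aware` is a hypothetical name used only for this sketch:

```python
def validate_quantization_aware(value):
    """Sketch of the value validation described above: only True,
    False, or the exact string "fp8" pass; anything else fails at
    config load. (Illustrative, not Soup's actual validator.)"""
    if value in (True, False, "fp8"):
        return value
    raise ValueError(
        f'quantization_aware must be true, false, or "fp8"; got {value!r}'
    )
```

Rejecting unknown strings at load time means a typo like `fp16` fails immediately instead of silently training without quantization.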

## Tiered gradient checkpointing

```yaml
training:
  gradient_checkpointing: auto   # or: selective | medium | full
```

| Tier | Memory saved | Throughput cost |
|------|--------------|-----------------|
| `selective` | Small | ~5% |
| `medium` | Medium | ~15% |
| `full` | Large | ~30% |
| `auto` | Auto-picked from detected VRAM | |

`resolve_gradient_checkpointing` returns only HF-supported keys. Granularity is exposed via a separate helper so private markers never leak into `TrainingArguments`.
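A minimal sketch of that split, assuming hypothetical VRAM thresholds (the real tier-selection heuristics are not documented here):

```python
def resolve_gradient_checkpointing(tier, vram_gb=None):
    """Resolve a tier name to kwargs that HF TrainingArguments actually
    accepts. Thresholds below are illustrative assumptions."""
    if tier == "auto":
        if vram_gb is None or vram_gb < 24:
            tier = "full"        # small GPUs: save the most memory
        elif vram_gb < 48:
            tier = "medium"
        else:
            tier = "selective"
    # Only the HF-supported boolean crosses this boundary; the resolved
    # tier is returned separately so private granularity markers never
    # leak into TrainingArguments.
    return {"gradient_checkpointing": True}, tier
```

Keeping the granularity out of the returned kwargs is what guarantees `TrainingArguments` only ever sees keys it understands.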

## Kernel auto-composition

```yaml
training:
  kernel_auto_compose: true
```

Enumerates Liger Kernel + FlashAttention combos, micro-benchmarks each, and picks the fastest. `pick_best_kernel` raises `ValueError` when every candidate is missing a finite `time_ms`, which prevents silent promotion of an untimed combo when the benchmarking infrastructure fails. `NaN` / `None` times are treated as `+inf`.
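The selection rule can be sketched as follows; the candidate shape (dicts with a `time_ms` key) is an assumption for illustration:

```python
import math

def pick_best_kernel(candidates):
    """Pick the candidate with the lowest finite time_ms. NaN/None
    timings count as +inf, and if nothing timed finitely we raise
    rather than silently promote an untimed combo. (Illustrative
    sketch, not Soup's actual implementation.)"""
    def timing(c):
        t = c.get("time_ms")
        if t is None or (isinstance(t, float) and math.isnan(t)):
            return math.inf
        return t
    best = min(candidates, key=timing)
    if timing(best) == math.inf:
        raise ValueError("no kernel combo produced a finite time_ms")
    return best
```

Mapping bad timings to `+inf` keeps the comparison total (NaN would otherwise poison `min`), while the final finiteness check turns "everything failed to benchmark" into a loud error.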

## Cross-document attention masking

```yaml
training:
  packing: true
  packing_cross_doc_attn_mask: true
```

When sample packing is on, a block-diagonal causal mask prevents cross-document attention leakage without losing the packer's throughput win. The mask is built from a NumPy `np.tril` slice, so it scales to the `DataConfig.max_length` bound of 1M tokens. A `@model_validator` errors out if the flag is enabled without `packing: true`, so there is no silent no-op.
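The mask's shape is easy to see in a small sketch. This builds the block-diagonal causal mask one document block at a time (a simplification of the single `np.tril` slice mentioned above; the function name is hypothetical):

```python
import numpy as np

def block_diag_causal_mask(doc_lengths):
    """Build a [total, total] boolean mask for a packed sequence:
    each token attends only to earlier tokens in its own document.
    (Illustrative sketch of the masking behaviour described above.)"""
    total = sum(doc_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in doc_lengths:
        # Causal (lower-triangular) block on the diagonal; everything
        # off the block stays False, so documents never see each other.
        mask[start:start + n, start:start + n] = np.tril(
            np.ones((n, n), dtype=bool)
        )
        start += n
    return mask
```

For `doc_lengths=[2, 3]` the result is a 5×5 mask with two causal blocks and all cross-document entries False.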

## Activation offloading

```yaml
training:
  activation_offloading: cpu    # or: disk
```

Saved-tensor hooks offload activations to pinned RAM (`cpu`) or an on-disk scratch dir (`disk`). The disk variant keeps the `mkstemp` file descriptor open until `torch.save` flushes, closing the TOCTOU window between `os.close(fd)` and `torch.save(path)`. `torch.load(weights_only=True)` prevents arbitrary Python deserialization on reload. Scratch dirs are containment-checked against the current working directory. The Unsloth and MLX backends are rejected.
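The keep-the-descriptor-open pattern and the containment check can be sketched with the standard library alone; `save_to_scratch` is a hypothetical name, and `pickle` stands in here for `torch.save`:

```python
import os
import pickle
import tempfile

def save_to_scratch(obj, scratch_dir):
    """Sketch of the disk-offload write path described above
    (illustrative, not Soup's actual code)."""
    # Containment check: refuse scratch dirs outside the working dir.
    cwd = os.path.realpath(os.getcwd())
    scratch = os.path.realpath(scratch_dir)
    if os.path.commonpath([scratch, cwd]) != cwd:
        raise ValueError("scratch dir escapes the working directory")
    # Keep mkstemp's descriptor open and write through it, so the
    # bytes land in the exact file we created: there is no
    # close-then-reopen-by-path gap for an attacker to race.
    fd, path = tempfile.mkstemp(dir=scratch)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)   # stand-in for torch.save
    return path
```

Writing through `os.fdopen(fd, ...)` instead of `open(path, ...)` is what closes the TOCTOU window: the descriptor from `mkstemp` is bound to the inode, not the name.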

## SFT-only guard (v0.28)

`SoupConfig` rejects `use_cut_ce`, `quantization_aware="fp8"`, `kernel_auto_compose`, and `activation_offloading` when `task != "sft"`. Multi-trainer wiring is planned for v0.28.1+.
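The shape of that guard, as a standalone sketch (the `check_sft_only` helper and dict-based config are assumptions for illustration; Soup's actual guard lives on `SoupConfig`):

```python
SFT_ONLY_FIELDS = (
    "use_cut_ce",
    "quantization_aware",
    "kernel_auto_compose",
    "activation_offloading",
)

def check_sft_only(task, training):
    """Fail at config load if an SFT-only feature is enabled on a
    non-SFT task. (Hypothetical sketch of the guard described above.)"""
    if task == "sft":
        return
    for field in SFT_ONLY_FIELDS:
        value = training.get(field)
        if value not in (None, False):
            raise ValueError(f"{field} requires task: sft (got task={task!r})")
```

Rejecting at load time, rather than ignoring the flags, is what keeps the "no silent no-ops" promise from the top of this page.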

## See also

- [Multi-GPU mastery](/docs/multi-gpu) — ZeRO++ / FSDP2 / pipeline
- [Backends](/docs/backends) — transformers / unsloth / MLX