# Multi-GPU Mastery
Soup v0.27.0 turns multi-GPU training from a research paper into a CLI flag. Topology detection picks the right strategy, ZeRO++ lands quantized weights and grads, FSDP2 can opt into torch.compile, and pipeline parallelism is scaffolded for models that don't fit a single node.
## One-flag launch

```bash
soup train --gpus auto
soup train --gpus 8
```

`--gpus` inspects the GPU count and interconnect (NVLink vs PCIe), then recommends the strategy for your config — DeepSpeed ZeRO-3, ZeRO++, FSDP2, or pipeline. If the topology doesn't match what your config asked for, Soup prints the corrected `accelerate launch` command you can paste.
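The topology-to-strategy mapping can be pictured as a small decision function. This is an illustrative sketch only — the function name and heuristics below are hypothetical, not Soup's actual implementation:

```python
# Hypothetical sketch of the choice behind `--gpus auto`.
# Soup's real heuristics may weigh more signals than these.
def recommend_strategy(gpu_count: int, interconnect: str,
                       model_fits_one_node: bool = True) -> str:
    """Map a detected topology to a training-strategy name."""
    if gpu_count <= 1:
        return "single"
    if not model_fits_one_node:
        return "pipeline"    # model too large for one node
    if interconnect == "nvlink":
        return "fsdp2"       # fast all-gather: full sharding is cheap
    return "zero++"          # PCIe: quantized comms relieve the bottleneck

print(recommend_strategy(8, "nvlink"))   # fsdp2
print(recommend_strategy(8, "pcie"))     # zero++
```

The point of the sketch is the ordering: a model that can't fit a node forces pipeline regardless of interconnect, and only then does link speed pick between FSDP2 and ZeRO++.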
## ZeRO++

```yaml
base: meta-llama/Llama-3.1-70B
task: sft
training:
  quantization: 4bit
  lora:
    r: 16
    alpha: 32
```

```bash
soup train --gpus 8 --deepspeed zero++
# aliases: zero_pp
```

ZeRO++ cuts inter-GPU communication by quantizing the broadcast weights and gradients. Integer fields (`sub_group_size`, `stage3_max_live_parameters`, `stage3_max_reuse_distance`) are written as `int(1e9)` so DeepSpeed's strict JSON validator accepts them.
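The `int(1e9)` detail matters because the Python literal `1e9` is a float, and `json.dumps` renders it as `1000000000.0`, which a schema expecting an integer rejects. A minimal demonstration (the config keys are the ones named above; the surrounding dict shape is illustrative):

```python
import json

# 1e9 is a float literal; a strict integer schema rejects "1000000000.0".
# Wrapping in int() makes the serialized value a bare integer.
zero_config = {
    "zero_optimization": {
        "stage": 3,
        "sub_group_size": int(1e9),
        "stage3_max_live_parameters": int(1e9),
        "stage3_max_reuse_distance": int(1e9),
    }
}

print(json.dumps(int(1e9)))  # 1000000000
print(json.dumps(1e9))       # 1000000000.0  <- what a strict validator rejects
```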
## FSDP2 + torch.compile

```yaml
training:
  use_fsdp2_compile: true
```

```bash
soup train --gpus 4 --fsdp full_shard
```

The validator requires an FSDP preset, CUDA, `backend=transformers`, and `torch>=2.2` + `accelerate>=0.27`. DeepSpeed + torch.compile is explicitly rejected — the two own their own compile paths, and mixing them produces a cryptic runtime crash.
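The gate above can be expressed as one predicate. The requirement list comes from this page; the function itself is a sketch, not Soup's code:

```python
# Illustrative version of the FSDP2 + torch.compile gate.
# Requirements per the docs: FSDP preset, CUDA, backend=transformers,
# torch>=2.2, accelerate>=0.27, and no DeepSpeed in the mix.
def fsdp2_compile_allowed(backend: str, cuda: bool,
                          torch_version: tuple, accelerate_version: tuple,
                          fsdp_preset, deepspeed: bool) -> bool:
    if deepspeed:
        return False  # DeepSpeed owns its own compile path; reject outright
    return (backend == "transformers"
            and cuda
            and fsdp_preset is not None
            and torch_version >= (2, 2)
            and accelerate_version >= (0, 27))

print(fsdp2_compile_allowed("transformers", True, (2, 3), (0, 28),
                            "full_shard", False))  # True
```

Comparing versions as tuples keeps the sketch dependency-free; real code would parse the installed `torch.__version__` and `accelerate.__version__` strings first.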
## Pipeline parallelism

```yaml
training:
  parallelism: pipeline
  pipeline_stages: 4
```

The validator requires `pipeline_stages >= 2`, CUDA, and `gpu_count >= pipeline_stages`. `pipeline_stages` is bounded to `[1, 16]`.
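Those checks compose into a simple validation pass. A sketch under the constraints stated above (error messages are made up, not Soup's actual output):

```python
# Illustrative pipeline-parallelism validation: collect every violation
# rather than stopping at the first, so the user sees the full picture.
def validate_pipeline(pipeline_stages: int, gpu_count: int, cuda: bool) -> list:
    errors = []
    if not 1 <= pipeline_stages <= 16:
        errors.append("pipeline_stages must be in [1, 16]")
    if pipeline_stages < 2:
        errors.append("pipeline parallelism needs pipeline_stages >= 2")
    if not cuda:
        errors.append("pipeline parallelism requires CUDA")
    if gpu_count < pipeline_stages:
        errors.append("gpu_count must be >= pipeline_stages")
    return errors

print(validate_pipeline(4, 8, True))   # [] -> config is valid
print(validate_pipeline(4, 2, True))   # gpu_count is too small
```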
## DeepSpeed-MII backend

```bash
soup serve --backend mii --model ./output
```

Registered in v0.27.0, live in v0.27.1. A misconfigured `--backend mii` always exits non-zero (never silently accepts).
## Multi-GPU recipes

- `llama3-70b-fsdp2` — 70B SFT on 8×A100 via FSDP2 + compile
- `qwen3-32b-zeropp` — Qwen 3 32B SFT via DeepSpeed ZeRO++
- `deepseek-v3-pipeline` — DeepSeek V3 SFT via pipeline parallelism

```bash
soup recipes use llama3-70b-fsdp2
soup train --gpus 8
```

## Security & guardrails
- `--gpus` rejects `bool`, negatives, zero, non-digits, and values above `MAX_GPU_COUNT=128`
- `--gpus auto` on a CPU-only host prints an explicit yellow warning instead of silently single-processing
- The `accelerate launch` argv is built via `shlex.quote`, so copy-pasted commands can't inject via a crafted config path
- NCCL env hints are applied via `os.environ.setdefault` — user / launcher overrides always win
- Rich markup in `--config` paths is escaped before embedding in advice panels
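Two of those guardrails are easy to demonstrate in isolation. The command layout and the NCCL variable below are hypothetical examples; only the `shlex.quote` and `setdefault` mechanics come from the list above (real code would call `os.environ.setdefault` directly):

```python
import shlex

# Guardrail 1: quote every argv element, so a crafted config path
# survives copy-paste as a single argument instead of shell syntax.
def build_launch_command(config_path: str, num_processes: int) -> str:
    argv = ["accelerate", "launch",
            "--num_processes", str(num_processes),
            "train.py", "--config", config_path]   # script name illustrative
    return " ".join(shlex.quote(a) for a in argv)

# Guardrail 2: setdefault semantics; an existing user/launcher value wins.
def apply_nccl_hints(env: dict) -> None:
    env.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")  # example hint only

cmd = build_launch_command("my config; rm -rf ~.yaml", 8)
print(cmd)  # the malicious path comes out single-quoted, inert

env = {"NCCL_ASYNC_ERROR_HANDLING": "0"}
apply_nccl_hints(env)
print(env["NCCL_ASYNC_ERROR_HANDLING"])  # still "0": the override won
```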
## See also
- [Training speed & memory](/docs/training-speed-memory) — Cut CE, FP8, activation offloading
- [Backends](/docs/backends) — transformers / unsloth / MLX