# Multi-GPU Mastery

Soup v0.27.0 turns multi-GPU training from a research paper into a CLI flag. Topology detection picks the right strategy, ZeRO++ lands quantized weights and grads, FSDP2 can opt into `torch.compile`, and pipeline parallelism is scaffolded for models that don't fit on a single node.

## One-flag launch

```bash
soup train --gpus auto
soup train --gpus 8
```

`--gpus` inspects the GPU count and interconnect (NVLink vs. PCIe), then recommends a strategy for your config: DeepSpeed ZeRO-3, ZeRO++, FSDP2, or pipeline. If the topology doesn't match what your config asked for, Soup prints a corrected `accelerate launch` command you can paste.
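The selection step can be sketched as a small decision function. This is a hypothetical illustration of the heuristic, not Soup's actual code; the function name, the `interconnect` strings, and the `model_fits_one_node` flag are all assumptions:

```python
def recommend_strategy(gpu_count: int, interconnect: str,
                       model_fits_one_node: bool = True) -> str:
    """Pick a distributed-training strategy from detected topology.

    Hypothetical sketch of what `--gpus auto` might decide; the real
    heuristics also weigh the model config and available memory.
    """
    if gpu_count <= 1:
        return "single"      # nothing to shard
    if not model_fits_one_node:
        return "pipeline"    # model spans nodes -> pipeline stages
    if interconnect == "pcie":
        return "zero++"      # quantized comms help on slow links
    return "fsdp2"           # fast NVLink fabric -> plain sharding
```

The design point is that interconnect bandwidth, not just GPU count, drives the choice: communication-compressing strategies pay off exactly when the links are slow.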

## ZeRO++

```yaml
base: meta-llama/Llama-3.1-70B
task: sft

training:
  quantization: 4bit
  lora:
    r: 16
    alpha: 32
```

```bash
soup train --gpus 8 --deepspeed zero++
# aliases: zero_pp
```

ZeRO++ cuts inter-GPU communication by quantizing the broadcast weights and gradients. Integer fields (`sub_group_size`, `stage3_max_live_parameters`, `stage3_max_reuse_distance`) are written as `int(1e9)` so DeepSpeed's strict JSON validator accepts them.
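A minimal sketch of the emitted config fragment. The field names come from DeepSpeed's ZeRO stage-3 / ZeRO++ config schema; the surrounding structure and the helper name are assumptions for illustration:

```python
import json

def zero_pp_config() -> dict:
    # Sketch of a ZeRO++ DeepSpeed config fragment. The size fields are
    # real Python ints, not floats like 1e9, so they serialize without an
    # exponent and pass DeepSpeed's strict JSON validation.
    return {
        "zero_optimization": {
            "stage": 3,
            "sub_group_size": int(1e9),
            "stage3_max_live_parameters": int(1e9),
            "stage3_max_reuse_distance": int(1e9),
            "zero_quantized_weights": True,    # ZeRO++ quantized all-gather
            "zero_quantized_gradients": True,  # ZeRO++ quantized reduce
        },
    }

print(json.dumps(zero_pp_config(), indent=2))
```

Had the values been written as the float `1e9`, `json.dumps` would emit `1000000000.0`, which DeepSpeed rejects for integer-typed fields.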

## FSDP2 + torch.compile

```yaml
training:
  use_fsdp2_compile: true
```

```bash
soup train --gpus 4 --fsdp full_shard
```

The validator requires an FSDP preset, CUDA, `backend: transformers`, and `torch>=2.2` plus `accelerate>=0.27`. DeepSpeed + `torch.compile` is explicitly rejected: each owns its own compile path, and mixing them produces a cryptic runtime crash.
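The gate described above can be sketched as a single check. Function and parameter names are hypothetical; only the listed requirements come from the docs:

```python
def fsdp2_compile_allowed(fsdp_preset, cuda_available, backend,
                          torch_version, accelerate_version, use_deepspeed):
    """Sketch of the use_fsdp2_compile validator (hypothetical names)."""
    def parse(v: str):
        # Compare (major, minor) tuples so "2.10" correctly beats "2.2"
        return tuple(int(p) for p in v.split(".")[:2])

    if use_deepspeed:
        return False, "DeepSpeed owns its own compile path; do not combine"
    if fsdp_preset is None or not cuda_available or backend != "transformers":
        return False, "requires an FSDP preset, CUDA, and backend=transformers"
    if parse(torch_version) < (2, 2) or parse(accelerate_version) < (0, 27):
        return False, "requires torch>=2.2 and accelerate>=0.27"
    return True, "ok"
```

Note the tuple comparison: naive string comparison would rank `"2.10"` below `"2.2"`, a classic version-check bug.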

## Pipeline parallelism

```yaml
training:
  parallelism: pipeline
  pipeline_stages: 4
```

The validator requires CUDA, `pipeline_stages >= 2`, and `gpu_count >= pipeline_stages`; `pipeline_stages` is capped at 16, so the effective range is [2, 16].
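These checks can be sketched as one validation function (the function name is hypothetical; the constraints are the ones listed above):

```python
def validate_pipeline(pipeline_stages: int, gpu_count: int,
                      cuda_available: bool) -> None:
    """Sketch of the pipeline-parallelism config checks."""
    if not cuda_available:
        raise ValueError("pipeline parallelism requires CUDA")
    if not 2 <= pipeline_stages <= 16:
        raise ValueError("pipeline_stages must be in [2, 16]")
    if gpu_count < pipeline_stages:
        raise ValueError(
            f"need at least {pipeline_stages} GPUs, got {gpu_count}"
        )
```

The `gpu_count >= pipeline_stages` check matters because each stage must map onto at least one GPU; fewer GPUs than stages cannot be scheduled.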

## DeepSpeed-MII backend

```bash
soup serve --backend mii --model ./output
```

Registered in v0.27.0, live in v0.27.1. A misconfigured `--backend mii` always exits non-zero; it never silently accepts a bad configuration.
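The "never silently accepts" contract can be sketched like this. The backend set and function name are assumptions for illustration, not Soup's real API:

```python
import sys

# Assumed backend set for illustration only
SUPPORTED_BACKENDS = {"mii", "transformers"}

def serve_exit_code(backend: str, mii_installed: bool) -> int:
    """Sketch: a bad --backend mii setup exits non-zero, never falls through."""
    if backend not in SUPPORTED_BACKENDS:
        print(f"unknown backend: {backend}", file=sys.stderr)
        return 2
    if backend == "mii" and not mii_installed:
        print("deepspeed-mii is not installed", file=sys.stderr)
        return 2
    return 0
```

The point of returning a distinct non-zero code is that CI pipelines and launch scripts fail fast instead of serving with an unintended fallback backend.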

## Multi-GPU recipes

- `llama3-70b-fsdp2`: 70B SFT on 8×A100 via FSDP2 + compile
- `qwen3-32b-zeropp`: Qwen 3 32B SFT via DeepSpeed ZeRO++
- `deepseek-v3-pipeline`: DeepSeek V3 SFT via pipeline parallelism
```bash
soup recipes use llama3-70b-fsdp2
soup train --gpus 8
```

## Security & guardrails

- `--gpus` rejects bools, negatives, zero, non-digit strings, and values above `MAX_GPU_COUNT=128`
- `--gpus auto` on a CPU-only host prints an explicit yellow warning instead of silently falling back to a single process
- The `accelerate launch` argv is built via `shlex.quote`, so copy-pasted commands can't inject via a crafted config path
- NCCL env hints are applied via `os.environ.setdefault`; user and launcher overrides always win
- Rich markup in `--config` paths is escaped before embedding in advice panels
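The first, third, and fourth guardrails can be sketched together. Helper names are hypothetical; `shlex.quote` and `os.environ.setdefault` are the real stdlib calls named above, and the NCCL variable shown is an assumed example of a hint:

```python
import os
import shlex

MAX_GPU_COUNT = 128

def parse_gpus(value) -> int:
    """Sketch of the --gpus validation rules (hypothetical helper)."""
    if isinstance(value, bool):          # bool is an int subclass; catch it first
        raise ValueError("--gpus must be an integer, not a bool")
    if isinstance(value, str):
        if not value.isdigit():          # rejects "-1", "1.5", "eight"
            raise ValueError("--gpus must be a positive integer")
        value = int(value)
    if not 1 <= value <= MAX_GPU_COUNT:
        raise ValueError(f"--gpus must be in [1, {MAX_GPU_COUNT}]")
    return value

def build_launch_cmd(config_path: str, gpus: int) -> str:
    # shlex.quote keeps a crafted path like "cfg.yaml; rm -rf ~" inert
    return (f"accelerate launch --num_processes {gpus} "
            f"--config_file {shlex.quote(config_path)}")

# setdefault never clobbers a value the user or launcher already exported
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
```

The `isinstance(value, bool)` check must come before any int handling because `True` would otherwise sail through as `1`.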

## See also

- [Training speed & memory](/docs/training-speed-memory): Cut CE, FP8, activation offloading
- [Backends](/docs/backends): transformers / unsloth / MLX