# Multi-GPU training with DeepSpeed ZeRO-3

Use Soup CLI with DeepSpeed ZeRO-3 to train large models across multiple GPUs. This guide shows how to fine-tune a 70B model across 4–8 GPUs.

## Install

```bash
pip install 'soup-cli[deepspeed]'
```

## When to use which ZeRO stage

| Stage | What it shards | When to use |
| --- | --- | --- |
| ZeRO-2 | Optimizer states + gradients | 2–4 GPUs, 7B–13B models |
| ZeRO-3 | Everything incl. parameters | 4+ GPUs, 30B+ models |
| FSDP2 | Fully sharded (PyTorch native) | Alternative to ZeRO-3 |
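For example, a 13B run on 2–4 GPUs can usually stay on ZeRO-2 and skip parameter sharding entirely. A hedged sketch (keys mirror the ZeRO-3 config in this guide):

```yaml
training:
  distributed:
    strategy: deepspeed
    zero_stage: 2   # shard optimizer states + gradients; keep full params per GPU
```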

## Config for Llama 3.1 70B on 8× A100

```yaml
base:
  model: meta-llama/Meta-Llama-3.1-70B-Instruct

task: sft

data:
  train: train.json
  format: alpaca

training:
  backend: transformers
  epochs: 2
  learning_rate: 1.0e-4
  batch_size: 1
  gradient_accumulation_steps: 16
  max_seq_length: 4096
  gradient_checkpointing: true
  bf16: true
  distributed:
    strategy: deepspeed
    zero_stage: 3
    offload_optimizer: cpu
    offload_params: cpu
  lora:
    enabled: true
    r: 32
    alpha: 64
```
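Note that the effective global batch size is the per-device batch multiplied by the accumulation steps and the GPU count, so the small `batch_size: 1` above is less limiting than it looks:

```python
# Effective global batch size for the config above:
# per-device batch * gradient accumulation steps * number of GPUs.
per_device_batch = 1
grad_accum_steps = 16
num_gpus = 8

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 128
```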

## Launch on 8 GPUs

```bash
soup train --config llama70b.yaml --gpus 8
```

Soup configures the `torchrun` / `deepspeed` launcher automatically.

## Alternative: FSDP2

```yaml
training:
  distributed:
    strategy: fsdp2
    sharding: full
```

FSDP2 is PyTorch-native and often simpler for LoRA workloads.
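A hedged sketch of an FSDP2 run that keeps the LoRA settings from the ZeRO-3 config in this guide (assumes the two `training` keys combine as shown):

```yaml
training:
  distributed:
    strategy: fsdp2
    sharding: full
  lora:
    enabled: true
    r: 32
    alpha: 64
```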

## Ring FlashAttention for 128k+ context

For very long sequences:

```bash
pip install 'soup-cli[ring-attn]'
```

```yaml
training:
  attention: ring
  max_seq_length: 131072
```
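Ring attention shards the sequence dimension across ranks, so each GPU holds only a chunk of the full context. A rough sketch, assuming an even split across 8 GPUs:

```python
# Each rank processes a contiguous chunk of the sequence;
# KV blocks rotate around the ring so every rank still attends globally.
max_seq_length = 131072
world_size = 8

tokens_per_gpu = max_seq_length // world_size
print(tokens_per_gpu)  # 16384
```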

## Tips

- Always enable `gradient_checkpointing: true` for 70B+ models
- CPU offload trades speed for VRAM; use it only if you hit OOM
- Profile first: `soup profile --config llama70b.yaml --gpus 8` estimates memory and throughput before you spend GPU hours
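Before profiling, a back-of-envelope estimate helps sanity-check whether a config can fit at all. A minimal sketch for the ZeRO-3 weight shards only; it ignores activations, gradients, CPU-offloaded optimizer state, and framework overhead, so treat `soup profile` as authoritative:

```python
# Per-GPU memory for ZeRO-3 sharded bf16 weights of a ~70B model on 8 GPUs.
params = 70e9          # ~70B parameters
bytes_per_param = 2    # bf16
world_size = 8

weight_gib = params * bytes_per_param / world_size / 2**30
print(round(weight_gib, 1))  # ~16.3 GiB of sharded weights per GPU
```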

## Related

- [Backends reference](/docs/backends)
- [Training methods](/docs/training)