Autopilot

`soup autopilot` is the zero-config entry point added in v0.25.0. You pass a model, a dataset, and a goal; Autopilot profiles the dataset, the model, and your GPU, then emits a `soup.yaml` with every hyperparameter chosen and justified.

Quick start

```bash
soup autopilot \
  --model meta-llama/Llama-3.1-8B \
  --data chats.jsonl \
  --goal chat \
  --gpu-budget 24GB \
  --time-budget 4h
```

Inputs

| Flag | Meaning |
| --- | --- |
| `--model` | Any HuggingFace repo or local path |
| `--data` | JSONL dataset (alpaca / sharegpt / chatml / dpo / kto / tool-calling) |
| `--goal` | chat · reasoning · code · classification · tool-calling · alignment · domain-adapt |
| `--gpu-budget` | VRAM budget (defaults to detected GPU) |
| `--time-budget` | Maximum wall-clock time |
| `--output` | Config path (default: `./soup.yaml`) |
| `--dry-run` | Print decisions without writing `soup.yaml` |
| `--run` | Run `soup train` immediately after confirmation |

What Autopilot decides

1. Task — maps goal → SFT / DPO / GRPO / KTO / pretrain, including the tool-calling format where appropriate.
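
The goal-to-method mapping above can be sketched as a small lookup. This is an illustrative guess at the logic, not Autopilot's actual code; the `GOAL_TO_TASK` table, `pick_task` name, and the rule that a preference-format dataset (dpo/kto) overrides the goal are all assumptions.

```python
# Hypothetical sketch of the goal -> training-method mapping.
GOAL_TO_TASK = {
    "chat": "sft",
    "reasoning": "grpo",
    "code": "sft",
    "classification": "sft",
    "tool-calling": "sft",   # plus the tool-calling chat format
    "alignment": "dpo",      # or kto, depending on dataset format
    "domain-adapt": "pretrain",
}

def pick_task(goal: str, data_format: str) -> str:
    """Map a --goal value to a training method. A preference-style
    dataset format (dpo/kto) is assumed to take precedence."""
    if data_format in ("dpo", "kto"):
        return data_format
    return GOAL_TO_TASK[goal]
```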

2. Quantization — chooses none, 8bit, or 4bit based on model_size × 1.2 vs VRAM.
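
A minimal sketch of this rule, assuming the comparison is fp16 weight footprint × 1.2 against the VRAM budget, with 8bit halving and 4bit quartering the weight memory. The exact thresholds and overhead factors Autopilot uses are not documented here.

```python
def pick_quantization(model_size_gb: float, vram_gb: float) -> str:
    """Illustrative quantization choice: compare the fp16 model
    footprint (with an assumed 1.2x overhead) against VRAM."""
    if model_size_gb * 1.2 <= vram_gb:
        return "none"
    if model_size_gb * 1.2 / 2 <= vram_gb:   # 8bit halves weight memory
        return "8bit"
    return "4bit"
```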

3. PEFT — picks LoRA r=8/16/32, DoRA, or VeRA based on dataset size and VRAM headroom.
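
The rank ladder can be sketched like this. The sample-count cutoffs and the 2GB headroom floor are assumptions; the DoRA/VeRA branch of the real decision is omitted.

```python
def pick_lora_rank(n_samples: int, vram_headroom_gb: float) -> int:
    """Hypothetical rank heuristic: larger datasets support larger
    ranks; tight VRAM headroom caps the rank at 8."""
    if vram_headroom_gb < 2 or n_samples < 5_000:
        return 8
    if n_samples < 50_000:
        return 16
    return 32
```

With the assumed cutoffs, the 15k-sample dataset from the example output lands on r=16.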

4. Batch size × grad_accum — targets effective batch 16–32, computed from VRAM headroom.
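
Given the largest per-device batch that fits in VRAM, the accumulation factor follows directly. A sketch (function name and the fixed target of 32 are assumptions):

```python
def pick_batch(per_device_max: int, target_effective: int = 32) -> tuple[int, int]:
    """Pick gradient accumulation so batch x grad_accum hits the
    effective-batch target (assumed 32, within the 16-32 range)."""
    batch = max(1, per_device_max)
    grad_accum = max(1, target_effective // batch)
    return batch, grad_accum
```

For a per-device maximum of 4 this yields 4 × 8 = 32, matching the example output.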

5. Learning rate — scales with rank and quantization (e.g. LoRA r=16 → 2e-4, 4bit → ×0.8).
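
Applying the stated scaling literally gives something like the sketch below. Only r=16 → 2e-4 and the 4bit ×0.8 damping come from the text; the base rates for r=8 and r=32 are illustrative guesses.

```python
def pick_lr(rank: int, quant: str) -> float:
    """Base LR per rank (r=16 -> 2e-4 per the docs; other ranks
    assumed), damped by 0.8 when training in 4bit."""
    base = {8: 3e-4, 16: 2e-4, 32: 1e-4}[rank]
    return base * 0.8 if quant == "4bit" else base
```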

6. Epochs — 5 → 3 → 2 → 1 depending on dataset size.
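
The 5 → 3 → 2 → 1 ladder, with hypothetical size cutoffs (the real thresholds are not documented here):

```python
def pick_epochs(n_samples: int) -> int:
    """Assumed cutoffs for the epoch ladder: more data, fewer epochs."""
    if n_samples < 1_000:
        return 5
    if n_samples < 10_000:
        return 3
    if n_samples < 100_000:
        return 2
    return 1
```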

7. `max_length` — ceil(p95 × 1.1) clamped to model context.
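
A sketch of this computation. Note the example output maps p95=1820 to 2048, which suggests an additional round-up to a convenient multiple (256 is assumed here, and reconciles 2002 → 2048); that rounding step is an inference, not documented behavior.

```python
import math

def pick_max_length(p95_tokens: int, model_context: int, multiple: int = 256) -> int:
    """ceil(p95 * 1.1), rounded up to an assumed multiple of 256,
    clamped to the model's context window."""
    raw = math.ceil(p95_tokens * 1.1)
    rounded = math.ceil(raw / multiple) * multiple
    return min(rounded, model_context)
```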

8. Perf flags — auto-enables FlashAttention v2 on Ampere or newer GPUs, the Liger Kernel on modern Llama architectures, gradient_checkpointing for long contexts, and the MLX backend on Apple Silicon.
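
Those four switches could be gathered as below. The architecture sets, the "modern Llama" check, and the 4096-token threshold for gradient checkpointing are all guesses; only the flag names track the text above.

```python
def pick_perf_flags(gpu_arch: str, model_arch: str, max_length: int,
                    apple_silicon: bool) -> dict:
    """Illustrative perf-flag selection under assumed detection rules."""
    AMPERE_PLUS = {"ampere", "ada", "hopper", "blackwell"}  # assumed set
    return {
        "flash_attention_2": gpu_arch in AMPERE_PLUS,
        "liger_kernel": model_arch.startswith("llama"),      # simplified check
        "gradient_checkpointing": max_length >= 4096,        # assumed cutoff
        "backend": "mlx" if apple_silicon else "torch",
    }
```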

9. Training intelligence — turns on forgetting detection and checkpoint intelligence by default.

Every decision is printed with a short reason in a Rich panel, so you can see *why* r=16 beat r=32 or why quantization dropped to 4bit.

Example output

```text
╭─ Autopilot Decisions ─────────────────────────────────╮
│ ✓ Quantization: 4bit                                  │
│   reason: 8B model needs ~5GB in 4bit, leaves 19GB    │
│                                                       │
│ ✓ PEFT: LoRA r=16, alpha=32                           │
│   reason: 15k samples — r=16 balances capacity /      │
│           overfitting risk                            │
│                                                       │
│ ✓ Batch size: 4 × grad_accum 8 = effective 32         │
│ ✓ Learning rate: 2e-4                                 │
│ ✓ Epochs: 2                                           │
│ ✓ Max length: 2048 (p95=1820 + 10% margin)            │
│ ✓ Flash Attention v2 (Ampere GPU)                     │
│ ✓ Liger Kernel (modern Llama arch)                    │
│ ✓ Forgetting detection (mini_mmlu)                    │
│ ✓ Checkpoint intelligence (judge metric)              │
│                                                       │
│ Estimated time: 1h 42min                              │
│ Estimated VRAM: 18.2GB / 24GB ✓                       │
╰───────────────────────────────────────────────────────╯
```

Safety

  • Dataset and output paths are resolved and constrained to the working directory (no path traversal).
  • GPU budget is bounded to 1GB–1TB; time budget to 60s–30 days.
  • Model names are validated against HuggingFace Hub naming rules.
  • Model code is never executed during analysis — Autopilot uses HF Hub metadata only.
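
The first safeguard, path containment, can be sketched with `pathlib`. The function name and error message are illustrative; only the behavior (resolve, then reject anything outside the working directory) comes from the text above.

```python
from pathlib import Path

def resolve_in_workdir(user_path: str, workdir: str = ".") -> Path:
    """Resolve a user-supplied path and reject anything that escapes
    the working directory (path traversal)."""
    base = Path(workdir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes working directory: {user_path}")
    return candidate
```

`Path.is_relative_to` requires Python 3.9+; on older versions the same check can be done by comparing `candidate` against `base` via `os.path.commonpath`.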

See also

  • [Training methods](/docs/training)
  • [Backends](/docs/backends) — including MLX
  • [Training intelligence](/docs/training-intelligence) — forgetting detection + checkpoint quality
  • [Recipes](/docs/recipes) — start here if you don't need Autopilot