Fine-tune Gemma 3 with QLoRA (single GPU)

QLoRA combines a 4-bit-quantized base model with trainable LoRA adapters, making it possible to fine-tune Gemma 3 12B on a single 16GB GPU (e.g., RTX 4080 or RTX A4000).

Why QLoRA?

  • 4× memory reduction vs full LoRA
  • Quality close to full 16-bit fine-tuning on the benchmarks reported in the QLoRA paper
  • Works on consumer hardware
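Most of the saving comes from holding the frozen base weights in 4 bits instead of 16. A rough back-of-the-envelope for the weights alone (illustrative arithmetic, not measured numbers):

```python
def base_weight_gib(n_params_b: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GiB.

    Ignores activations, optimizer state, and quantization overhead.
    """
    return n_params_b * 1e9 * bits / 8 / 2**30

print(round(base_weight_gib(12, 16), 1))  # fp16 Gemma 3 12B: ~22.4 GiB
print(round(base_weight_gib(12, 4), 1))   # 4-bit NF4 base:   ~5.6 GiB
```

The LoRA adapters and optimizer state sit on top of this, which is why the observed peak is higher than the weights-only figure.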

1. Install

```bash
pip install 'soup-cli[fast]'
```

2. Config

```yaml
base:
  model: google/gemma-3-12b-it

task: sft

data:
  train: train.json
  format: alpaca

training:
  backend: unsloth
  quant: 4bit
  epochs: 3
  learning_rate: 2.0e-4
  batch_size: 1
  gradient_accumulation_steps: 16
  max_seq_length: 2048
  lora:
    enabled: true
    r: 16
    alpha: 16
    use_rslora: true
    target_modules: [q_proj, k_proj, v_proj, o_proj]
```
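The `format: alpaca` setting expects instruction-tuning records with `instruction`, optional `input`, and `output` fields, following the common Alpaca convention; `train.json` is a list of such records. A minimal sketch (the record content here is made up for illustration):

```python
import json

# One record in the Alpaca instruction format; "input" may be an empty string
# for instructions that need no additional context.
record = {
    "instruction": "Summarize the paragraph in one sentence.",
    "input": "QLoRA combines 4-bit base quantization with LoRA adapters.",
    "output": "QLoRA fine-tunes a quantized model through small adapters.",
}
print(json.dumps([record], indent=2))  # train.json holds a list of records
```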

Note the key flags:

  • quant: 4bit — 4-bit NF4 quantization of base model
  • use_rslora: true — rank-stabilized LoRA (v0.21.0+), better for larger models
  • batch_size: 1 with gradient_accumulation_steps: 16 — effective batch of 16 on tight VRAM
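With `r: 16` and `alpha: 16` above, rank-stabilized LoRA changes the adapter scaling from α/r to α/√r, so larger ranks are not down-weighted as strongly. Illustrative arithmetic (not the trainer's internals):

```python
import math

def lora_scale(alpha: float, r: int, rslora: bool = False) -> float:
    # Standard LoRA scales the low-rank update BA by alpha/r;
    # rank-stabilized LoRA (rsLoRA) uses alpha/sqrt(r) instead.
    return alpha / math.sqrt(r) if rslora else alpha / r

print(lora_scale(16, 16))               # 1.0 (standard LoRA)
print(lora_scale(16, 16, rslora=True))  # 4.0 (rsLoRA)

# Effective batch from the config: 1 per step * 16 accumulation steps = 16
effective_batch = 1 * 16
```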

3. Train

```bash
soup train --config gemma3.yaml
```

Monitor VRAM with nvidia-smi in another terminal. You should see ~14GB peak on Gemma 3 12B.

4. Merge and export

```bash
# Dequantize, merge LoRA, save full model
soup export --adapter ./runs/gemma3/latest --format hf --output ./gemma3-merged

# Or export directly to GGUF q4_k_m
soup export --adapter ./runs/gemma3/latest --format gguf --quant q4_k_m
```
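Conceptually, the merge step folds the adapter's low-rank update into each targeted weight matrix: W' = W + s·BA. A NumPy sketch of the standard LoRA merge (tiny dimensions for illustration, not soup's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # toy sizes; real layers are far larger
W = rng.normal(size=(d, d))      # dequantized base weight
A = rng.normal(size=(r, d))      # LoRA down-projection
B = rng.normal(size=(d, r))      # LoRA up-projection
scale = 16 / 2                   # alpha / r (alpha / sqrt(r) with rsLoRA)

W_merged = W + scale * (B @ A)   # fold the adapter into the base weight

# The merged matrix reproduces base-plus-adapter outputs exactly:
x = rng.normal(size=d)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)))
```

After merging, the adapter is no longer needed at inference time, which is why the exported model can be quantized again (e.g., to GGUF q4_k_m) as a single artifact.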

Common issues

OOM during backward pass? Reduce max_seq_length to 1024 or enable gradient checkpointing:

```yaml
training:
  gradient_checkpointing: true
```
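Gradient checkpointing trades compute for memory: instead of storing every layer's activation for the backward pass, only the block input is kept and activations are recomputed when needed. A toy sketch of the bookkeeping (not the trainer's implementation):

```python
def forward_layers(x, layers, checkpoint=False):
    saved = [x]  # the block input is always kept
    for f in layers:
        x = f(x)
        if not checkpoint:
            saved.append(x)  # without checkpointing: O(n_layers) activations
    # with checkpoint=True, backward would re-run the layers to rebuild these
    return x, saved

layers = [lambda v: v * 2, lambda v: v + 3, lambda v: v ** 2]
print(forward_layers(1, layers)[1])                   # stores [1, 2, 5, 25]
print(forward_layers(1, layers, checkpoint=True)[1])  # stores [1]
```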

Loss spikes? Enable loss watchdog (v0.24.0+):

```yaml
training:
  loss_watchdog:
    enabled: true
    max_spike: 2.0
```
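A watchdog of this kind typically tracks a running average of the loss and flags any step whose loss jumps past `max_spike` times that average. A sketch of the idea (not soup's implementation):

```python
def watch_losses(losses, max_spike=2.0):
    """Yield (step, loss, is_spike) where a spike exceeds max_spike * mean."""
    mean = None
    for step, loss in enumerate(losses):
        spike = mean is not None and loss > max_spike * mean
        if not spike:  # keep spikes out of the running average
            mean = loss if mean is None else 0.9 * mean + 0.1 * loss
        yield step, loss, spike

flags = [s for _, _, s in watch_losses([1.0, 0.9, 0.8, 5.0, 0.7])]
print(flags)  # [False, False, False, True, False]
```

On a flagged step the trainer can skip the update or roll back to a recent checkpoint rather than let one bad batch destabilize training.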

Related

  • [Export to GGUF and Ollama](/docs/export-to-gguf-ollama)
  • [Training backends](/docs/backends)