# Fine-tune Llama 3.1 with LoRA using Soup CLI
This guide shows how to fine-tune Meta Llama 3.1 8B with LoRA adapters on a custom dataset using Soup CLI. You'll go from install to inference in under 10 minutes on a single GPU.
## Why LoRA on Llama 3.1?

LoRA (Low-Rank Adaptation) trains only a small fraction (~0.1%) of the model's parameters, which means:
- Train 8B parameter Llama 3.1 on a single 24GB GPU (RTX 4090, A10)
- Checkpoints are tiny (~100MB instead of 16GB)
- Faster training and easier experimentation
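The "small fraction" claim can be sanity-checked with back-of-envelope arithmetic from the published Llama 3.1 8B dimensions (hidden size 4096, 32 layers, grouped-query attention with a 1024-dim K/V projection), assuming LoRA is applied to the four attention projections as in the config later in this guide:

```python
# Rough estimate of LoRA trainable parameters for Llama 3.1 8B,
# adapting q/k/v/o projections with rank r=16.
hidden = 4096   # hidden size
kv_dim = 1024   # k/v projection output dim (grouped-query attention)
layers = 32
r = 16          # LoRA rank

def lora_params(d_in, d_out, r):
    # Each adapted Linear(d_in, d_out) gains two low-rank factors:
    # A is (r x d_in), B is (d_out x r) -> r*(d_in + d_out) params.
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total:,} trainable params")                 # ~13.6M
print(f"{100 * total / 8_030_000_000:.2f}% of 8B")   # ~0.17%
```

About 13.6M trainable parameters, versus roughly 8 billion frozen ones — which is why the checkpoints stay small and the optimizer state fits in consumer VRAM.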
## 1. Install
```bash
pip install 'soup-cli[fast]'
```

The `[fast]` extra adds Unsloth for a 2–5× training speedup on Llama-family models.
## 2. Prepare your dataset

Use the Alpaca format (a JSON list of `instruction`/`input`/`output` triples):
```json
[
  {
    "instruction": "Summarize the following text.",
    "input": "Soup CLI is a fine-tuning toolkit...",
    "output": "Soup CLI is an open-source LLM fine-tuning tool."
  }
]
```

Save this as `train.json`.
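Before training, it's worth verifying that every record carries the three required keys — a malformed record partway through a long run is an avoidable failure. A minimal validation sketch (the `validate_alpaca` helper is illustrative, not part of Soup CLI):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def validate_alpaca(path):
    """Return the record count, or raise ValueError on a bad record."""
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("dataset must be a JSON list")
    for i, record in enumerate(data):
        missing = REQUIRED - record.keys()
        if missing:
            raise ValueError(f"record {i} missing keys: {sorted(missing)}")
    return len(data)

# Example:
# print(validate_alpaca("train.json"))
```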
## 3. Create the config

Save the following as `llama31.yaml`:
```yaml
base:
  model: meta-llama/Meta-Llama-3.1-8B-Instruct
  task: sft
data:
  train: train.json
  format: alpaca
training:
  backend: unsloth
  epochs: 3
  learning_rate: 2.0e-4
  batch_size: 2
  gradient_accumulation_steps: 8
lora:
  enabled: true
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules: [q_proj, k_proj, v_proj, o_proj]
```

## 4. Train
```bash
soup train --config llama31.yaml
```

Soup auto-detects the GPU, enables FlashAttention, and trains the LoRA adapters. Expect roughly 20 minutes for 1k examples on an RTX 4090.
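The run length follows directly from the config: `batch_size: 2` with `gradient_accumulation_steps: 8` gives an effective batch of 16, so a 1,000-example dataset yields about 63 optimizer steps per epoch. As arithmetic (the 1,000-example count is the illustrative figure from above, not a requirement):

```python
import math

# Optimizer-step count implied by the config above.
batch_size = 2
grad_accum = 8
examples = 1_000
epochs = 3

effective_batch = batch_size * grad_accum                 # 16
steps_per_epoch = math.ceil(examples / effective_batch)   # 63
total_steps = steps_per_epoch * epochs                    # 189
print(effective_batch, steps_per_epoch, total_steps)
```

If you change `batch_size` to fit memory, adjust `gradient_accumulation_steps` in the opposite direction to keep the effective batch (and thus the learning-rate behavior) comparable.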
## 5. Chat with your fine-tuned model
```bash
soup chat --adapter ./runs/llama31/latest
```

## 6. Export for deployment
```bash
# Merge LoRA into the base model and export GGUF for Ollama
soup export --adapter ./runs/llama31/latest --format gguf --quant q4_k_m
```

## Troubleshooting
Out of memory? Enable QLoRA (4-bit base model):

```yaml
training:
  quant: 4bit
lora:
  enabled: true
```

Slow training? Ensure `backend: unsloth` is set and that you installed `soup-cli[fast]`.
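To see why 4-bit quantization rescues a 24GB card, compare the weight memory of the frozen base model alone (a back-of-envelope sketch — activations, the KV cache, and LoRA optimizer state add several GB on top):

```python
# Approximate VRAM for the frozen base weights of Llama 3.1 8B.
params = 8_030_000_000  # published parameter count, ~8.03B

fp16_gb = params * 2 / 1e9    # 2 bytes per weight in fp16/bf16
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight

print(f"fp16 weights: ~{fp16_gb:.1f} GB")   # ~16 GB
print(f"4-bit weights: ~{int4_gb:.1f} GB")  # ~4 GB
```

At fp16, weights alone nearly fill a 24GB GPU before any activations; at 4-bit they leave ample headroom for the LoRA adapters and training overhead.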
## Next steps
- [Export to GGUF and Ollama](/docs/export-to-gguf-ollama)
- [Multi-GPU training with DeepSpeed](/docs/multi-gpu-deepspeed)
- [Training methods reference](/docs/training)