## Experiment Tracking

Every `soup train` run is automatically tracked in a local SQLite database (`~/.soup/experiments.db`).
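Because the tracker is plain SQLite, you can also query it directly with any SQLite client. A minimal sketch using Python's `sqlite3` — note the `runs` table and its columns below are illustrative assumptions, not soup's documented schema:

```python
import sqlite3

# Illustrative only: an in-memory stand-in for ~/.soup/experiments.db.
# The table and column names here are assumptions for demonstration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE runs (run_id TEXT PRIMARY KEY, task TEXT, model TEXT, "
    "status TEXT, final_loss REAL)"
)
db.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
    [
        ("run_1", "sft", "llama-3.1-8b", "completed", 0.82),
        ("run_2", "sft", "llama-3.1-8b", "completed", 0.74),
    ],
)

# Find the completed run with the lowest final loss.
best = db.execute(
    "SELECT run_id, final_loss FROM runs "
    "WHERE status = 'completed' ORDER BY final_loss LIMIT 1"
).fetchone()
print(best)  # ('run_2', 0.74)
```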
### List Runs

```bash
soup runs
```

Shows all training runs with task, model, status, and final loss.
### Run Details

```bash
soup runs show run_20260223_143052_a1b2
```

Detailed info including config, metrics, and an ASCII loss curve.
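An ASCII loss curve takes only a few lines to render. A rough sketch of the general idea (not soup's actual rendering code):

```python
def ascii_loss_curve(losses, height=5, char="*"):
    """Bucket each loss into one of `height` rows and plot column by column."""
    lo, hi = min(losses), max(losses)
    span = (hi - lo) or 1.0
    rows = [[" "] * len(losses) for _ in range(height)]
    for col, loss in enumerate(losses):
        # Row 0 is the top of the chart (highest loss), row height-1 the bottom.
        row = round((hi - loss) / span * (height - 1))
        rows[row][col] = char
    return "\n".join("".join(r) for r in rows)

print(ascii_loss_curve([2.1, 1.6, 1.2, 0.9, 0.8]))
```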
### Compare Runs

```bash
soup runs compare run_1 run_2
```

Side-by-side comparison of two runs with loss curves and metrics.
### Delete Runs

```bash
soup runs delete run_1
```

## Model Evaluation
Soup includes a comprehensive evaluation platform (v0.19.0+):
```bash
pip install 'soup-cli[eval]'

# Run benchmarks (mmlu, gsm8k, hellaswag, etc.)
soup eval benchmark --model ./output --benchmarks mmlu,gsm8k

# Custom eval tasks from JSONL
soup eval custom --model ./output --tasks ./eval_tasks.jsonl

# LLM-as-a-judge evaluation
soup eval judge --model ./output --prompts ./prompts.jsonl --judge gpt-4o

# Auto-eval from soup.yaml config
soup eval auto --config soup.yaml

# Compare eval results between runs
soup eval compare run_1 run_2

# Local leaderboard across models
soup eval leaderboard

# Human A/B evaluation with Elo ratings
soup eval human --model-a ./model_v1 --model-b ./model_v2 --prompts ./prompts.jsonl
```

## Hyperparameter Sweep
Search for the best hyperparameters:
```bash
# Grid search
soup sweep --config soup.yaml --param lr=1e-5,2e-5,5e-5 --param lora_r=8,16,32

# Random search with max runs
soup sweep --config soup.yaml --param lr=1e-5,2e-5,5e-5 --strategy random --max-runs 5

# Preview without running
soup sweep --config soup.yaml --param lr=1e-5,2e-5 --dry-run

# Early stopping: skip remaining runs if loss exceeds 1.5x best
soup sweep --config soup.yaml --param lr=1e-5,2e-5,5e-5 --early-stop 1.5
```

## Model Comparison
Compare outputs of two models side-by-side:
```bash
soup diff --model-a ./model_v1 --model-b ./model_v2 --prompt "Explain gravity"
soup diff --model-a ./base --model-b ./finetuned --prompts test_prompts.jsonl
soup diff --model-a ./a --model-b ./b --prompts prompts.txt --output results.jsonl
```

## Batch Inference
Run a model on a list of prompts:
```bash
soup infer --model ./output --input prompts.jsonl --output results.jsonl
soup infer --model ./output --input prompts.txt --output results.jsonl \
  --max-tokens 512 --temperature 0.3
```

Output is JSONL with `prompt`, `response`, and `tokens_generated` fields.
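Since each output line is a standalone JSON object, results are easy to post-process. A small sketch — the field names match the description above, and the sample values are illustrative:

```python
import json

# Two illustrative result lines in the output format described above.
raw = """\
{"prompt": "Explain gravity", "response": "Gravity is...", "tokens_generated": 87}
{"prompt": "What is DNA?", "response": "DNA is...", "tokens_generated": 64}
"""

results = [json.loads(line) for line in raw.splitlines() if line.strip()]
total_tokens = sum(r["tokens_generated"] for r in results)
print(len(results), total_tokens)  # 2 151
```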
## Training Profiler (v0.23.0+)
Estimate memory, speed, and GPU requirements before training:
```bash
soup profile --model meta-llama/Llama-3.1-8B --task sft --quantization 4bit
soup profile --config soup.yaml
```

Shows estimated GPU memory, training speed, and hardware recommendations.
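As a sanity check on the profiler's output, you can do rough weights-only memory math yourself (params × bits / 8 bytes). This is not soup's formula — real usage adds optimizer state, activations, and LoRA parameters on top:

```python
def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate memory for model weights alone: params * (bits / 8) bytes."""
    return n_params_billion * 1e9 * bits / 8 / 1024**3

# 8B parameters at 4-bit quantization: the weights alone need about 3.7 GB.
print(round(weight_memory_gb(8, 4), 1))  # 3.7
```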
## Adapter Management (v0.22.0+)
```bash
# Scan directory for LoRA adapters
soup adapters list --path ./experiments

# Show adapter metadata (base model, rank, size)
soup adapters info ./output

# Compare two adapters side-by-side
soup adapters compare ./adapter_v1 ./adapter_v2
```

## Logging Integrations
### TensorBoard
```bash
soup train --config soup.yaml --tensorboard
tensorboard --logdir ./output/runs/
```

### Weights & Biases
```bash
soup train --config soup.yaml --wandb
```

> `--tensorboard` and `--wandb` cannot be used together.