## Preference Variety (v0.40.0)
Five preference losses live behind one config knob. Pick a loss without renaming your task, anneal β over training, periodically refresh the frozen reference, and blend losses with a forward-looking multi-objective surface.
### BCO (Binary Classifier Optimization)
Same input format as DPO; rows are split internally into TRL's unpaired BCO schema (`{prompt, completion, label}`).
```yaml
task: bco
data:
  train: ./data/preferences.jsonl
  format: dpo
training:
  bco_beta: 0.1
  lora: { r: 64, alpha: 16 }
  quantization: 4bit
```

`bco_beta` defaults to 0.1 (gt=0). A new `soup init --template bco` template ships in the box.
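The pair → unpaired split is mechanical; here is a minimal sketch of the conversion (function and field handling are illustrative, not the tool's internal code), assuming the usual DPO row layout of prompt/chosen/rejected:

```python
def dpo_pairs_to_bco_rows(pairs):
    """Split DPO-style pairs into unpaired rows: one desirable (label=True)
    and one undesirable (label=False) completion per pair."""
    rows = []
    for pair in pairs:
        rows.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        rows.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return rows


pairs = [{"prompt": "Summarize the report.",
          "chosen": "A tight two-sentence summary.",
          "rejected": "An off-topic reply."}]
print(dpo_pairs_to_bco_rows(pairs))  # two unpaired rows from one pair
```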
### Unified preference dispatcher
Use `task: preference` + `training.preference_loss` to swap losses without touching `task`. Hyperparameter sweeps over the loss type itself become trivial.
```yaml
task: preference
data:
  train: ./data/preferences.jsonl
  format: dpo
training:
  preference_loss: dpo  # or simpo, orpo, ipo, bco
```

Legacy `task: dpo` / `task: simpo` / etc. remain first-class; the unified surface is additive.
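Conceptually the dispatcher is just a lookup from the loss name to a trainer builder; the sketch below uses stub builders so it runs standalone (all names are illustrative placeholders, not the shipped dispatcher), and shows why sweeping the loss type is as cheap as sweeping any other hyperparameter:

```python
from typing import Callable, Dict

# Stub builders so the dispatch itself is runnable; the real builders wrap the
# corresponding TRL trainers.
def _stub(name: str) -> Callable[[dict], str]:
    return lambda cfg: f"<{name} trainer on {cfg['data']['train']}>"

PREFERENCE_TRAINERS: Dict[str, Callable[[dict], str]] = {
    loss: _stub(loss) for loss in ("dpo", "ipo", "simpo", "orpo", "bco")
}

def build_preference_trainer(cfg: dict):
    loss = cfg["training"]["preference_loss"]
    if loss not in PREFERENCE_TRAINERS:
        raise ValueError(f"unknown preference_loss {loss!r}; expected one of {sorted(PREFERENCE_TRAINERS)}")
    return PREFERENCE_TRAINERS[loss](cfg)

# Sweeping the loss type is just a loop over config values.
base = {"task": "preference",
        "data": {"train": "./data/preferences.jsonl", "format": "dpo"},
        "training": {}}
for loss in ("dpo", "simpo", "bco"):
    base["training"]["preference_loss"] = loss
    print(build_preference_trainer(base))
```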
### KL-controlled DPO variants
Anneal β over training and periodically refresh the reference model.
```yaml
task: dpo  # or task: preference + preference_loss: dpo, or task: ipo
training:
  dpo_beta: 0.1
  dpo_beta_schedule: linear  # linear | cosine | exponential
  dpo_beta_end: 0.01
  dpo_ref_regen_epochs: 2  # copy student → ref model every 2 epochs
```

Both controls are gated to DPO-family tasks (`dpo`, `ipo`, or `preference` with `preference_loss` in {dpo, ipo}). Transformers backend only.
The `BetaScheduleCallback` resolves `total_steps` lazily in `on_train_begin` so the schedule sees the real `state.max_steps` populated by HF Trainer. Epoch-0 regen is suppressed (avoids copying the untrained student).
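A hedged sketch of that callback shape, using the standard `transformers.TrainerCallback` hooks (class and attribute names are assumptions, not the shipped implementation, and the exponential curve is one plausible choice):

```python
import math
from transformers import TrainerCallback

class BetaScheduleSketch(TrainerCallback):
    """Anneal β from beta_start to beta_end over training (illustrative only)."""

    def __init__(self, trainer, beta_start=0.1, beta_end=0.01, schedule="linear"):
        self.trainer = trainer
        self.beta_start, self.beta_end, self.schedule = beta_start, beta_end, schedule
        self.total_steps = None

    def on_train_begin(self, args, state, control, **kwargs):
        # state.max_steps is only final once HF Trainer has set up the run,
        # hence the lazy resolution here rather than in __init__.
        self.total_steps = max(state.max_steps, 1)

    def on_step_begin(self, args, state, control, **kwargs):
        t = min(state.global_step / self.total_steps, 1.0)
        if self.schedule == "linear":
            frac = t
        elif self.schedule == "cosine":
            frac = 0.5 * (1.0 - math.cos(math.pi * t))
        else:  # "exponential": assumed decay shape, saturating toward beta_end
            frac = 1.0 - math.exp(-5.0 * t)
        # Assumed hook: the wrapped trainer exposes a mutable `beta` attribute
        # (TRL's DPOTrainer copies its config's beta onto self.beta at init time).
        self.trainer.beta = self.beta_start + frac * (self.beta_end - self.beta_start)
```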
### Multi-objective preference loss (schema-only in v0.40.0)
```yaml
task: preference
training:
  preference_loss_weights: {dpo: 0.7, bco: 0.3}
```

The schema validates 2–5 entries summing to 1.0 (±1e-6). Single-entry mappings are rejected with an actionable message pointing at scalar `preference_loss`. Mutually exclusive with scalar `preference_loss`. Rejected on the MLX backend.
Live runtime weighted-loss combination is wired in v0.40.1; v0.40.0 fails fast with an actionable `NotImplementedError` if you actually try to train (same stub-then-live pattern as v0.27.0 MII / v0.37.0 multipack / v0.38.0 quant menu / v0.39.0 ReLoRA).
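What those schema rules amount to, as a standalone sketch (the function name and error strings are illustrative; the real checks live in the config schema):

```python
def validate_preference_loss_weights(weights, scalar_loss=None, backend="transformers"):
    """Illustrative stand-in for the v0.40.0 schema rules; not the shipped validator."""
    if scalar_loss is not None:
        raise ValueError("preference_loss_weights is mutually exclusive with scalar preference_loss")
    if backend == "mlx":
        raise ValueError("preference_loss_weights is not supported on the MLX backend")
    if len(weights) == 1:
        raise ValueError("a single entry is just a scalar loss; set preference_loss instead")
    if not 2 <= len(weights) <= 5:
        raise ValueError("preference_loss_weights must have 2-5 entries")
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("preference_loss_weights must sum to 1.0 (±1e-6)")
    return weights


validate_preference_loss_weights({"dpo": 0.7, "bco": 0.3})  # passes
```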
### Stats
- Net +118 tests (4538 → 4656 across 136 files)
- BCO trainer + dispatcher + β schedule math + ref-model regen TOCTOU + multi-objective schema bounds
### See also
- [DPO training guide](/docs/dpo-training-guide) — preference dataset format
- [Trace-to-preference](/docs/trace-to-preference) — harvest pairs from production logs
- [Registry](/docs/registry) — track preference variants in the lineage DAG