# RLVR — Reinforcement Learning from Verifiable Rewards
v0.25.0 adds deterministic reward signals for GRPO training on math, code, and JSON-schema tasks. No reward model, no human labels.
## Enable it

```yaml
base: Qwen/Qwen3-8B
task: grpo
data:
  train: math_problems.jsonl
  format: chatml
training:
  reward_fn: verifiable
  verifiable_domain: math  # math | code | json_schema
  lr: 1e-5
  epochs: 1
  lora: { r: 16, alpha: 32 }
```

## Built-in reward functions
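Because `verifiable_domain` is restricted to three values, the config can be checked before any training starts. A minimal sketch of that kind of validation (hypothetical function and field access, not the actual soup_cli config model):

```python
# Hypothetical validator mirroring the YAML fields above; the real
# soup_cli implementation uses a Pydantic model instead.
VALID_DOMAINS = ("math", "code", "json_schema")

def validate_training_config(cfg: dict) -> dict:
    """Reject configs whose verifiable_domain is not a known value."""
    training = cfg.get("training", {})
    if training.get("reward_fn") == "verifiable":
        domain = training.get("verifiable_domain")
        if domain not in VALID_DOMAINS:
            raise ValueError(
                f"verifiable_domain must be one of {VALID_DOMAINS}, got {domain!r}"
            )
    return cfg

cfg = {"training": {"reward_fn": "verifiable", "verifiable_domain": "math"}}
validate_training_config(cfg)  # passes; an unknown domain raises ValueError
```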
| Domain | What it does | Sandbox |
|---|---|---|
| `math` | Regex-extracts the final answer, compares numerically with tolerance | pure Python |
| `code` | Executes Python against expected outputs | subprocess, 5 s timeout, no network, restricted builtins, 10 KB output cap |
| `json_schema` | Validates output against a JSON Schema and scores completeness | pure Python |
All three live in `soup_cli/trainer/rewards.py` and are routed through the existing GRPO trainer.
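As a rough sketch of the `math` row above, a regex-based reward might extract the last number in a completion and compare it to ground truth within a tolerance (hypothetical function, not the actual `rewards.py` code):

```python
import re

def math_reward(completion: str, answer: float, tol: float = 1e-6) -> float:
    """Sketch of a verifiable math reward: regex-extract the final
    number from the model output, compare numerically, no eval()."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not matches:
        return 0.0  # no number found: zero reward
    try:
        predicted = float(matches[-1])
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - answer) <= tol else 0.0

print(math_reward("The answer is 42.", 42.0))     # 1.0
print(math_reward("Roughly 3.14 I think", 2.71))  # 0.0
```

A binary 0/1 signal like this is deterministic, which is what lets GRPO train without a learned reward model.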
## Generate training data

```bash
soup data generate --template verifiable --domain math --count 500
```

The `verifiable` template in `soup_cli/data/templates/verifiable.py` emits problems with ground-truth answers you can verify at training time.
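A record with a ground-truth answer is all the trainer needs to score completions. A minimal sketch of writing such a JSONL file by hand (hypothetical field names; the actual schema emitted by the template may differ):

```python
import json

# Hypothetical record shape: each problem carries its verifiable answer.
problems = [
    {"prompt": "What is 17 * 3?", "answer": "51", "domain": "math"},
    {"prompt": "What is 144 / 12?", "answer": "12", "domain": "math"},
]

with open("math_problems.jsonl", "w") as f:
    for p in problems:
        f.write(json.dumps(p) + "\n")  # one JSON object per line
```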
## Safety

- `code_exec` runs each completion in a short-lived subprocess with no network access and a restricted builtin set.
- `math_verify` never uses `eval()` on model output — answers are extracted by regex.
- `verifiable_domain` is a Pydantic `Literal`, so arbitrary strings can't reach the dispatcher.
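The subprocess pattern described above can be sketched as follows. This is a simplified illustration, not the actual `code_exec` implementation: it shows the short-lived process, wall-clock timeout, and output cap, while network isolation and builtin restriction require additional OS-level or interpreter-level measures not shown here.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: float = 5.0, output_cap: int = 10_000) -> str:
    """Sketch: run untrusted code in a fresh interpreter process,
    kill it after timeout_s seconds, and cap captured stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ""  # timed-out runs score zero downstream
    return result.stdout[:output_cap]

print(run_sandboxed("print(2 + 2)"))
```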
## See also
- [Training methods](/docs/training) — GRPO
- [Autopilot](/docs/autopilot) — `--goal reasoning` picks RLVR when your data has ground truth