Quant Menu II (v0.53.0)

The full advanced-quantization surface. This is a schema-only release; the live llama.cpp imatrix pipeline and the serve / merge / export writers land in v0.53.1.

Unsloth Dynamic 2.0 GGUF ladder

14-entry closed allowlist:

UD-Q8_K_XL · UD-Q6_K_XL · UD-Q5_K_XL · UD-Q4_K_XL · UD-Q3_K_XL · UD-Q2_K_XL · UD-IQ4_XS · UD-IQ3_M · UD-IQ3_XXS · UD-IQ2_M · UD-IQ2_XS · UD-IQ2_XXS · UD-IQ1_M · UD-IQ1_S

validate_ud_gguf_format matches case-insensitively and normalises input to the canonical spelling.

```bash
soup export --format gguf --quant UD-Q4_K_XL --output ./model.gguf
```
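The allowlist-plus-normalisation behaviour can be sketched as follows. Only the function name validate_ud_gguf_format and the 14 format strings come from this page; the body and signature are assumptions:

```python
# Hypothetical sketch of the UD-GGUF allowlist check; the real
# validate_ud_gguf_format may have a different signature.
UD_GGUF_FORMATS = frozenset({
    "UD-Q8_K_XL", "UD-Q6_K_XL", "UD-Q5_K_XL", "UD-Q4_K_XL",
    "UD-Q3_K_XL", "UD-Q2_K_XL", "UD-IQ4_XS", "UD-IQ3_M",
    "UD-IQ3_XXS", "UD-IQ2_M", "UD-IQ2_XS", "UD-IQ2_XXS",
    "UD-IQ1_M", "UD-IQ1_S",
})

# Canonical spellings keyed by their case-folded form.
_CANONICAL = {f.casefold(): f for f in UD_GGUF_FORMATS}

def validate_ud_gguf_format(quant: str) -> str:
    """Return the canonical spelling, or raise on unknown formats."""
    try:
        return _CANONICAL[quant.strip().casefold()]
    except KeyError:
        raise ValueError(f"unknown UD-GGUF format: {quant!r}") from None
```

So `--quant ud-q4_k_xl` would be accepted and rewritten to `UD-Q4_K_XL`, while anything outside the 14-entry list is rejected.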

IQ + Apple/ARM GGUF flavours

  • 12-entry IQ family (IQ1/2/3/4 — including IQ4_NL non-linear)
  • 10-entry Apple/ARM-friendly set (Q4_0_4_4, Q4_NL, Q5_K_M, etc.)

Both families are wrapped in read-only MappingProxyType metadata.
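A minimal sketch of what that wrapping buys: reads work like a normal dict, but the table cannot be mutated at runtime. The per-format fields shown are assumptions, not the library's actual metadata schema:

```python
from types import MappingProxyType

# Illustrative subset of the IQ family; field names are hypothetical.
IQ_FAMILY = MappingProxyType({
    "IQ4_NL":  {"bits": 4.5,  "non_linear": True},
    "IQ2_XXS": {"bits": 2.06, "non_linear": False},
})

# Lookups behave normally; assignment raises TypeError, so the quant
# tables stay immutable once constructed.
```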

KV cache types

```yaml
training:
  kv_cache_type: fp8   # q8_0 | bf16 | f16 | fp8
```

FP8 is Hopper-only — cross-validator rejects fp8 on the MLX backend; SM-capability check fires at serve construction.
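The gate described above can be sketched like this (function name and shape are hypothetical; Hopper corresponds to compute capability SM 9.0):

```python
# Sketch of the fp8 KV-cache gate: reject MLX outright, and reject any
# GPU below SM 9.0 (Hopper) at serve construction.
def check_fp8_kv_cache(backend: str, sm_major: int, sm_minor: int) -> None:
    if backend == "mlx":
        raise ValueError("kv_cache_type=fp8 is not supported on the MLX backend")
    if (sm_major, sm_minor) < (9, 0):
        raise ValueError(
            f"kv_cache_type=fp8 requires SM >= 9.0 (Hopper); got SM {sm_major}.{sm_minor}"
        )
```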

FP8 attention, NVFP4, native unsloth_bnb_4bit

```yaml
training:
  fp8_attention: true       # requires quantization_aware='fp8', non-MLX
  nvfp4: true               # CUDA + text only; Blackwell SM ≥ 12 (runtime check)
  unsloth_bnb_4bit: true    # requires backend='unsloth' + quantization='4bit'
```
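The comments above describe cross-field constraints; a hedged sketch of how such a validator might collect them (key names taken from the YAML, everything else assumed):

```python
# Hypothetical cross-field checks mirroring the YAML comments above.
def validate_quant_flags(cfg: dict) -> list[str]:
    errors = []
    if cfg.get("fp8_attention"):
        if cfg.get("quantization_aware") != "fp8":
            errors.append("fp8_attention requires quantization_aware='fp8'")
        if cfg.get("backend") == "mlx":
            errors.append("fp8_attention is not supported on the MLX backend")
    if cfg.get("nvfp4") and cfg.get("modality", "text") != "text":
        errors.append("nvfp4 is CUDA + text only")
    if cfg.get("unsloth_bnb_4bit"):
        if cfg.get("backend") != "unsloth" or cfg.get("quantization") != "4bit":
            errors.append("unsloth_bnb_4bit requires backend='unsloth' and quantization='4bit'")
    return errors
```

Collecting all violations into a list, rather than raising on the first, lets the user fix an invalid config in one pass.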

LF / Axolotl parity

```yaml
training:
  bnb_4bit_use_double_quant: true   # requires quantization='4bit'
  llm_int8: true                    # asserts quantization='8bit'
  quantize_ref_model: true          # extends quant to ref model (DPO/IPO/SimPO/ORPO/BCO/KTO/GRPO/PPO/preference)
  quantize_reward_model: true       # PPO + reward_model tasks
```
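The quantize_ref_model task list above is a closed set; one way to sketch the check (set contents taken from the comment, function and casing rule assumed):

```python
# Tasks whose reference model inherits quantization when
# quantize_ref_model is set (from the YAML comment above).
REF_QUANT_TASKS = frozenset({
    "dpo", "ipo", "simpo", "orpo", "bco", "kto", "grpo", "ppo", "preference",
})

def ref_model_is_quantized(task: str, quantize_ref_model: bool) -> bool:
    return quantize_ref_model and task.lower() in REF_QUANT_TASKS
```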

Advanced save formats

```bash
soup merge --save-format 4bit          # | 4bit_forced
soup export --format torchao --quant-config quant.yaml
```

--save-format 4bit_forced writes a single merged BNB-4bit checkpoint directly, skipping the dequant / merge / requant cycle.

--quant-config accepts a closed TorchAO allowlist: Int4WeightOnly / Int8DynActInt4 / Float8DynActFloat8 / NVFP4.
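The four allowlisted names come from this page; the validator around them is a sketch:

```python
# Closed TorchAO allowlist for --quant-config (validator shape assumed).
TORCHAO_QUANT_CONFIGS = frozenset({
    "Int4WeightOnly", "Int8DynActInt4", "Float8DynActFloat8", "NVFP4",
})

def validate_torchao_config(name: str) -> str:
    if name not in TORCHAO_QUANT_CONFIGS:
        raise ValueError(f"unsupported TorchAO quant config: {name}")
    return name
```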

Stats

  • Net +157 tests (7,453 → 7,610 across 179 files)
  • 154 tests in test_v0530.py
  • 5 review agents ran in parallel; every CRITICAL / HIGH / MEDIUM / LOW finding was fixed or documented

See also

  • [Quant Menu (v0.38)](/docs/quant-menu) — the original 9-format menu
  • [Speed & Memory](/docs/training-speed-memory) — FP8 training, Cut CE, kernel auto-compose
  • [Multi-GPU](/docs/multi-gpu) — ZeRO++ / FSDP2 / pipeline