Multipack — FFD Bin-Packing Sampler (v0.37.0)
This is Soup's largest single throughput win for chat fine-tuning on uneven-length data. Instead of padding every sample to max_length, Multipack uses First-Fit-Decreasing (FFD) bin packing to group variable-length samples into bins approaching batch_size × max_seq_length, eliminating padding waste.
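To make the packing strategy concrete, here is a minimal FFD sketch: sort samples longest-first, then drop each one into the first bin with room. The function name and list-based bin representation are illustrative, not Soup's actual implementation.

```python
def ffd_pack(lengths, capacity):
    """First-Fit-Decreasing: place each item (longest first) into the
    first bin with enough remaining room; open a new bin when none fits."""
    bins = []  # each bin: [remaining_capacity, [sample_indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        for b in bins:
            if lengths[i] <= b[0]:
                b[0] -= lengths[i]
                b[1].append(i)
                break
        else:  # no existing bin fits
            bins.append([capacity - lengths[i], [i]])
    return [b[1] for b in bins]

# five samples of token lengths 7,5,4,3,2 packed into capacity-8 bins
print(ffd_pack([7, 5, 4, 3, 2], 8))  # → [[0], [1, 3], [2, 4]]
```

FFD is a classic approximation for bin packing: it never needs more than roughly 11/9 of the optimal bin count, which is why packers favor it over exact (NP-hard) packing.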
```yaml
training:
  multipack: true
  packing: false   # mutually exclusive with multipack
```
How it composes
- Multipack picks WHICH samples go together (FFD packing).
- `packing_cross_doc_attn_mask` sets HOW the attention mask is built (block-diagonal causal).
- The two compose cleanly: enable both for FlashAttention-incompatible backends; the FA varlen path is auto-selected when FlashAttention is available.
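The block-diagonal causal mask mentioned above can be sketched as follows: each position may attend only to earlier positions within its own packed document, never across a document boundary. This is an illustrative NumPy construction, not Soup's 4D mask builder.

```python
import numpy as np

def block_diag_causal_mask(lengths):
    """Causal mask that also blocks attention across packed-document
    boundaries: i attends to j only if j <= i and both positions
    belong to the same document."""
    total = sum(lengths)
    # document index for every position in the packed sequence
    doc_id = np.repeat(np.arange(len(lengths)), lengths)
    same_doc = doc_id[:, None] == doc_id[None, :]
    causal = np.tril(np.ones((total, total), dtype=bool))
    return same_doc & causal

# two documents of lengths 2 and 3 packed into one sequence of 5 tokens
print(block_diag_causal_mask([2, 3]).astype(int))
```

The FA varlen path achieves the same effect without materializing this O(total²) mask, which is why it is preferred when FlashAttention is available.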
Architecture allowlist
18 supported architectures: Llama 3.x, Qwen 2/3, Mistral, Gemma 2/3, Phi 3/4, DeepSeek V2/V3, Mixtral, Falcon, StableLM, SmolLM2.
Unknown architectures fail loudly at config-load instead of silently no-opping (critical fix vs Axolotl's silent-miss footgun).
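A fail-loud gate along these lines runs at config-load time; the set contents and function name here are abridged and hypothetical.

```python
# abridged allowlist for illustration; the real list covers 18 architectures
SUPPORTED_ARCHS = {"LlamaForCausalLM", "Qwen2ForCausalLM", "MistralForCausalLM"}

def check_multipack_arch(architecture: str) -> None:
    """Raise at config-load time instead of silently skipping packing."""
    if architecture not in SUPPORTED_ARCHS:
        raise ValueError(
            f"multipack: true is not supported for {architecture!r}; "
            f"supported architectures: {sorted(SUPPORTED_ARCHS)}"
        )
```

The silent-miss alternative would leave training running at padded-batch speed with no signal that packing never engaged, which is exactly the footgun the loud failure removes.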
Scope
Multipack is sft / pretrain only on the transformers backend. Preference / RLHF trainers and MLX backend get distinct error messages naming the actual reason. Live wiring of the sampler into HF Trainer's _get_train_sampler lands in v0.37.1; v0.37.0 ships the schema gate + helper builder (mirrors v0.27.0 MII stub-then-live pattern).
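The scope gate can be sketched as two checks with reason-naming messages; the function name and exact wording are illustrative, not Soup's actual errors.

```python
def validate_multipack_scope(task: str, backend: str) -> None:
    """Reject unsupported task/backend combinations with a message
    naming the actual reason, not a generic failure."""
    if backend != "transformers":
        raise ValueError(
            f"multipack requires the transformers backend; got {backend!r}"
        )
    if task not in ("sft", "pretrain"):
        raise ValueError(
            f"multipack supports sft/pretrain only; got task {task!r}"
        )

validate_multipack_scope("sft", "transformers")  # passes silently
```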
DoS hardening
- FFD packer caps at 1M items (algorithm is O(N²) worst-case)
- 4D mask builder caps allocations at 2³¹ cells
- Chat-template Jinja analyzer caps at 128 KB
- Every numeric input rejects bool explicitly
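The explicit bool rejection matters because in Python `bool` is a subclass of `int`, so a plain `isinstance(x, int)` check would accept `True`. A minimal validator combining that check with the item-count cap (names hypothetical):

```python
MAX_ITEMS = 1_000_000  # FFD is O(N²) worst-case, so cap the item count

def checked_count(n) -> int:
    """Validate a numeric input: reject bool explicitly, since
    isinstance(True, int) is True in Python, then enforce the cap."""
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(f"expected int, got {type(n).__name__}")
    if not 0 < n <= MAX_ITEMS:
        raise ValueError(f"count must be in (0, {MAX_ITEMS}], got {n}")
    return n

print(checked_count(5))  # → 5
```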
Jinja template analyzer
The JinjaTemplateAnalyzer (also v0.37.0) walks chat-template ASTs to discover non-standard message.<field> references (tool_calls, name, weight, train) — used by the v0.36.0 train_on_messages_with_train_field path so per-message training masks are aware of fields beyond role / content. The analyzer parses templates without rendering them, so a crafted soup.yaml cannot trigger SSRF.
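The parse-without-render approach can be sketched with Jinja2's public AST API: `Environment().parse()` produces a syntax tree without executing the template, and `find_all` walks it for `message.<field>` attribute accesses. This is a simplified stand-in for the actual JinjaTemplateAnalyzer.

```python
from jinja2 import Environment, nodes

def message_fields(template_src: str) -> set:
    """Collect every message.<field> reference by walking the parsed AST.
    Environment.parse() never renders, so untrusted templates cannot
    execute code or make network requests during analysis."""
    ast = Environment().parse(template_src)
    fields = set()
    for node in ast.find_all(nodes.Getattr):
        # match attribute access on a bare `message` name
        if isinstance(node.node, nodes.Name) and node.node.name == "message":
            fields.add(node.attr)
    return fields

tmpl = (
    "{% for message in messages %}"
    "{{ message.role }}: {{ message.content }}"
    "{% if message.tool_calls %}[tools]{% endif %}"
    "{% endfor %}"
)
print(sorted(message_fields(tmpl)))  # → ['content', 'role', 'tool_calls']
```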