Smart Inference Server (v0.30.0)

`soup serve` has graduated from a basic OpenAI-compatible wrapper into a production-grade serving stack. Speculative decoding, prefix caching, structured output, dynamic LoRA hot-swap, a continuous-batching dashboard, and OpenTelemetry tracing all ship in v0.30.0.

Speculative decoding

Use a smaller draft model to speed up generation 2-3×.

```bash
# Transformers backend — uses HF assisted generation
soup serve --model ./output --speculative-decoding small-draft-model --spec-tokens 5

# vLLM backend — uses vLLM native speculative decoding
soup serve --model ./output --backend vllm --speculative-decoding small-draft-model

# Auto-pair: Soup picks the draft for you based on the target family
soup serve --model meta-llama/Llama-3.1-70B-Instruct --backend vllm --auto-spec
```

`--auto-spec` handles Llama 3.1 / 3.3 / 4, Qwen 2.5 / 3, Mistral Large, Mixtral, DeepSeek V3 / R1, and Gemma 2 / 3. Models without a known draft pairing print a yellow "no draft" note and fall back to standard decoding.
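The draft-then-verify loop behind speculative decoding is easy to picture. A minimal Python sketch with toy token IDs and stand-in model callables (illustrative only, not Soup's or vLLM's actual implementation):

```python
def speculative_step(draft_propose, target_verify, prefix, k=5):
    """One round of greedy draft-then-verify decoding.

    draft_propose(prefix, k) -> list of k proposed token IDs (cheap draft model)
    target_verify(prefix)    -> next token ID under the target model
    Accepts the longest prefix of the draft that the target agrees with,
    then appends one target token, so each round advances at least one token.
    """
    proposed = draft_propose(prefix, k)
    accepted = []
    for tok in proposed:
        expected = target_verify(prefix + accepted)
        if expected != tok:
            accepted.append(expected)  # target's correction ends the round
            return accepted
        accepted.append(tok)
    # all k draft tokens accepted: take one bonus token from the target
    accepted.append(target_verify(prefix + accepted))
    return accepted
```

When the draft is usually right, most rounds emit k+1 tokens for one target pass over k+1 positions, which is where the 2-3× speedup comes from.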

Prefix caching (vLLM)

For RAG and agent workloads with a shared system prompt:

```bash
soup serve --model ./output --backend vllm --prefix-cache
```

The first request with a given prefix warms the cache; subsequent requests skip the shared prefix compute entirely.
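Conceptually the cache is a map from a hash of the shared prefix to its computed state. A toy Python sketch (a stand-in value replaces the real KV tensors; names are hypothetical):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: maps a hash of the shared prompt prefix to its
    (stand-in) KV state so repeat requests skip recomputation."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix_tokens):
        return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = self._key(prefix_tokens)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute_kv(prefix_tokens)  # warm on first use
        return self._store[key]
```

The first call with a given prefix pays for `compute_kv`; every later call with the same prefix is a dictionary lookup.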

Structured output

Constrain output to a JSON schema or regex pattern.

```bash
# JSON schema (file must live under cwd)
soup serve --model ./output --structured-output json --json-schema product.json

# Regex (length-capped at 2048 chars, null bytes rejected)
soup serve --model ./output --structured-output regex --regex-pattern '\d{3}-\d{4}'
```

Schemas serialised over 64 KB are rejected. JSON schemas must declare a top-level `type` field. Constraints are validated at startup, not per-request.
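The startup checks above can be sketched in a few lines of Python (function names are hypothetical; the limits match the ones stated here):

```python
import json

MAX_SCHEMA_BYTES = 64 * 1024   # serialised-schema cap
MAX_REGEX_CHARS = 2048         # regex length cap

def validate_json_schema(schema: dict) -> None:
    serialised = json.dumps(schema)
    if len(serialised.encode("utf-8")) > MAX_SCHEMA_BYTES:
        raise ValueError("schema serialises over 64 KB")
    if "type" not in schema:
        raise ValueError("schema must declare a top-level 'type' field")

def validate_regex_pattern(pattern: str) -> None:
    if len(pattern) > MAX_REGEX_CHARS:
        raise ValueError("pattern exceeds 2048 characters")
    if "\x00" in pattern:
        raise ValueError("null bytes rejected")
```

Because validation happens once at startup, a malformed schema fails the `soup serve` invocation rather than individual requests.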

Dynamic LoRA hot-swap

Switch the active adapter at runtime without restarting the server.

```bash
soup serve --model base-model --adapters chat=./chat-adapter code=./code-adapter
```

```bash
curl -X POST http://localhost:8000/v1/adapters/activate/chat
# → {"active": "chat", "status": "ok"}

curl -X POST http://localhost:8000/v1/adapters/deactivate
# → {"active": null, "status": "ok"}

curl http://localhost:8000/v1/adapters
# → {"adapters": [{"name": "chat", "active": true}, ...], "active": "chat"}
```

Names must match `^[a-zA-Z0-9][a-zA-Z0-9-]*$`. Activate/deactivate is thread-safe behind a lock.
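A minimal sketch of that registry, assuming the name pattern and lock-guarded activation described above (class and method names are hypothetical, not Soup's internals):

```python
import re
import threading

ADAPTER_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9-]*$")

class AdapterRegistry:
    """Lock-guarded adapter activation; names are validated up front."""

    def __init__(self, names):
        for name in names:
            if not ADAPTER_NAME_RE.match(name):
                raise ValueError(f"bad adapter name: {name!r}")
        self._names = set(names)
        self._active = None
        self._lock = threading.Lock()

    def activate(self, name):
        with self._lock:
            if name not in self._names:
                raise KeyError(name)
            self._active = name
            return {"active": self._active, "status": "ok"}

    def deactivate(self):
        with self._lock:
            self._active = None
            return {"active": None, "status": "ok"}

    def list(self):
        with self._lock:
            return {
                "adapters": [{"name": n, "active": n == self._active}
                             for n in sorted(self._names)],
                "active": self._active,
            }
```

Holding one lock across activate, deactivate, and list keeps the "active" flag consistent even when requests race.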

Continuous-batching dashboard + `/metrics`

```bash
soup serve --model ./output --dashboard
```

```bash
curl http://localhost:8000/metrics
# {
#   "requests_total": 1234,
#   "tokens_generated_total": 456789,
#   "active_requests": 3,
#   "latency_p50_ms": 185.2,
#   "latency_p95_ms": 720.0,
#   "latency_samples": 1000
# }
```

Latency percentiles are computed from the last 1000 requests; counters include failure paths so the dashboard reflects true reliability.
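The rolling-window percentile computation can be sketched like this (nearest-rank percentiles over a bounded deque; a simplification of whatever Soup does internally):

```python
from collections import deque

class LatencyWindow:
    """Rolling window of the last 1000 request latencies with percentiles."""

    def __init__(self, maxlen=1000):
        self._samples = deque(maxlen=maxlen)  # old samples fall off the left

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        ordered = sorted(self._samples)
        if not ordered:
            return 0.0
        # nearest-rank percentile over the retained window
        idx = min(len(ordered) - 1, int(p * len(ordered) / 100))
        return ordered[idx]
```

A bounded deque means memory stays flat no matter how long the server runs, at the cost of percentiles that only reflect recent traffic.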

OpenTelemetry tracing

Emit per-request spans to your OTLP collector.

```bash
pip install opentelemetry-sdk opentelemetry-exporter-otlp

soup serve --model ./output \
  --trace --trace-endpoint http://localhost:4317
```

OTLP endpoint hardening mirrors `HF_ENDPOINT`: scheme allowlist, plain HTTP only for loopback, and RFC1918 / link-local / 0.0.0.0 addresses rejected via `ipaddress.ip_address`. A missing SDK is a no-op with a warning.
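Those checks amount to a short validator. A sketch for IP-literal hosts (hostname resolution and the exact error strings are out of scope; the function name is hypothetical):

```python
import ipaddress
from urllib.parse import urlsplit

def validate_otlp_endpoint(url: str) -> None:
    """Reject unsafe OTLP endpoints: non-http(s) schemes, plain HTTP off
    loopback, and private / link-local / unspecified addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("scheme must be http or https")
    addr = ipaddress.ip_address(parts.hostname)
    if parts.scheme == "http" and not addr.is_loopback:
        raise ValueError("plain HTTP allowed only for loopback")
    if addr.is_private and not addr.is_loopback:
        raise ValueError("RFC1918 / reserved addresses rejected")
    if addr.is_link_local or addr.is_unspecified:
        raise ValueError("link-local / 0.0.0.0 rejected")
```

So `http://127.0.0.1:4317` passes, while `http://8.8.8.8:4317` (cleartext off-box) and `https://10.0.0.5:4317` (RFC1918) are both rejected.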

DeepSpeed-MII backend

```bash
soup serve --model ./output --backend mii
```

Loopback-only CORS, `max_tokens` capped at 16384, streaming disabled (no SSE for MII v0.x). Pipeline crashes return a generic 500 with no stack-trace leak.
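The request-side guards reduce to a small normalisation step. A sketch under the limits stated above (function name is hypothetical):

```python
MAX_TOKENS_CAP = 16384  # MII backend hard cap

def prepare_mii_request(params: dict) -> dict:
    """Refuse streaming and clamp max_tokens before handing off to MII."""
    if params.get("stream"):
        raise ValueError("streaming is not supported on the MII backend")
    out = dict(params)
    out["max_tokens"] = min(int(params.get("max_tokens", MAX_TOKENS_CAP)),
                            MAX_TOKENS_CAP)
    return out
```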

Auto-quant picker

```bash
soup serve --model ./output --auto-quant
```

The picker API is registered; if no candidate clears `min_score`, evaluation falls back softly to the highest-scored candidate so the server still binds.
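The soft-fallback selection is simple to state in Python (a sketch with hypothetical names; candidates are `(name, score)` pairs):

```python
def pick_quant(candidates, min_score):
    """Return the best candidate clearing min_score; if none clears it,
    fall back to the highest-scored candidate so serving can still start."""
    if not candidates:
        raise ValueError("no quantisation candidates registered")
    passing = [c for c in candidates if c[1] >= min_score]
    pool = passing or candidates  # soft-fallback path
    return max(pool, key=lambda c: c[1])[0]
```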