Inference Server

Deploy fine-tuned models as an OpenAI-compatible API server.

Transformers Backend

```bash
pip install 'soup-cli[serve]'
soup serve --model ./output --port 8000
```

A simple HTTP server built on HuggingFace Transformers. Good for testing and low-traffic use.
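
Once the server is up, a quick smoke test against the health and model-listing endpoints (a minimal sketch; assumes the requests package and the port from the command above):

```python
import requests

BASE = "http://localhost:8000"

# Health check: returns 200 once the model has loaded.
print(requests.get(f"{BASE}/health").status_code)

# List the models the server exposes.
print(requests.get(f"{BASE}/v1/models").json())
```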

vLLM Backend (2-4x Faster)

```bash
pip install 'soup-cli[serve-fast]'
soup serve --model ./output --backend vllm

# Multi-GPU with tensor parallelism
soup serve --model ./output --backend vllm --tensor-parallel 2

# Control GPU memory usage
soup serve --model ./output --backend vllm --gpu-memory 0.8
```

Recommended for production. vLLM uses PagedAttention to batch many concurrent requests for high throughput.
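
Because requests are continuously batched, throughput scales with concurrency. A sketch of firing parallel requests with the OpenAI SDK (the prompts and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(prompt: str) -> str:
    # Each call becomes one sequence in vLLM's continuous batch.
    resp = client.chat.completions.create(
        model="output",  # adjust to your served model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

prompts = [f"Summarize item {i} in one sentence." for i in range(16)]
with ThreadPoolExecutor(max_workers=16) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```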

SGLang Backend

```bash
pip install 'soup-cli[sglang]'
soup serve --model ./output --backend sglang

# Multi-GPU
soup serve --model ./output --backend sglang --tensor-parallel 2
```

An alternative high-throughput backend. SGLang's RadixAttention caches and reuses shared prompt prefixes across requests.
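
Prefix caching pays off when many requests share a long system prompt: after the first call, later calls skip recomputing the shared prefix. A sketch (the prompt content is illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# A long system prompt shared by every request; the longer the shared
# prefix, the more work the radix cache saves on subsequent calls.
SYSTEM = "You are a concise support agent for Acme Corp. ..."

for question in ["How do I reset my password?", "Where is my invoice?"]:
    resp = client.chat.completions.create(
        model="output",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```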

Speculative Decoding (2-3x Faster Generation)

```bash
# Transformers backend
soup serve --model ./output --speculative-decoding small-draft-model --spec-tokens 5

# vLLM backend
soup serve --model ./output --backend vllm --speculative-decoding small-draft-model
```
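
The draft model proposes several candidate tokens per step (--spec-tokens controls how many) and the main model verifies them in a single forward pass, so output quality is unchanged; the draft model typically must share the base model's tokenizer. To sanity-check the speedup, time a streamed response and compare against the same server started without the flag (a rough sketch; streamed chunks only approximate tokens):

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="output",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    stream=True,
)
for chunk in stream:
    # Count streamed chunks as a rough proxy for generated tokens.
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.time() - start
print(f"~{chunks / elapsed:.1f} chunks/sec")
```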

Multi-Adapter Serving (v0.22.0+)

Serve multiple LoRA adapters on a single base model:

```bash
soup serve --model ./base --adapters chat=./adapters/chat code=./adapters/code
```

Switch adapters per request via the model field:

```json
{"model": "chat", "messages": [{"role": "user", "content": "Hello!"}]}
```

API Endpoints

All backends expose the same OpenAI-compatible API:

  • POST /v1/chat/completions — chat completions (streaming supported)
  • GET /v1/models — list available models
  • GET /health — health check
Example request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "output",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

The server also works with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="output",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
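
Streaming (noted above as supported) works through the same API; set stream=True and iterate over the chunks:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
stream = client.chat.completions.create(
    model="output",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's reply.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```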

> Note: max_tokens is capped at 16,384 per request. Error details are never exposed in HTTP responses.