Trace Ecosystem (v0.63.0)
Pull traces from any SaaS observability dashboard, mine prompts, sample the most uncertain rows, run sequential A/B with martingale-controlled Type-I error, and watch production for distributional drift — all from one CLI, all offline, no per-trace fees.
Five new commands, every one of them LIVE in v0.63.0.
`soup ingest` — universal trace importer
```bash
soup ingest --source langfuse --logs ~/Downloads/langfuse_export.jsonl \
  --output ./traces.jsonl
```

Six sources at launch:

| Source | Env var | Notes |
|---|---|---|
| `langfuse` | `LANGFUSE_KEY` | LangFuse dashboard export |
| `langsmith` | `LANGSMITH_API_KEY` | LangSmith API traces |
| `helicone` | `HELICONE_API_KEY` | Helicone observability |
| `openpipe` | `OPENPIPE_API_KEY` | OpenPipe production traces |
| `otel` | `OTEL_EXPORTER_OTLP_HEADERS` | OpenTelemetry OTLP |
| `openai-stored` | `OPENAI_API_KEY` | OpenAI Stored Completions |

No network calls — operators export from the SaaS dashboard, then soup ingest normalises the file. PII warning prints once per invocation.
Output schema (frozen TraceRecord with MappingProxyType metadata):
```json
{
  "trace_id": "trace_abc123",
  "prompt": "What is the capital of France?",
  "output": "The capital of France is Paris.",
  "source": "langfuse",
  "signal": "none",
  "metadata": {"user_id": "user_789", "timestamp": "2026-05-20T..."}
}
```

The output feeds directly into the v0.26 `soup data from-traces` command for preference-pair building.
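As an illustration of what "frozen `TraceRecord` with `MappingProxyType` metadata" can look like in Python, here is a minimal sketch. The field names mirror the JSON above; the `to_json` helper and the default values are assumptions made for the example and are not taken from the soup codebase.

```python
import json
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class TraceRecord:
    """Immutable normalised trace (illustrative sketch, not the shipped class)."""

    trace_id: str
    prompt: str
    output: str
    source: str
    signal: str = "none"
    metadata: Mapping[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Copy the caller's dict and expose it through a read-only view,
        # so neither the record nor its metadata can be mutated afterwards.
        object.__setattr__(self, "metadata", MappingProxyType(dict(self.metadata)))

    def to_json(self) -> str:
        # Hypothetical helper: serialise one record as a JSONL line.
        return json.dumps(
            {
                "trace_id": self.trace_id,
                "prompt": self.prompt,
                "output": self.output,
                "source": self.source,
                "signal": self.signal,
                "metadata": dict(self.metadata),
            }
        )
```

Freezing both the dataclass and the metadata mapping means a record can be passed between pipeline stages without defensive copies.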
`soup prune-prompt` — system-prompt mining
```bash
soup prune-prompt --input ./traces.jsonl --output ./traces_pruned.jsonl --min-frequency 0.95
```

Detects the longest shared system-prompt prefix across rows via binary search over up to 32 candidate lengths (v0.63 fixed an O(N²) early-exit bug from the prototype). Strips it from the training data so the fine-tuned model internalizes the boilerplate instead of repeating it, which is OpenPipe's signature trick.
Two-pass file read; capped at 100k rows to prevent DoS.
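The prefix search can be sketched as follows. This is a minimal illustration, not the shipped implementation: the function name `shared_prefix`, the character-level comparison, and the reading of `--min-frequency` as "fraction of rows that must share the prefix" are all assumptions. Binary search is valid here because the share of rows agreeing on a prefix can only shrink as the prefix gets longer.

```python
from collections import Counter
from typing import Sequence


def shared_prefix(prompts: Sequence[str], min_frequency: float = 0.95,
                  max_steps: int = 32) -> str:
    """Longest prefix shared by at least `min_frequency` of the prompts."""
    if not prompts:
        return ""
    need = min_frequency * len(prompts)
    lo, hi = 0, max(len(p) for p in prompts)  # length 0 is always feasible
    best = ""
    for _ in range(max_steps):  # at most 32 candidate lengths are probed
        if lo >= hi:
            break
        mid = (lo + hi + 1) // 2
        # Count the most common prefix of exactly `mid` characters.
        counts = Counter(p[:mid] for p in prompts if len(p) >= mid)
        prefix, count = counts.most_common(1)[0] if counts else ("", 0)
        if count >= need:
            lo, best = mid, prefix  # feasible: try a longer prefix
        else:
            hi = mid - 1            # infeasible: shrink the window
    return best


def strip_prefix(prompt: str, prefix: str) -> str:
    """Remove the mined prefix from a single prompt, if present."""
    return prompt[len(prefix):] if prefix and prompt.startswith(prefix) else prompt
```

Each probe costs one pass over the rows, so the whole search stays linear in the input size times the number of probes.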
`soup data active-sample` — uncertainty sampling
```bash
soup data active-sample --input ./prod_traces.jsonl --budget 200
```

Two modes, auto-detected from the JSONL schema:

- Single score: max-entropy on `rm_score` (peaks at 0.5).
- Dual scores: pairwise disagreement on the `rm_scores` list.

Output is a drop-in eval prompt set for human judging.
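The two scoring modes can be sketched as below. The row keys `rm_score` and `rm_scores` come from the description above; the function names, the use of binary entropy in bits, and the use of max minus min as the disagreement measure are assumptions chosen for illustration.

```python
import math
from typing import Any, Dict, List


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) outcome; maximal (1.0) at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty(row: Dict[str, Any]) -> float:
    """Score one row; the mode is picked from the keys present in the row."""
    if "rm_scores" in row:
        scores = row["rm_scores"]
        # Disagreement between reward models: spread of their scores.
        return max(scores) - min(scores) if len(scores) >= 2 else 0.0
    return binary_entropy(row["rm_score"])


def active_sample(rows: List[Dict[str, Any]], budget: int) -> List[Dict[str, Any]]:
    """Return the `budget` most uncertain rows, most uncertain first."""
    return sorted(rows, key=uncertainty, reverse=True)[:budget]
```

Rows the reward model is confident about (scores near 0 or 1) sink to the bottom, so the labeling budget goes to the cases where a human judgment adds the most information.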
`soup ab` — Wald-SPRT sequential A/B
```bash
soup ab --input ./ab_results.jsonl --metric judge_score \
  --alpha 0.05 --beta 0.20 --effect-size 0.1
```

Wald's SPRT (Sequential Probability Ratio Test) tracks a likelihood ratio that is a martingale under the null hypothesis, so Type-I error is controlled at every stopping time, not just at a fixed sample size. The test stops early when the log-likelihood ratio rises to A = log((1-β)/α) (reject H0) or falls to B = log(β/(1-α)) (accept H0). With α = 0.05 and β = 0.20 and natural logarithms, A ≈ 2.77 and B ≈ -1.56.
Input rows:
```json
{"arm": "control", "latency": 150.2}
{"arm": "treatment", "latency": 145.1}
```

Decisions: `reject_h0` / `accept_h0` / `continue`. v0.63 ships a CRITICAL fix for a sign error in the historical mSPRT implementation.
Three metrics at launch: latency, judge_score, retry_rate.
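The stopping rule described above can be sketched in a few lines. The boundaries A and B are the ones given in this section; everything else is an assumption made for the example: the Gaussian model with known standard deviation `sigma`, the use of per-pair treatment-minus-control differences, and the function name `sprt`. The likelihood model that `soup ab` actually uses is not stated here.

```python
import math
from typing import Iterable, Tuple


def sprt(diffs: Iterable[float], alpha: float = 0.05, beta: float = 0.20,
         effect_size: float = 0.1, sigma: float = 1.0) -> Tuple[str, int]:
    """Wald SPRT for H0: mean diff = 0 vs H1: mean diff = effect_size.

    Assumes diffs are i.i.d. Gaussian with known standard deviation `sigma`.
    Returns (decision, number of observations consumed).
    """
    upper = math.log((1.0 - beta) / alpha)  # A: cross upwards -> reject H0
    lower = math.log(beta / (1.0 - alpha))  # B: cross downwards -> accept H0
    llr = 0.0
    n = 0
    for n, d in enumerate(diffs, start=1):
        # Log-likelihood ratio increment for N(effect_size, sigma^2) vs N(0, sigma^2).
        llr += (effect_size / sigma**2) * (d - effect_size / 2.0)
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n
```

Because the boundaries depend only on α and β, the same loop works for any metric once its per-observation log-likelihood ratio is defined.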
`soup drift-alarm` — KL drift watch with webhooks
```bash
soup drift-alarm \
  --reference ./ft_output_dist.jsonl --live ./prod_output_dist.jsonl \
  --threshold 0.2 \
  --slack-url "https://hooks.slack.com/services/T.../B.../..."
```

Computes a rolling KL divergence between token-distribution snapshots taken at fine-tune time and in production. Surfaces both behavioral drift ("model now outputs JSON") and vocabulary drift ("same 20 phrases repeated"). Whitespace tokenization is the default; pluggable tokenizers ship in v0.63.1.
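To make the drift score concrete, here is a minimal sketch of a KL divergence over whitespace-tokenized text. The direction KL(live ‖ reference), the epsilon smoothing for unseen tokens, and the names `token_dist` and `kl_divergence` are assumptions for illustration and may differ from what `soup drift-alarm` does internally.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def token_dist(texts: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated token."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def kl_divergence(live: Dict[str, float], reference: Dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(live || reference) in nats over the union vocabulary."""
    vocab = set(live) | set(reference)
    # Add a small epsilon so tokens missing from one side do not divide by zero.
    p = {t: live.get(t, 0.0) + eps for t in vocab}
    q = {t: reference.get(t, 0.0) + eps for t in vocab}
    zp, zq = sum(p.values()), sum(q.values())
    return sum((p[t] / zp) * math.log((p[t] / zp) / (q[t] / zq)) for t in vocab)
```

A score of zero means the two snapshots match exactly; comparing the result against a threshold such as the 0.2 used in the command above decides whether to raise the alarm.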
Input rows (one per token):
```json
{"token": "json", "log_prob": -3.2}
```

Optional Slack / Discord webhooks fire on drift. Webhook URLs are SSRF-validated (loopback-only, RFC1918 / link-local / cloud-metadata IPs rejected). The command exits with code 3 on drift, for cron-friendly automation.
Numbers
+219 new tests in v0.63.0 (9816 → 10035). Security: 1 CRITICAL (mSPRT sign error → Wald SPRT), 2 HIGH, 3 MEDIUM, 2 LOW.
See also
- [Soup loop](/docs/soup-loop): the `HarvestFn` half of the v0.58 production flywheel is now powered by `soup ingest`.
- [Trace-to-preference](/docs/trace-to-preference): convert normalised traces into DPO / KTO pairs.
- [Eval design](/docs/eval-design): derive a goal-conditioned eval suite from the active-sampled prompts.