Post-train X-rays (v0.66.0)

Mechanistic interpretability on every fine-tune. Four probe families surface what *changed inside* the model, not just what changed in its outputs. Purely descriptive: no auto-mitigation; designed for CI logging and model cards.

`soup probe sae-diff` — Sparse Autoencoder feature movement

```bash
soup probe sae-diff ./gemma_scope.safetensors ./pre_acts.json ./post_acts.json \
  --top-k 20 --output ./sae_diff.json
```

Encode pre-FT and post-FT activation batches through a Sparse Autoencoder; report the top-K features whose mean activation moved the most. Bundled SAE-repo allowlist (no auto-download):

  • Gemma Scope (2B / 9B / 27B residual-stream)
  • EleutherAI Pythia SAEs
  • JBloomAus Llama SAEs
  • OpenAI GPT-2 SAE

Bounds: top-k [1, 10K], up to 1M features, up to 1M tokens per batch, 16 MiB evidence cap.
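The feature-movement computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: it assumes a plain ReLU encoder (`relu(x @ W_enc + b_enc)`), whereas real SAEs such as Gemma Scope may use other activations, and the names `encode`, `mean_per_feature`, and `sae_diff` are hypothetical.

```python
import heapq


def encode(acts, w_enc, b_enc):
    """Encode activations with an assumed ReLU SAE encoder.

    acts:  [tokens][d_model], w_enc: [d_model][n_features], b_enc: [n_features]
    """
    feats = []
    for x in acts:
        row = []
        for j in range(len(b_enc)):
            pre = b_enc[j] + sum(x[i] * w_enc[i][j] for i in range(len(x)))
            row.append(max(0.0, pre))  # ReLU
        feats.append(row)
    return feats


def mean_per_feature(feats):
    """Mean activation of each feature across all tokens in the batch."""
    n_tokens = len(feats)
    return [sum(col) / n_tokens for col in zip(*feats)]


def sae_diff(pre_acts, post_acts, w_enc, b_enc, top_k=20):
    """Return (feature_index, post_mean - pre_mean) for the top_k largest |shifts|."""
    pre_mean = mean_per_feature(encode(pre_acts, w_enc, b_enc))
    post_mean = mean_per_feature(encode(post_acts, w_enc, b_enc))
    deltas = [(j, post - pre) for j, (pre, post) in enumerate(zip(pre_mean, post_mean))]
    return heapq.nlargest(top_k, deltas, key=lambda d: abs(d[1]))


# Toy SAE: 2-dimensional activations, 3 features.
W_ENC = [[1.0, 0.0, 1.0],
         [0.0, 1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
PRE = [[1.0, 0.0], [1.0, 0.0]]    # pre-fine-tune activation batch
POST = [[0.0, 2.0], [0.0, 2.0]]   # post-fine-tune activation batch

top = sae_diff(PRE, POST, W_ENC, B_ENC, top_k=1)
```

Ranking by absolute shift means a feature that switches off after fine-tuning is reported alongside one that switches on; the sign of each delta says which way it moved.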

`soup probe sleeper` — defection-agent classifier

```bash
soup probe sleeper llama-3-8b --evidence ./activations.json --output ./sleeper.json
```

Calibrated linear defection probe (per-base, deterministic) applied to a 2D activation tensor. Reports flagged-token rate and verdict:

| Flagged rate | Verdict |
|---|---|
| ≤1% | OK |
| ≤5% | MINOR |
| >5% | MAJOR |

Bundled-base allowlist (Llama-3-8B, Gemma-2-9B, …). With no evidence, the probe returns an OK report with 0 tokens (matching the v0.56 neutral-mode policy). A 16 MiB cap applies, and symlinks are rejected.
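As a rough sketch of how a linear probe and the verdict thresholds fit together, the following Python applies a weight vector to each token's activation row and maps the flagged-token rate to a verdict. The probe weights, the `threshold` parameter, and the report field names are assumptions made for illustration; only the 1% / 5% cut-offs and the no-evidence behaviour come from the text above.

```python
def verdict_for(rate):
    """Map a flagged-token rate (0.0 to 1.0) to the documented verdict."""
    if rate <= 0.01:
        return "OK"
    if rate <= 0.05:
        return "MINOR"
    return "MAJOR"


def sleeper_report(activations, weights, bias=0.0, threshold=0.0):
    """Score each token row with a linear probe and summarise the result.

    activations: 2D list [tokens][d_model]; weights: [d_model].
    A token is flagged when w . x + bias exceeds the (assumed) threshold.
    """
    n_tokens = len(activations)
    if n_tokens == 0:
        # No evidence: neutral OK report with 0 tokens.
        return {"tokens": 0, "flagged_rate": 0.0, "verdict": "OK"}
    flagged = sum(
        1
        for row in activations
        if sum(w * x for w, x in zip(weights, row)) + bias > threshold
    )
    rate = flagged / n_tokens
    return {"tokens": n_tokens, "flagged_rate": rate, "verdict": verdict_for(rate)}


# 100 tokens, 3 of which point along the probe direction.
ACTS = [[1.0, 0.0]] * 3 + [[-1.0, 0.0]] * 97
report = sleeper_report(ACTS, weights=[1.0, 0.0])
```

Here 3 of 100 tokens are flagged, so the rate is 3% and the verdict is MINOR.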

`soup probe interference` — N×N adapter compatibility matrix

```bash
soup probe interference ./losses.json --output ./matrix.json
```

Input is operator-measured per-pair losses; output is an N×N catastrophic-interference matrix:

score(A → B) = (loss(A_target | A+B) - loss(A_target | A alone)) / loss(A_target | A alone)

| Score | Verdict |
|---|---|
| <5% | OK |
| <20% | MINOR |
| ≥20% | MAJOR (exit 2, gates CI) |

Bounds: 2..16 adapters (4..256 pairs). Adapter names are limited to 256 characters and are markup-escaped before rendering to guard against injection.
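The score and verdict mapping can be worked through in Python as follows. This is a sketch under stated assumptions: the in-memory layout (`alone` keyed by adapter name, `joint` keyed by `(A, B)` pairs) is hypothetical and not the `losses.json` schema, the diagonal is set to 0 by assumption, and exit code 0 for a clean run is assumed (the text only specifies exit 2 on MAJOR).

```python
def verdict_for(score):
    """Map a relative loss increase to the documented verdict."""
    if score < 0.05:
        return "OK"
    if score < 0.20:
        return "MINOR"
    return "MAJOR"


def interference_matrix(alone, joint):
    """Build {(A, B): (score, verdict)} for every ordered adapter pair.

    alone[A]      : loss on A's target task with only adapter A loaded
    joint[(A, B)] : loss on A's target task with adapters A and B loaded
    """
    names = sorted(alone)
    matrix = {}
    for a in names:
        for b in names:
            if a == b:
                # Assumption: an adapter does not interfere with itself.
                matrix[(a, b)] = (0.0, "OK")
                continue
            score = (joint[(a, b)] - alone[a]) / alone[a]
            matrix[(a, b)] = (score, verdict_for(score))
    return matrix


def exit_code(matrix):
    """2 when any pair is MAJOR (gates CI); 0 otherwise (assumed)."""
    return 2 if any(v == "MAJOR" for _, v in matrix.values()) else 0


ALONE = {"math": 1.0, "code": 2.0}
JOINT = {("math", "code"): 1.3, ("code", "math"): 2.04}

matrix = interference_matrix(ALONE, JOINT)
code = exit_code(matrix)
```

In this toy case, adding `code` raises the `math` loss by 30% (MAJOR), while adding `math` raises the `code` loss by only 2% (OK). The matrix is not symmetric, which is why every ordered pair is scored.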

`soup probe pack` — bundled probes per base

```bash
soup probe pack llama-3-8b --output ./pack.json
soup probe pack --list
```

Per-base manifest of calibrated probes (sleeper / sae / truth / harm). Metadata only — no weights embedded (v0.66 ships schema; weights fetcher in v0.66.x). 1..32 probes per pack; per-field caps against operator-controlled-input bloat.

Live influence-function blame

Bundled in v0.66 alongside the probe family: a DataInf-style row-attribution runner that walks the training data, computes each example's influence on a target output, and ranks the most causal rows. It composes with the v0.67 `soup adapters bisect` command: bisect tells you *which checkpoint* broke; blame tells you *which rows* caused it.
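For intuition, a DataInf-style score can be computed from per-example gradients alone. The sketch below uses the closed-form approximation from the DataInf paper, where the inverse of the damped Gauss-Newton Hessian is replaced by an average of per-example rank-one inverses, so only dot products are needed. It treats all parameters as a single block, the damping value `lam` is an arbitrary choice, and the function names are hypothetical; the actual runner's internals are not specified here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def datainf_scores(train_grads, target_grad, lam=0.1):
    """Approximate influence of each training row on the target loss.

    score_k = -v^T H^{-1} g_k, with
    H^{-1} ~= 1/(n*lam) * sum_i (I - g_i g_i^T / (lam + |g_i|^2)).
    A negative score means upweighting row k lowers the target loss.
    """
    n = len(train_grads)
    v_dot = [dot(target_grad, g) for g in train_grads]   # v . g_i
    sq_norm = [dot(g, g) for g in train_grads]           # |g_i|^2
    scores = []
    for k, g_k in enumerate(train_grads):
        total = 0.0
        for i, g_i in enumerate(train_grads):
            total += v_dot[k] - v_dot[i] * dot(g_i, g_k) / (lam + sq_norm[i])
        scores.append(-total / (n * lam))
    return scores


def rank_rows(scores):
    """Row indices ordered by influence magnitude, largest first."""
    return sorted(range(len(scores)), key=lambda k: abs(scores[k]), reverse=True)


# Row 0's gradient aligns with the target gradient; row 1's is orthogonal.
TRAIN_GRADS = [[1.0, 0.0], [0.0, 1.0]]
TARGET_GRAD = [1.0, 0.0]

scores = datainf_scores(TRAIN_GRADS, TARGET_GRAD, lam=1.0)
ranking = rank_rows(scores)
```

Ranking by magnitude is one possible choice; a blame report could equally sort by signed score to separate rows that pushed the model toward the target output from rows that pushed it away.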

See also

  • [Diagnose](/docs/diagnose) — v0.56 6-probe report card; v0.66 adds 4 more probes on top.
  • [Adapter lifecycle](/docs/adapter-lifecycle) — v0.67 bisect uses v0.66 blame for row-level attribution.