Data Engineering Pro (v0.69.0)

Dataset prep stops being "throw a JSONL at the trainer" and becomes a first-class engineering discipline. dbt-shaped DAG, Great-Expectations suite, Magpie synth, Persona-Hub diversity, and the arXiv 2510.13928 brain-rot detector.

`soup build` — dbt-shaped DAG of dataset transforms

yaml
# manifest.yaml
models:
  - name: chats_raw
    kind: table
    sql: SELECT prompt, response FROM source.thumbs WHERE thumb = 'up'
  - name: chats_decontaminated
    kind: incremental
    refs: [chats_raw]
    sql: SELECT * FROM \`{chats_raw}\` WHERE NOT contaminated
  - name: chats_train
    kind: view
    refs: [chats_decontaminated]
    sql: SELECT * FROM \`{chats_decontaminated}\`
bash
soup build manifest.yaml --dry-run    # validates topology, exits 0 (LIVE today)
soup build manifest.yaml              # materialises (v0.69.1)
  • Closed SUPPORTED_MODEL_KINDS = {incremental, table, view}
  • Topo-sort via Kahn's algorithm
  • Re-tokenise only changed rows: compute_row_hash (SHA-256 over canonical-JSON, id field excluded) + incremental_diff(prev, new) → {added, changed, removed, unchanged}
  • DoS caps: _MAX_MODELS=256, _MAX_REFS_PER_MODEL=32, _MAX_FILE_BYTES=1 MiB
  • Live run_build materialiser (DuckDB / SQLite backend, live transform-resolver registry) ships in v0.69.1

`soup expect` — Great Expectations for chat data (LIVE)

yaml
# suite.yaml
expectations:
  - expect_no_pii
  - expect_token_length_between: {min: 32, max: 2048}
  - expect_no_refusal_pattern
  - expect_chosen_preferred_over_rejected_by_judge:
      judge: openai/gpt-4o-mini
      min_win_rate: 0.55
bash
soup expect ./data.jsonl ./suite.yaml
# exit 0 = pass; 2 = validation rejection; 3 = suite failure

Closed allowlist: expect_no_pii (reuses v0.47 Presidio), expect_token_length_between, expect_no_refusal_pattern (reuses v0.56 refusal detector), expect_chosen_preferred_over_rejected_by_judge (reuses v0.19 judge surface). Walks text, content, output, prompt, instruction, response top-level keys + messages[].content arrays. _MAX_SUITE_LEN=64. Drop into CI between soup data and soup train.

`soup data gen-magpie` — synthetic data via chat-template-prefix harvest

bash
soup data gen-magpie --base meta-llama/Llama-3-8B \
  --provider ollama --target 50000 \
  --output ./synth.jsonl

Magpie trick: prime the base model with just the assistant chat-template prefix and let it complete the prompt itself. Reuses the v0.20 provider stack (Ollama / Anthropic / vLLM). _MAX_TARGET_ROWS=1_000_000, _MAX_BASE_MODEL_LEN=512. Live run_magpie loop + v0.47 quality-filter chain integration ship in v0.69.1.

`soup data persona-mix` — Persona-Hub diversity sampler (LIVE)

bash
soup data persona-mix --prompts ./prompts.jsonl \
  --n 20000 --output ./diverse.jsonl
# Optional BYO: --personas tencent-200k.jsonl --styles styles.jsonl

Bundled 12 personas × 5 writing styles (BYO Tencent 200k corpus via --personas / --styles). Deterministic by seed (random.Random(seed)). compute_topic_diversity = Shannon entropy over pooled whitespace tokens. Atomic JSONL write + cwd-contained input + enforce_under_cwd_and_no_symlink on output. Caps: 100 MiB / 100k entries per loader.

`soup data brain-rot` — AI-slop detector (LIVE)

bash
soup data brain-rot ./data.jsonl --strict --max-major-fraction 0.25
# exit 3 if MAJOR fraction > 0.25

arXiv 2510.13928 brain-rot detector. Two pure-Python scorers:

  • score_triviality — token-diversity inversion + !! / ?? punctuation runs + low-effort token density + length penalty
  • score_popularity_signal — clickbait phrase scan + emoji U+1F300–U+1FAFF density

Worst-signal-wins: 1.0 − max(triviality, popularity). Bands match v0.26/v0.56/v0.65: OK ≥ 0.85, MINOR ≥ 0.60, else MAJOR. refuse_if_rotten raises when MAJOR fraction exceeds threshold. English-keyword-only in v0.69.0; multilingual lands in v0.69.1.

Cross-cutting hardening

Refactored 3 duplicate TOCTOU blocks behind a new shared paths.enforce_under_cwd_and_no_symlink helper (code-review CRITICAL fix). v0.70 + future releases reuse it.

Numbers

+262 tests in v0.69.0 (11,225 → 11,487) across 5 new test files. 3 POSIX-only symlink tests skip on Windows.

See also

  • [Anti-trend insurance (v0.68)](/docs/anti-trend-insurance) — the TOCTOU helper this consolidates was first lifted here.
  • [Loop hardening (v0.70)](/docs/loop-hardening) — soup expect gates data going into --reward-hack-detector runs.
  • [Eval depth (v0.65)](/docs/eval-depth) — same OK / MINOR / MAJOR taxonomy.