Data Forge & Quality Moat (v0.47.0)
soup data forge — synthetic data pipeline
Chunk → judge → active-prune → JSONL with provenance.
```bash
soup data forge --input ./docs --output ./synth.jsonl
```

Each row carries provenance (source doc, chunk offset, judge score).
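The chunk → judge → prune → JSONL flow can be sketched roughly as follows. This is an illustrative outline only, not the tool's actual implementation: the function names, the field names (`source_doc`, `chunk_offset`, `judge_score`), the fixed-size character chunking, and the toy judge are all assumptions, and the prune step is reduced here to a simple score threshold.

```python
import json
from typing import Callable, Dict, Iterator, List, Tuple


def chunk_text(text: str, size: int = 200) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs of at most `size` characters."""
    for offset in range(0, len(text), size):
        yield offset, text[offset:offset + size]


def toy_judge(chunk: str) -> float:
    """Placeholder judge (hypothetical): fraction of alphabetic characters."""
    if not chunk:
        return 0.0
    return sum(c.isalpha() for c in chunk) / len(chunk)


def forge_rows(
    doc_name: str,
    text: str,
    judge: Callable[[str], float],
    min_score: float = 0.5,
    size: int = 200,
) -> List[Dict[str, object]]:
    """Chunk a document, score each chunk, and keep rows at or above min_score."""
    rows = []
    for offset, chunk in chunk_text(text, size):
        score = judge(chunk)
        if score < min_score:
            continue  # prune low-scoring chunks
        rows.append({
            "text": chunk,
            # Provenance fields; the names are assumed for illustration.
            "source_doc": doc_name,
            "chunk_offset": offset,
            "judge_score": score,
        })
    return rows


def to_jsonl(rows: List[Dict[str, object]]) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)
```

Carrying the offset and score on every row means a pruned or suspicious example can be traced back to the exact span of the source document that produced it.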
soup data score — composite quality scorecard
PII + toxicity + langdetect + educational + decontamination.
```bash
soup data score ./train.jsonl --output ./scored.jsonl
```

Each filter is also addressable individually:
- soup data decontaminate: drop rows overlapping public benchmarks (n-gram heuristic)
- soup data toxicity: keyword baseline today; Llama-Guard backend v0.47.1
- soup data langdetect: 2-letter language code per row
- soup data pii: flag email / phone / SSN / credit-card patterns
- soup data educational: educational-value score per row [0, 1]
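To make two of these filters concrete, here is a minimal sketch of an n-gram overlap check and a regex-based PII flagger. It is a hedged illustration, not the tool's code: the whitespace tokenization, the default n-gram size, the two regex patterns, and all function names are assumptions chosen for brevity.

```python
import re
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(
    row_text: str, benchmark_ngrams: Set[Tuple[str, ...]], n: int = 8
) -> bool:
    """True if the row shares at least one n-gram with the benchmark set."""
    return not ngrams(row_text, n).isdisjoint(benchmark_ngrams)


# Deliberately simple patterns (assumed); real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def pii_flags(text: str) -> List[str]:
    """Return the sorted names of the PII patterns found in `text`."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))
```

For example, with `bench = ngrams("the quick brown fox jumps over the lazy dog", n=3)`, the call `is_contaminated("He said the quick brown fox ran", bench, n=3)` returns `True`, and `pii_flags("mail bob@example.com, SSN 123-45-6789")` returns `["email", "ssn"]`. A single shared n-gram is a strict criterion; a smaller `n` catches more overlap at the cost of more false positives.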