# Data Tools

Soup includes powerful CLI tools for preparing training datasets.

## Inspect

```bash
soup data inspect ./data/train.jsonl
```

Shows dataset statistics: sample count, token distribution, and field analysis. For vision datasets, automatically shows image statistics (count, formats, missing files).

## Validate

```bash
soup data validate ./data/train.jsonl
soup data validate ./data/train.jsonl --format alpaca
```

Checks for missing fields, encoding issues, and format compliance. Auto-detects the format when `--format` is not specified.
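The three supported formats are commonly distinguished by their top-level keys, so auto-detection can be done per record. A minimal sketch of that idea, assuming the usual community conventions for each format (this is illustrative, not soup's actual detector):

```python
def detect_format(record: dict) -> str:
    """Guess a dataset record's format from its top-level keys.

    Heuristic only; assumes the common key conventions for each format.
    """
    if "conversations" in record:        # sharegpt: [{"from": ..., "value": ...}]
        return "sharegpt"
    if "messages" in record:             # chatml: [{"role": ..., "content": ...}]
        return "chatml"
    if "instruction" in record and "output" in record:
        return "alpaca"                  # alpaca: instruction / input / output
    return "unknown"

print(detect_format({"instruction": "Add 2+2", "input": "", "output": "4"}))  # prints "alpaca"
```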

## Convert

```bash
soup data convert ./data/train.jsonl --to sharegpt --output converted.jsonl
```

Converts between the `alpaca`, `sharegpt`, and `chatml` formats.
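As an example of what such a conversion does, here is a sketch of one alpaca record becoming a sharegpt conversation, using the common community field names (soup's own converter may handle more edge cases):

```python
def alpaca_to_sharegpt(rec: dict) -> dict:
    """Convert one alpaca record (instruction/input/output) to the
    sharegpt conversation shape. Illustrative sketch only."""
    prompt = rec["instruction"]
    if rec.get("input"):
        # The optional "input" field is appended to the instruction.
        prompt += "\n\n" + rec["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": rec["output"]},
        ]
    }
```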

## Merge

```bash
soup data merge data1.jsonl data2.jsonl --output merged.jsonl --shuffle
```

Combines multiple datasets, with optional shuffling.
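Conceptually this is concatenation plus a seeded shuffle over JSONL files. A minimal sketch (the function names and seed handling are assumptions, not soup's implementation):

```python
import json
import random

def merge_rows(datasets, shuffle=False, seed=0):
    """Concatenate lists of samples; optionally shuffle with a fixed seed
    so merges are reproducible."""
    rows = [r for d in datasets for r in d]
    if shuffle:
        random.Random(seed).shuffle(rows)
    return rows

def merge_jsonl(paths, output, shuffle=False, seed=0):
    """File-level wrapper: read each JSONL input, merge, write one output."""
    datasets = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            datasets.append([json.loads(line) for line in f if line.strip()])
    rows = merge_rows(datasets, shuffle=shuffle, seed=seed)
    with open(output, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return len(rows)
```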

## Deduplicate

```bash
# Requires: pip install 'soup-cli[data]'
soup data dedup ./data/train.jsonl --threshold 0.8
```

Removes near-duplicate samples using MinHash similarity.
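MinHash estimates the Jaccard similarity between two texts' shingle sets by comparing per-hash minimums; samples whose estimated similarity exceeds the threshold are treated as duplicates. A pure-Python sketch of the idea (soup likely uses a library such as `datasketch`, and a production version would use LSH instead of this O(n²) loop):

```python
import hashlib

def shingles(text, n=3):
    """Word n-gram shingles of a text, as a set for Jaccard comparison."""
    toks = text.lower().split()
    if len(toks) <= n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash(shingle_set, num_perm=64):
    """MinHash signature: for each keyed hash function, the minimum
    hash value over all shingles."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), key=i.to_bytes(8, "big"),
                                digest_size=8).digest(),
                "big",
            )
            for s in shingle_set
        )
        for i in range(num_perm)
    ]

def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts, threshold=0.8):
    """Keep a text only if it is below `threshold` similarity to all kept texts."""
    kept, sigs = [], []
    for t in texts:
        sig = minhash(shingles(t))
        if all(similarity(sig, s) < threshold for s in sigs):
            kept.append(t)
            sigs.append(sig)
    return kept
```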

## Extended Statistics

```bash
soup data stats ./data/train.jsonl
```

Shows length distributions with histograms, token counts, and language detection.

## Synthetic Data Generation

```bash
# Generate using the OpenAI API
soup data generate --prompt "Create math word problems" --count 100 --format alpaca

# Use a different model
soup data generate --prompt "Medical Q&A pairs" --model gpt-4o --count 500

# Deduplicate against existing data
soup data generate --prompt "..." --count 200 --dedup-with existing.jsonl

# Use seed examples to guide style
soup data generate --prompt "..." --seed examples.jsonl --count 100

# Use a local server (soup serve, Ollama, etc.)
soup data generate --prompt "..." --provider server --api-base http://localhost:11434/v1
```

### Multi-Provider Support (v0.20.0+)

```bash
# Generate via a local Ollama instance
soup data generate --prompt "..." --provider ollama --model llama3.1
soup data generate --prompt "..." --ollama-model llama3.1  # shorthand

# Generate via the Anthropic Claude API (set the ANTHROPIC_API_KEY env var)
soup data generate --prompt "..." --provider anthropic --model claude-3-haiku-20240307

# Generate via a local vLLM server
soup data generate --prompt "..." --provider vllm --model meta-llama/Llama-3.1-8B-Instruct
```

### Domain Templates (v0.20.0+)

```bash
# Code instruction pairs (Python, JS, Go, Rust, Java)
soup data generate --prompt "..." --template code --language Python --task-type function

# Multi-turn conversations
soup data generate --prompt "..." --template conversation --turns 6 --topic "science"

# QA pairs from a context document
soup data generate --prompt "..." --template qa --context document.txt

# Preference data (DPO/KTO/ORPO)
soup data generate --prompt "..." --template preference --pref-task dpo

# Chain-of-thought reasoning (GRPO)
soup data generate --prompt "..." --template reasoning --domain math
```

### Quality Pipeline (v0.20.0+)

```bash
# Auto-validate after generation (remove malformed entries)
soup data generate --prompt "..." --validate

# Auto-filter by quality (coherence scoring)
soup data generate --prompt "..." --filter

# Auto-dedup (MinHash; requires: pip install 'soup-cli[data]')
soup data generate --prompt "..." --dedup

# Full quality pipeline: validate + filter + dedup
soup data generate --prompt "..." --quality-pipeline
```

## Quality Filter

```bash
# Filter by coherence score
soup data filter ./data/train.jsonl --coherence 0.3

# Filter by perplexity and coherence
soup data filter ./data/train.jsonl --perplexity 500 --coherence 0.3

# Add scores without removing samples
soup data filter ./data/train.jsonl --score-only
```

Uses perplexity and coherence scoring to identify low-quality samples.

## Data Sampling (v0.23.0+)

```bash
# Random sample
soup data sample ./data/train.jsonl --strategy random --count 1000

# Diverse sample (TF-IDF clustering)
soup data sample ./data/train.jsonl --strategy diverse --count 500

# Hard examples (by length)
soup data sample ./data/train.jsonl --strategy hard --count 500
```
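Toy versions of two of these strategies, to make the semantics concrete (a sketch under stated assumptions: the real `diverse` strategy involves TF-IDF clustering and is omitted here, and `hard` is shown with the length proxy the flag description names):

```python
import random

def sample(rows, strategy="random", count=100, seed=0):
    """Illustrative sampling strategies; not soup's implementation."""
    if strategy == "random":
        # Seeded uniform sample without replacement.
        return random.Random(seed).sample(rows, min(count, len(rows)))
    if strategy == "hard":
        # Length as a crude difficulty proxy: keep the longest samples.
        return sorted(rows, key=lambda r: len(str(r)), reverse=True)[:count]
    raise ValueError(f"unknown strategy: {strategy}")
```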

## Data Splitting (v0.23.0+)

```bash
# Split into train/val/test
soup data split ./data/train.jsonl --ratio 0.8,0.1,0.1

# Stratified split
soup data split ./data/train.jsonl --ratio 0.9,0.1 --stratify
```
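A ratio split of this kind boils down to a seeded shuffle followed by consecutive cuts, with the last split absorbing any rounding remainder. A minimal sketch (stratification is omitted; function and parameter names are assumptions):

```python
import random

def split_dataset(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle, then cut into consecutive splits sized by `ratios`."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    splits, start = [], 0
    for r in ratios[:-1]:
        end = start + round(r * len(rows))
        splits.append(rows[start:end])
        start = end
    splits.append(rows[start:])  # last split absorbs rounding remainder
    return splits
```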

## HuggingFace Dataset Hub (v0.24.0+)

```bash
# Search for datasets
soup data search "math reasoning"

# Preview remote dataset metadata
soup data preview tatsu-lab/alpaca

# Download to local JSONL
soup data download tatsu-lab/alpaca --output ./data/alpaca.jsonl --samples 1000
```

## Dataset Registry (v0.24.0+)

Register local datasets by name for use in `soup.yaml`:

```bash
# Register a dataset
soup data register my-chat-data --path ./data/chat.jsonl --format chatml

# List registered datasets
soup data registry

# Unregister a dataset
soup data unregister my-chat-data
```

Registered datasets can then be referenced in your config, e.g. `data.train: registry:my-chat-data`.
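Based on the `data.train: registry:my-chat-data` reference above, a config might look like this (a sketch; the surrounding key structure of `soup.yaml` is an assumption):

```yaml
# soup.yaml (sketch; exact schema may differ)
data:
  train: registry:my-chat-data
```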