Data Formats

Soup supports these formats. Files can be JSONL, JSON, CSV, Parquet, or TXT. Format is auto-detected from the first row.

Alpaca

json
{"instruction": "Explain gravity", "input": "", "output": "Gravity is..."}

ShareGPT

json
{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]}

ChatML

json
{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}

DPO / ORPO / SimPO / IPO (preference pairs)

json
{"prompt": "Explain gravity", "chosen": "Gravity is a force...", "rejected": "I don't know"}

KTO (unpaired preferences)

json
{"prompt": "Explain gravity", "completion": "Gravity is a force...", "label": true}

LLaVA (vision)

json
{"image": "photo.jpg", "conversations": [{"from": "human", "value": "<image>\nDescribe this."}, {"from": "gpt", "value": "A cat."}]}

ShareGPT4V (vision)

json
{"image": "chart.png", "conversations": [{"from": "human", "value": "<image>\nExplain this chart."}, {"from": "gpt", "value": "Revenue growth."}]}

Plaintext (pre-training)

json
{"text": "Raw text document for continued pre-training..."}

Or use .txt files directly (one document per line).

Embedding

json
{"anchor": "What is Python?", "positive": "Python is a programming language."}
{"anchor": "What is Python?", "positive": "A programming language.", "negative": "A type of snake."}

Audio

json
{"audio": "recording.wav", "messages": [{"role": "user", "content": "Transcribe."}, {"role": "assistant", "content": "Hello world."}]}

All data normalizes internally to {"messages": [...]} structure (+ "image" for vision, "audio" for audio).