Data Formats
Soup supports these formats. Files can be JSONL, JSON, CSV, Parquet, or TXT. Format is auto-detected from the first row.
Alpaca
json
{"instruction": "Explain gravity", "input": "", "output": "Gravity is..."}ShareGPT
json
{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]}ChatML
json
{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}DPO / ORPO / SimPO / IPO (preference pairs)
json
{"prompt": "Explain gravity", "chosen": "Gravity is a force...", "rejected": "I don't know"}KTO (unpaired preferences)
json
{"prompt": "Explain gravity", "completion": "Gravity is a force...", "label": true}LLaVA (vision)
json
{"image": "photo.jpg", "conversations": [{"from": "human", "value": "<image>\nDescribe this."}, {"from": "gpt", "value": "A cat."}]}ShareGPT4V (vision)
json
{"image": "chart.png", "conversations": [{"from": "human", "value": "<image>\nExplain this chart."}, {"from": "gpt", "value": "Revenue growth."}]}Plaintext (pre-training)
json
{"text": "Raw text document for continued pre-training..."}Or use .txt files directly (one document per line).
Embedding
json
{"anchor": "What is Python?", "positive": "Python is a programming language."}
{"anchor": "What is Python?", "positive": "A programming language.", "negative": "A type of snake."}Audio
json
{"audio": "recording.wav", "messages": [{"role": "user", "content": "Transcribe."}, {"role": "assistant", "content": "Hello world."}]}All data normalizes internally to {"messages": [...]} structure (+ "image" for vision, "audio" for audio).