Synthetic Dataset Generation Pipeline
Point at a dataset, pick a model, hit run. Mimesis turns raw seeds into training data — batched, parallel, and ready to push to HuggingFace.
Mimesis is a local-first, full-stack tool for generating synthetic LLM training data at scale. You bring seed rows (a CSV or a HuggingFace dataset), write a system instruction, pick a provider and model, tune your batching, and let the pipeline do the rest — with real-time WebSocket progress, graceful error handling, and optional push to the HF Hub when done.
- Any seed source — CSV upload or HuggingFace dataset (public or private via token)
- Multi-column mapping — seed column, optional context column, optional per-row instruction column
- 7 LLM providers out of the box — Anthropic, OpenAI, Gemini, DeepSeek, Together AI, OpenRouter, vLLM
- 55+ hardcoded models — Claude 4.x/3.7/3.5, GPT-4o/4.5/o-series, Gemini 2.5/2.0/1.5, Llama 3.x, Qwen, Mistral, and more
- Live model fetch — pull the live model list directly from any provider's API with one click
- Test connection — validate credentials with a single dummy request before running
- Custom model ID — override the catalogue with any model ID (fine-tunes, new releases)
- vLLM support — local deployments via OpenAI-compatible API; auto-discover loaded models from /v1/models
- GPU RPM estimator — estimate sustainable RPM from hardware specs (VRAM, GPU count, quantization)
- Parallel workers — configurable concurrency with a shared token-bucket rate limiter
- Preview mode — generate N samples for human review before committing to a full run
- Real-time dashboard — WebSocket-driven progress bar, throughput chart, and live sample feed
- Batch CSV saving — flush to disk every N samples; safe to restart on failure
- HuggingFace push — upload the merged dataset to the HF Hub when complete
- Dark / light mode — full theme support
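The shared token-bucket limiter behind the parallel workers can be sketched in a few lines of asyncio. This is a minimal illustration of the technique, not the actual `rate_limiter.py`:

```python
import asyncio
import time

class TokenBucket:
    """Minimal shared token-bucket limiter: at most `rpm` acquisitions per minute."""

    def __init__(self, rpm: int):
        self.capacity = float(rpm)
        self.tokens = float(rpm)          # start full
        self.refill_rate = rpm / 60.0     # tokens added per second
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()       # one bucket shared by all workers

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.refill_rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for the next token to accrue.
                await asyncio.sleep((1 - self.tokens) / self.refill_rate)

async def demo() -> float:
    bucket = TokenBucket(rpm=600)         # 10 requests/second
    for _ in range(5):
        await bucket.acquire()            # each worker awaits a token before its LLM call
    return bucket.tokens

remaining = asyncio.run(demo())
```

Because every worker awaits the same bucket, concurrency can be tuned independently of the provider's RPM cap.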
Configure your input data source — upload a CSV or load any HuggingFace dataset. Map columns to seed, context, and instruction fields. Write the system instruction that will be applied to every row.
Select a provider and model, enter your API key, test the connection, and tune generation parameters. Switch to vLLM for local deployments — the GPU estimator calculates your max sustainable RPM from hardware specs. Live-fetch the full model list from any provider's API with a single click, or type in a custom model ID for fine-tunes and new releases.
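Model auto-discovery for vLLM relies on the server's OpenAI-compatible /v1/models endpoint. A minimal sketch (using the stdlib here rather than httpx, and assuming a local server on port 8000):

```python
import json
from urllib.request import urlopen

def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style list-models response:
    {"object": "list", "data": [{"id": "...", ...}, ...]}"""
    return [m["id"] for m in payload.get("data", [])]

def discover_vllm_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Ask a running vLLM server which models it currently has loaded."""
    with urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return parse_model_ids(json.load(resp))
```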
Set batch size, worker concurrency, and preview count. Configure the output directory and optional HuggingFace Hub push. The Summary card gives a live snapshot of your full configuration before you proceed.
Generate a small sample batch before committing to the full run. Inspect each row — seed input, context, instruction, and generated output — side by side. Approve to proceed or go back and adjust your settings.
Launch the full pipeline and watch it work. Real-time throughput chart, progress bar, elapsed and remaining time, and a live scrolling sample feed — all streamed over WebSocket.
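A client consuming that stream only needs to fold each JSON event into local state. A rough sketch, where the event field names (`type`, `completed`, `total`, `output`) are illustrative assumptions rather than the actual wire schema:

```python
import json

def apply_event(state: dict, raw: str) -> dict:
    """Fold one WebSocket message into a local progress snapshot.

    Assumed event shapes (illustrative only):
      {"type": "progress", "completed": int, "total": int}
      {"type": "sample", "output": str}
    """
    event = json.loads(raw)
    if event["type"] == "progress":
        state["completed"] = event["completed"]
        state["total"] = event["total"]
        state["pct"] = 100 * event["completed"] / event["total"]
    elif event["type"] == "sample":
        state.setdefault("feed", []).append(event["output"])
    return state

state: dict = {}
apply_event(state, '{"type": "progress", "completed": 25, "total": 100}')
apply_event(state, '{"type": "sample", "output": "generated text"}')
```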
Browse and download all generated files. Complete merged datasets and per-batch CSVs are listed with file sizes, timestamps, and one-click download buttons.
- Python 3.10+
- Node.js 18+
```bash
git clone https://github.com/youruser/mimesis.git
cd mimesis

# Backend
cd backend && pip install -r requirements.txt && cd ..

# Frontend
cd frontend && npm install && cd ..

./start.sh
```

| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Backend API | http://localhost:8000 |
| Interactive API docs | http://localhost:8000/docs |
Or start them individually:
```bash
# Terminal 1 — backend
cd backend && uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 — frontend
cd frontend && npm run dev
```

Mimesis ships with a comprehensive hardcoded catalogue (55+ models) and supports live-fetching the full model list directly from each provider's API.
| Provider | Models included | Live fetch |
|---|---|---|
| Anthropic | Claude Opus/Sonnet/Haiku 4.x · Claude 3.7 Sonnet · 3.5 Sonnet/Haiku · 3.0 Opus/Sonnet/Haiku | — |
| OpenAI | GPT-4o · GPT-4.5 · o1/o1-mini/o1-preview · o3/o3-mini · o4-mini · GPT-4 Turbo | ✓ |
| Google Gemini | Gemini 2.5 Pro/Flash · 2.0 Flash/Flash-Lite/Thinking/Pro · 1.5 Pro/Flash/Flash-8B | ✓ |
| DeepSeek | DeepSeek V3 (Chat) · R1 (Reasoner) · Coder V2 | ✓ |
| Together AI | Llama 3.1/3.2/3.3 · DeepSeek V3/R1 · Qwen 2.5 · Mixtral 8×7B/8×22B · Gemma 2 · Sky-T1 | ✓ |
| OpenRouter | 24 curated routes + full live catalogue (200+ models) | ✓ |
| vLLM | Any model — auto-discovered via /v1/models on your server | ✓ |
| Provider | Default (Tier 1) | Higher tiers |
|---|---|---|
| Anthropic | 60 RPM | up to 8,000 (Tier 4) |
| OpenAI | 500 RPM | up to 10,000 (Tier 3) |
| Gemini | 15 RPM (free) | 2,000 (Flash paid) · 360 (Pro paid) |
| DeepSeek | 60 RPM | — |
| Together | 60 RPM (free) | 600 (paid) |
| OpenRouter | 200 RPM | — |
| vLLM | auto-calculated | from GPU hardware specs |
```
mimesis/
├── backend/
│   ├── main.py                  # FastAPI entry point
│   ├── config.py                # Model catalogue + rate limits
│   ├── requirements.txt
│   ├── core/
│   │   ├── pipeline.py          # Async generation engine
│   │   ├── seed_loader.py       # CSV / HuggingFace loader
│   │   ├── rate_limiter.py      # Token-bucket rate limiter
│   │   ├── output_manager.py    # CSV saving + HF Hub push
│   │   └── providers/
│   │       ├── base.py          # BaseProvider abstract class
│   │       ├── anthropic_provider.py
│   │       ├── openai_provider.py
│   │       ├── gemini_provider.py
│   │       ├── deepseek_provider.py
│   │       ├── together_provider.py
│   │       ├── openrouter_provider.py
│   │       └── vllm_provider.py
│   └── api/
│       ├── websocket.py         # WebSocket connection manager
│       └── routes/
│           ├── seeds.py         # /api/seeds/*
│           ├── providers.py     # /api/providers/*
│           └── pipeline.py      # /api/pipeline/*
│
├── frontend/
│   └── src/
│       ├── App.tsx              # Router + layout shell
│       ├── types/index.ts       # Shared TypeScript types
│       ├── config/theme.ts      # Design tokens (colours, radii)
│       ├── store/pipelineStore.ts  # Zustand global state
│       ├── hooks/
│       │   ├── useApi.ts        # Typed REST API helpers
│       │   └── useWebSocket.ts  # Real-time event handler
│       ├── components/ui/       # Design-system primitives
│       └── pages/
│           ├── LandingPage.tsx
│           ├── SeedsPage.tsx
│           ├── ProviderPage.tsx
│           ├── PipelinePage.tsx
│           ├── PreviewPage.tsx
│           ├── RunPage.tsx
│           └── OutputPage.tsx
│
├── outputs/                     # Generated CSVs saved here
├── docs/
│   └── screenshots/             # UI screenshots (used in this README)
└── start.sh
```
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/providers/catalogue | Model catalogue + rate limit info |
| POST | /api/providers/fetch-models | Live-fetch models from a provider's API |
| POST | /api/providers/test-connection | Validate credentials with a dummy request |
| POST | /api/providers/vllm/calculate | Estimate RPM from GPU hardware specs |
| POST | /api/seeds/columns | Inspect columns from a CSV or HF dataset |
| POST | /api/seeds/upload | Upload a CSV file |
| POST | /api/pipeline/preview | Run a small preview batch |
| POST | /api/pipeline/start | Start a full pipeline run |
| POST | /api/pipeline/stop | Stop a running pipeline |
| GET | /api/pipeline/status | Poll pipeline status |
| GET | /api/pipeline/outputs | List generated output files |
| WS | /ws/{pipeline_id} | Real-time progress stream |
Full interactive docs at http://localhost:8000/docs when the backend is running.
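For scripting against the backend without the UI, any of the POST endpoints above can be driven with a plain JSON request. A sketch; the payload field shown is a placeholder assumption, so consult /docs for the real request schema:

```python
import json
from urllib.request import Request

def build_request(base: str, path: str, payload: dict) -> Request:
    """Build a JSON POST for the Mimesis API. The payload fields used
    below are illustrative assumptions; check /docs for the real schema."""
    return Request(f"{base}{path}",
                   data=json.dumps(payload).encode(),
                   headers={"Content-Type": "application/json"},
                   method="POST")

req = build_request("http://localhost:8000", "/api/pipeline/preview",
                    {"num_samples": 5})
# urllib.request.urlopen(req) would send it once the backend is running
```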
- Create `backend/core/providers/myprovider_provider.py`
- Extend `BaseProvider`, implement `async generate(prompt, system) -> str`
- Register in `backend/core/providers/__init__.py` → `REGISTRY`
- Add model list + rate limits to `backend/config.py` → `PROVIDER_MODELS`
- Add the provider name to `EXTERNAL_PROVIDERS` in `frontend/src/pages/ProviderPage.tsx`
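Put together, a new provider module might look like this sketch. The `BaseProvider` internals and constructor arguments shown here are assumptions (the real abstract class lives in `backend/core/providers/base.py`); only the `async generate(prompt, system) -> str` signature is taken from the steps above:

```python
import asyncio
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Stand-in for backend/core/providers/base.py (constructor args assumed)."""

    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    @abstractmethod
    async def generate(self, prompt: str, system: str) -> str: ...

class MyProvider(BaseProvider):
    """Skeleton for backend/core/providers/myprovider_provider.py."""

    async def generate(self, prompt: str, system: str) -> str:
        # A real provider would call its HTTP API here (e.g. via httpx)
        # and return the completion text; this stub just echoes its inputs.
        return f"[{self.model}] {system}: {prompt}"

demo = asyncio.run(MyProvider("sk-key", "my-model").generate("hello", "be terse"))
```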
| Layer | Technology |
|---|---|
| Backend | Python 3.10+, FastAPI, uvicorn, asyncio, httpx |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS |
| State management | Zustand |
| Charts | Recharts |
| Icons | Lucide React |
| LLM SDKs | anthropic, openai, google-generativeai |
MIT