Synthetic Dataset Generation Pipeline
Point at a dataset, pick a model, hit run. Mimesis turns raw seeds into training data — batched, parallel, and ready to push to HuggingFace.
Mimesis is a local-first, full-stack tool for generating synthetic LLM training data at scale. You bring seed rows (a CSV or a HuggingFace dataset), write a system instruction, pick a provider and model, tune your batching, and let the pipeline do the rest — with real-time WebSocket progress, graceful error handling, and optional push to the HF Hub when done.
- Any seed source — CSV upload or HuggingFace dataset (public or private via token)
- Multi-column mapping — seed column, optional context column, optional per-row instruction column
- 7 LLM providers out of the box — Anthropic, OpenAI, Gemini, DeepSeek, Together AI, OpenRouter, vLLM
- 55+ hardcoded models — Claude 4.x/3.7/3.5, GPT-4o/4.5/o-series, Gemini 2.5/2.0/1.5, Llama 3.x, Qwen, Mistral, and more
- Live model fetch — pull the live model list directly from any provider's API with one click
- Test connection — validate credentials with a single dummy request before running
- Custom model ID — override the catalogue with any model ID (fine-tunes, new releases)
- vLLM support — local deployments via OpenAI-compatible API; auto-discover loaded models from /v1/models
- GPU RPM estimator — estimate sustainable RPM from hardware specs (VRAM, GPU count, quantization)
- Parallel workers — configurable concurrency with a shared token-bucket rate limiter
- Preview mode — generate N samples for human review before committing to a full run
- Real-time dashboard — WebSocket-driven progress bar, throughput chart, and live sample feed
- Batch CSV saving — flush to disk every N samples; safe to restart on failure
- HuggingFace push — upload the merged dataset to the HF Hub when complete
- Dark / light mode — full theme support
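The shared token-bucket limiter behind the parallel workers can be sketched in a few lines of asyncio. This is a minimal illustration of the technique, not the actual `rate_limiter.py`:

```python
import asyncio
import time

class TokenBucket:
    """Minimal shared token-bucket limiter: at most `rpm` acquisitions per minute."""

    def __init__(self, rpm: int):
        self.capacity = float(rpm)
        self.tokens = float(rpm)          # start full
        self.refill_rate = rpm / 60.0     # tokens added per second
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()       # one bucket shared by all workers

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.refill_rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for the next token to accrue.
                await asyncio.sleep((1 - self.tokens) / self.refill_rate)

async def demo() -> float:
    bucket = TokenBucket(rpm=600)         # 10 requests/second
    for _ in range(5):
        await bucket.acquire()            # each worker awaits a token before its LLM call
    return bucket.tokens

remaining = asyncio.run(demo())
```

Because every worker awaits the same bucket, concurrency can be tuned independently of the provider's RPM cap.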
Configure your input data source — upload a CSV or load any HuggingFace dataset. Map columns to seed, context, and instruction fields. Write the system instruction that will be applied to every row.
Select a provider and model, enter your API key, test the connection, and tune generation parameters. Switch to vLLM for local deployments — the GPU estimator calculates your max sustainable RPM from hardware specs. Live-fetch the full model list from any provider's API with a single click, or type in a custom model ID for fine-tunes and new releases.
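Model auto-discovery for vLLM relies on the server's OpenAI-compatible /v1/models endpoint. A minimal sketch (using the stdlib here rather than httpx, and assuming a local server on port 8000):

```python
import json
from urllib.request import urlopen

def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style list-models response:
    {"object": "list", "data": [{"id": "...", ...}, ...]}"""
    return [m["id"] for m in payload.get("data", [])]

def discover_vllm_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Ask a running vLLM server which models it currently has loaded."""
    with urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return parse_model_ids(json.load(resp))
```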
Set batch size, worker concurrency, and preview count. Configure the output directory and optional HuggingFace Hub push. The Summary card gives a live snapshot of your full configuration before you proceed.
Generate a small sample batch before committing to the full run. Inspect each row — seed input, context, instruction, and generated output — side by side. Approve to proceed or go back and adjust your settings.
Launch the full pipeline and watch it work. Real-time throughput chart, progress bar, elapsed and remaining time, and a live scrolling sample feed — all streamed over WebSocket.
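A client consuming that stream only needs to fold each JSON event into local state. A rough sketch, where the event field names (`type`, `completed`, `total`, `output`) are illustrative assumptions rather than the actual wire schema:

```python
import json

def apply_event(state: dict, raw: str) -> dict:
    """Fold one WebSocket message into a local progress snapshot.

    Assumed event shapes (illustrative only):
      {"type": "progress", "completed": int, "total": int}
      {"type": "sample", "output": str}
    """
    event = json.loads(raw)
    if event["type"] == "progress":
        state["completed"] = event["completed"]
        state["total"] = event["total"]
        state["pct"] = 100 * event["completed"] / event["total"]
    elif event["type"] == "sample":
        state.setdefault("feed", []).append(event["output"])
    return state

state: dict = {}
apply_event(state, '{"type": "progress", "completed": 25, "total": 100}')
apply_event(state, '{"type": "sample", "output": "generated text"}')
```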
Browse and download all generated files. Complete merged datasets and per-batch CSVs are listed with file sizes, timestamps, and one-click download buttons.
- Python 3.10+
- Node.js 18+
```bash
git clone https://github.com/youruser/mimesis.git
cd mimesis

# Backend
cd backend && pip install -r requirements.txt && cd ..

# Frontend
cd frontend && npm install && cd ..

./start.sh
```

| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Backend API | http://localhost:8000 |
| Interactive API docs | http://localhost:8000/docs |
Or start them individually:
```bash
# Terminal 1 — backend
cd backend && uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 — frontend
cd frontend && npm run dev
```

Mimesis ships with a comprehensive hardcoded catalogue (55+ models) and supports live-fetching the full model list directly from each provider's API.
| Provider | Models included | Live fetch |
|---|---|---|
| Anthropic | Claude Opus/Sonnet/Haiku 4.x · Claude 3.7 Sonnet · 3.5 Sonnet/Haiku · 3.0 Opus/Sonnet/Haiku | — |
| OpenAI | GPT-4o · GPT-4.5 · o1/o1-mini/o1-preview · o3/o3-mini · o4-mini · GPT-4 Turbo | ✓ |
| Google Gemini | Gemini 2.5 Pro/Flash · 2.0 Flash/Flash-Lite/Thinking/Pro · 1.5 Pro/Flash/Flash-8B | ✓ |
| DeepSeek | DeepSeek V3 (Chat) · R1 (Reasoner) · Coder V2 | ✓ |
| Together AI | Llama 3.1/3.2/3.3 · DeepSeek V3/R1 · Qwen 2.5 · Mixtral 8×7B/8×22B · Gemma 2 · Sky-T1 | ✓ |
| OpenRouter | 24 curated routes + full live catalogue (200+ models) | ✓ |
| vLLM | Any model — auto-discovered via /v1/models on your server | ✓ |
| Provider | Default (Tier 1) | Higher tiers |
|---|---|---|
| Anthropic | 60 RPM | up to 8,000 (Tier 4) |
| OpenAI | 500 RPM | up to 10,000 (Tier 3) |
| Gemini | 15 RPM (free) | 2,000 (Flash paid) · 360 (Pro paid) |
| DeepSeek | 60 RPM | — |
| Together | 60 RPM (free) | 600 (paid) |
| OpenRouter | 200 RPM | — |
| vLLM | auto-calculated | from GPU hardware specs |
```
mimesis/
├── backend/
│   ├── main.py                  # FastAPI entry point
│   ├── config.py                # Model catalogue + rate limits
│   ├── requirements.txt
│   ├── core/
│   │   ├── pipeline.py          # Async generation engine
│   │   ├── seed_loader.py       # CSV / HuggingFace loader
│   │   ├── rate_limiter.py      # Token-bucket rate limiter
│   │   ├── output_manager.py    # CSV saving + HF Hub push
│   │   └── providers/
│   │       ├── base.py          # BaseProvider abstract class
│   │       ├── anthropic_provider.py
│   │       ├── openai_provider.py
│   │       ├── gemini_provider.py
│   │       ├── deepseek_provider.py
│   │       ├── together_provider.py
│   │       ├── openrouter_provider.py
│   │       └── vllm_provider.py
│   └── api/
│       ├── websocket.py         # WebSocket connection manager
│       └── routes/
│           ├── seeds.py         # /api/seeds/*
│           ├── providers.py     # /api/providers/*
│           └── pipeline.py      # /api/pipeline/*
│
├── frontend/
│   └── src/
│       ├── App.tsx              # Router + layout shell
│       ├── types/index.ts       # Shared TypeScript types
│       ├── config/theme.ts      # Design tokens (colours, radii)
│       ├── store/pipelineStore.ts  # Zustand global state
│       ├── hooks/
│       │   ├── useApi.ts        # Typed REST API helpers
│       │   └── useWebSocket.ts  # Real-time event handler
│       ├── components/ui/       # Design-system primitives
│       └── pages/
│           ├── LandingPage.tsx
│           ├── SeedsPage.tsx
│           ├── ProviderPage.tsx
│           ├── PipelinePage.tsx
│           ├── PreviewPage.tsx
│           ├── RunPage.tsx
│           └── OutputPage.tsx
│
├── outputs/                     # Generated CSVs saved here
├── docs/
│   └── screenshots/             # UI screenshots (used in this README)
└── start.sh
```
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/providers/catalogue | Model catalogue + rate limit info |
| POST | /api/providers/fetch-models | Live-fetch models from a provider's API |
| POST | /api/providers/test-connection | Validate credentials with a dummy request |
| POST | /api/providers/vllm/calculate | Estimate RPM from GPU hardware specs |
| POST | /api/seeds/columns | Inspect columns from a CSV or HF dataset |
| POST | /api/seeds/upload | Upload a CSV file |
| POST | /api/pipeline/preview | Run a small preview batch |
| POST | /api/pipeline/start | Start a full pipeline run |
| POST | /api/pipeline/stop | Stop a running pipeline |
| GET | /api/pipeline/status | Poll pipeline status |
| GET | /api/pipeline/outputs | List generated output files |
| WS | /ws/{pipeline_id} | Real-time progress stream |
Full interactive docs at http://localhost:8000/docs when the backend is running.
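For scripting against the backend without the UI, any of the POST endpoints above can be driven with a plain JSON request. A sketch; the payload field shown is a placeholder assumption, so consult /docs for the real request schema:

```python
import json
from urllib.request import Request

def build_request(base: str, path: str, payload: dict) -> Request:
    """Build a JSON POST for the Mimesis API. The payload fields used
    below are illustrative assumptions; check /docs for the real schema."""
    return Request(f"{base}{path}",
                   data=json.dumps(payload).encode(),
                   headers={"Content-Type": "application/json"},
                   method="POST")

req = build_request("http://localhost:8000", "/api/pipeline/preview",
                    {"num_samples": 5})
# urllib.request.urlopen(req) would send it once the backend is running
```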
- Create `backend/core/providers/myprovider_provider.py`
- Extend `BaseProvider`, implement `async generate(prompt, system) -> str`
- Register in `backend/core/providers/__init__.py` → `REGISTRY`
- Add model list + rate limits to `backend/config.py` → `PROVIDER_MODELS`
- Add the provider name to `EXTERNAL_PROVIDERS` in `frontend/src/pages/ProviderPage.tsx`
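Put together, a new provider module might look like this sketch. The `BaseProvider` internals and constructor arguments shown here are assumptions (the real abstract class lives in `backend/core/providers/base.py`); only the `async generate(prompt, system) -> str` signature is taken from the steps above:

```python
import asyncio
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Stand-in for backend/core/providers/base.py (constructor args assumed)."""

    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    @abstractmethod
    async def generate(self, prompt: str, system: str) -> str: ...

class MyProvider(BaseProvider):
    """Skeleton for backend/core/providers/myprovider_provider.py."""

    async def generate(self, prompt: str, system: str) -> str:
        # A real provider would call its HTTP API here (e.g. via httpx)
        # and return the completion text; this stub just echoes its inputs.
        return f"[{self.model}] {system}: {prompt}"

demo = asyncio.run(MyProvider("sk-key", "my-model").generate("hello", "be terse"))
```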
| Layer | Technology |
|---|---|
| Backend | Python 3.10+, FastAPI, uvicorn, asyncio, httpx |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS |
| State management | Zustand |
| Charts | Recharts |
| Icons | Lucide React |
| LLM SDKs | anthropic, openai, google-generativeai |
MIT