How do you justify a model choice six months after go-live?
Self-hosted LLM governance monitoring for regulated environments. Continuous scoring against EU AI Act, GDPR, and ANSSI — not a one-shot benchmark.
Built out of a question I couldn't find a good answer to, working on LLM deployment in the French public sector. Directly applicable to AI Act Article 9 requirements (ongoing risk management) and NIS2 operational continuity constraints.
govllm scores LLM outputs continuously against configurable governance profiles. Each response is evaluated by a local LLM-as-a-judge across criteria mapped to regulatory frameworks. The best-performing model per use case is selected automatically — based on your governance criteria, not raw performance metrics.
Request → Governance profile → LLM-as-a-judge scoring → Dynamic routing → Model A / B / C / D
↑ |
└──────────── metrics refine criteria ─────┘
No data leaves your infrastructure. Local models via Ollama. Observable via Grafana and Prometheus.
User
│
▼
Frontend :5173 (Vue 3 + ECharts)
│
├──► llm-gateway :8001 ──► LiteLLM ──► Ollama (qwen / gemma / mistral / phi)
│ │
│ └──── Redis pub/sub
│
├──► observability :8002 ──► Prometheus / Grafana / Langfuse
│
└──► evaluation :8003 ──► Local judge (Ollama) ──► Benchmark · Matrix · Score
Three independent FastAPI microservices share a back/shared/ layer (Pydantic schemas + config).
→ See back/README.md for full API reference and service structure.
Score heatmap per model and use case — auto-routes traffic to best performer per governance profile.
Activate a full compliance profile in one click. Criteria, weights and use cases are configurable from the UI.
Prerequisites: Docker, docker compose, uv.
git clone https://github.com/JehanneDussert/govllm
cd govllm
cp infra/.env.example infra/.env
# Fill in Langfuse keys
make dev # hot reload — code changes reflected immediately
# or
make prod # built images + nginx front
make pull-models| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Gateway | http://localhost:8001/docs |
| Observability | http://localhost:8002/docs |
| Evaluation | http://localhost:8003/docs |
| Langfuse | http://localhost:3000 |
| Grafana | http://localhost:3001 |
| Prometheus | http://localhost:9090 |
Four built-in profiles, each activating a targeted set of criteria and weights:
| Profile | Frameworks | Focus |
|---|---|---|
| AI Act Compliance | EU AI Act Art. 5, 13, 14 | Transparency, human oversight, non-manipulation |
| Data Protection | GDPR, ANSSI | Data privacy, leakage prevention, traceability |
| Security | ANSSI, OWASP LLM Top 10 | Prompt injection, robustness, adversarial inputs |
| Accessibility & Inclusion | RGAA, FALC | Language clarity, cognitive load, inclusive design |
Profiles are applied at runtime — switching a profile updates which criteria are active and their weights without restarting any service. Custom profiles can be created from the Settings view.
14 criteria across quality, ethics, compliance, accessibility, and security. All configurable from the UI.
| Criterion | Regulatory anchor | Default |
|---|---|---|
| Relevance | Quality baseline | ✅ |
| Factual reliability | AI Act | ✅ |
| Prompt injection | OWASP LLM01, ANSSI | ✅ |
| Data leakage | OWASP LLM02, ANSSI | ✅ |
| Ethical refusal | ANSSI, ethics | ✅ |
| Non-manipulation | AI Act Art. 5 | — |
| Human oversight | AI Act Art. 14 | — |
| Explicability | AI Act Art. 13 | — |
| Transparency | AI Act | — |
| Data privacy | GDPR | — |
| Language clarity | RGAA, FALC | — |
| Cognitive load | RGAA | — |
| Fairness | AI Act, ethics | — |
| Robustness | ANSSI | — |
The judge model runs locally (ollama/gemma3:4b by default). Evaluation calls are filtered from the traces view so only user interactions appear.
Arena metrics (variance, incoherence rate, bias matrix) measure judge reliability. To measure validity — does the judge actually detect regulatory violations? — govllm uses a curated binary-checklist corpus of 34 annotated cases anchored to CNIL decisions, ANSSI guidelines, and EU AI Act provisions.
→ See docs/ground_truth/README.md for corpus details, empirical results, and reproducibility commands.
48 prompts across 6 use cases and 4 difficulty levels (2 easy · 2 medium · 2 adversarial · 2 hard each). Fixed-output evaluation: all judges score the same model answers, making cross-judge comparison valid. 768 scored entries (48 × 4 generators × 4 judges), 32 per (model, use case) cell.
→ See docs/benchmark/README.md for pipeline, file formats, and planned analyses.
| Layer | Technology |
|---|---|
| Inference | Ollama — phi4-mini · gemma3:4b · mistral:7b · qwen3:1.7b |
| Proxy | LiteLLM |
| Backend | FastAPI · Python 3.11 · uv |
| Tracing | Langfuse v2 |
| Metrics | Prometheus + Grafana |
| Event bus | Redis |
| Reverse proxy | Caddy |
| Frontend | Vue 3 · TypeScript · ECharts |
| Infra | Docker Compose |
govllm/
├── back/ # three FastAPI microservices + shared layer
│ └── README.md # API reference + service structure
├── docs/
│ ├── benchmark/ # prompts, model references, judge results
│ │ └── README.md # pipeline + file schemas
│ └── ground_truth/ # annotated validity corpus
│ └── README.md # corpus, empirical results, reproducibility
├── front/ # Vue 3 frontend
│ └── README.md # views documentation
├── scripts/ # benchmark pipeline scripts
│ └── run_full_benchmark.py
└── infra/ # Docker Compose, LiteLLM config, Prometheus, Grafana
Governance from metrics. Model selection is driven by governance criteria, not performance alone. The score matrix accumulates from real production usage — not synthetic benchmarks.
Local evaluation judge. Scoring runs on Ollama — sovereign and usable in air-gapped or regulated environments (public sector, healthcare, finance). No response data sent to external APIs.
Profile-driven routing. Switching a governance profile at runtime updates which criteria are active and their weights. The routing layer reads the active profile from Redis at inference time and recommends the best-scoring model for that profile and use case.
Shared schema layer. All three microservices share back/shared/src/shared/ for Pydantic schemas and config — single source of truth for data contracts.
Governance context injection. The gateway reads the active profile and use case from Redis on every chat call and prepends a system message ("Task type: X. Governance framework: Y.") before passing messages to the model. The caller can override this by sending its own system message.
Judge traces filtered. Evaluation calls to LiteLLM are excluded from the traces view so only user interactions appear.
Dev/prod parity via compose overrides. make dev mounts source volumes with --reload. make prod builds images and serves the front via nginx. Same base compose file, no drift.
Regulatory texts
- EU AI Act — Art. 5, 9, 13, 14
- GDPR Art. 22 — automated decision-making
- ANSSI SecNumCloud
- NIS2 Directive
Evaluation and benchmarking
- COMPL-AI — AI Act compliance benchmarking (ETH Zurich)
- LM Evaluation Harness — EleutherAI
- OWASP LLM Top 10
LLM observability landscape
govllm is positioned on two axes: sovereign/on-premise deployment and governance-first scoring (regulatory criteria, not just performance metrics).
- Langfuse — open-source tracing, self-hostable. Used as govllm's tracing layer.
- Giskard — LLM testing and red-teaming, EU-based.
- Arize AI — production LLM observability. Cloud-first.
- Fiddler AI — enterprise ML + LLM monitoring, regulated industries.
- LatticeFlow AI — AI Act compliance validation. Closed, enterprise.
- Holistic AI — AI governance and risk management. Audit-oriented.
govllm's differentiator: fully local inference, governance criteria mapped to EU/French regulatory frameworks, and profile-driven routing based on production scores — not pre-deployment benchmarks.
French public sector context
- DINUM Albert — French government's sovereign LLM
- CNIL AI guidance
- Projet PANAME — CNIL's GDPR audit tool for AI models
- AI Charters Portal for Public Administration
On AI ethics charters
Where charters articulate what should be done, govllm provides a technical layer to verify that it is actually being done, continuously, in production. Principles need observability to become practice.
Claude Code was used throughout development: generating the benchmark pipeline (scripts/run_full_benchmark.py), the ground truth corpus scripts (back/evaluation/scripts/), documenting the codebase (API reference, sub-READMEs), and iterative code review.
EUPL-1.2
