Cestus: Vision Document

An Open-Source Intelligence & Data Operations Platform

Version 0.1 -- February 2026

Executive Summary
Understanding the Landscape: What Palantir Does
Existing Alternatives and Gaps
Mission Statement
Core Feature Set
Architecture Ideas
What Makes Cestus Different
Potential Use Cases
Phased Roadmap
Existing Open-Source Building Blocks
Research Sources

1. Executive Summary

Palantir Technologies has built a multi-billion dollar business around a deceptively simple insight: organizations drown in data not because they lack storage or compute, but because they lack a unified semantic layer that connects disparate data sources into a coherent model of reality -- and then lets humans and AI agents act on that model.

Palantir's moat is not any single algorithm. It is the integration -- the ontology layer that turns raw tables into entities and relationships, the visualization tools that let analysts explore those relationships across graphs, maps, and timelines, the action framework that lets decisions flow back into operational systems, and the deployment machinery that makes all of this work in sensitive, air-gapped environments.

No single open-source project replicates this today. But the building blocks exist. Cestus is a vision for composing those building blocks into a coherent, open-source platform that delivers Palantir-class capabilities to organizations that cannot afford Palantir, do not want vendor lock-in, or need the transparency that only open source provides.

2. Understanding the Landscape: What Palantir Does

2.1 Palantir Gotham (Intelligence & Defense)

Gotham is Palantir's original product, built for intelligence agencies and military organizations. Its core capabilities include:

Data Integration: Connectors to structured sources (databases, ERP, CRM), semi-structured (logs, XML), and unstructured (PDFs, emails, imagery), with heavy investment in deduplication and entity resolution.
Ontology / Entity Graph: A dynamic ontology that tags entities (persons, phone numbers, addresses, organizations, assets) and places links between them. This is the "digital twin" of the intelligence domain.
Link Analysis (Graph): A network analysis canvas where analysts create visual representations of networked data. Users can view aggregated property statistics, organize and style graphs, and annotate them as part of collaborative workflows.
Geospatial Analysis: Map layers showing entity locations, event timelines, and movement patterns.
Object Explorer: Top-down analysis enabling users to find entities with similar characteristics and visualize relationships across millions of records.
Collaboration: Real-time concurrent analysis within Graph and other applications. Shared canvases, annotations, and presentation workflows.

2.2 Palantir Foundry (Commercial & Government Operations)

Foundry is the commercial platform -- a "data operating system" for enterprises:

Data Integration: Pipeline-based ingestion from any source, with data transformation and cleaning.
Ontology Layer: The signature feature. The Foundry Ontology sits on top of datasets and models, connecting them to their real-world counterparts (factories, equipment, products, orders, transactions). It contains both semantic elements (objects, properties, links) and kinetic elements (actions, functions, dynamic security).
Ontology Architecture: A microservices backend with the Ontology Metadata Service (OMS) defining object types, link types, and action types; the Object Data Funnel orchestrating data writes and indexing; and Object Storage V2 separating indexing from querying for horizontal scalability.
Application Building: Low-code/no-code tools for building operational applications on top of the ontology.
Workshop & Quiver: Drag-and-drop application builders for dashboards and operational workflows.

2.3 Palantir AIP (Artificial Intelligence Platform)

AIP integrates LLMs and AI agents into the ontology:

Ontology-Grounded AI: AI agents reason over the ontology's entities, relationships, and business logic rather than raw data. The ontology provides "the nouns" (entities) and "the verbs" (actions) of the enterprise.
AIP Agent Studio: Build, test, and deploy AI agents that can read from and write to the ontology. Agents are sandboxed with specific permissions on data and tools.
AIP Logic: A no-code environment for building LLM-powered functions that leverage the ontology.
Agents as Functions: Agents can be published as Functions, making them composable and reusable across the platform.

2.4 Palantir Apollo (Continuous Delivery & Operations)

Apollo is the deployment and operations layer:

Hub and Spoke Architecture: A central Apollo Hub manages multiple Spoke environments, each running a Spoke Control Plane that reports telemetry and executes deployment plans.
Pull-Based Deployment: Instead of pushing code, environments pull updates via subscriptions to Release Channels.
Air-Gapped Support: Manages software across connected and disconnected environments, critical for defense and regulated industries.
Compliance-Aware: Built-in controls for FedRAMP, IL5, IL6 accreditation frameworks.

2.5 The Palantir "Secret Sauce"

The real power is not any individual product but their integration:

Data goes in (any format, any source)
Ontology maps it to real-world entities and relationships
Humans explore it via graphs, maps, timelines, dashboards
AI reasons over it grounded in the ontology
Actions flow back into operational systems
Apollo deploys it anywhere, including air-gapped environments
Security governs it at every layer with fine-grained access control

3. Existing Alternatives and Gaps

3.1 Commercial Alternatives

Product	Strengths	Gaps vs. Palantir
Databricks	Unified analytics, Delta Lake, MLflow	No ontology layer, no link analysis, no investigative UI
Snowflake	Data warehousing, data sharing	Pure storage/compute, no semantic layer
Dataiku	End-to-end data science	Weaker on ontology and operational applications
d.AP (digetiers)	Ontology-grounded on RDF/OWL open standards	Newer, smaller ecosystem
DataWalk	Link analysis, investigative analysis	Proprietary, narrower scope
Siren	Investigative intelligence, link analysis	Proprietary, Elasticsearch-based
C3 AI	Enterprise AI applications	Proprietary, expensive

3.2 Open-Source Landscape

What exists today:

Data Integration/ETL: Apache Airflow, Apache NiFi, Apache Hop, Apache Kafka, Apache Beam, Airbyte, dbt
Data Catalogs/Metadata: OpenMetadata, DataHub (LinkedIn), Amundsen (Lyft), Apache Atlas
Graph Databases: Neo4j (Community Edition), JanusGraph, Apache TinkerPop, Apache AGE (Postgres extension)
Entity Resolution: Zingg (ML-based), Splink (probabilistic), Dedupe (Python)
Knowledge Graphs: WhyHow Knowledge Graph Studio, Graphiti, KBpedia
Visualization: Apache Superset (dashboards), Grafana (monitoring), Kepler.gl (geospatial), Gephi (graph), Sigma.js (graph/web)
AI/LLM Frameworks: LangChain, LlamaIndex, LangGraph, CrewAI, Dify
Authorization: Casbin, Ory Keto, Permify (Zanzibar-inspired RBAC)
Deployment: ArgoCD, Flux, Kubernetes

The critical gap: No single open-source project or composition of projects provides the ontology-as-operating-system experience -- the unified semantic layer that sits between raw data and applications/AI, with integrated entity resolution, link analysis visualization, and action frameworks. Projects exist in silos. The integration is the product, and that integration does not exist in open source.

3.3 The Closest Attempts

Dashjoin: An open-source low-code platform that establishes a linked data graph over data sources with browsing, searching, editing, AI integration, and GitOps delivery. The closest thing to an integrated Palantir alternative in open source, but significantly smaller in scope and community.
Apache Atlas + JanusGraph + Superset: A common open-source stack for metadata governance and visualization, but lacks the ontology-driven application layer and investigative UI.

4. Mission Statement

The Problem

Organizations of all sizes face the same fundamental challenge that Palantir solves for the world's largest governments and enterprises: their data is fragmented across dozens of systems, in incompatible formats, with no unified way to understand what the data represents in the real world, explore relationships, or act on insights.

Today, solving this problem requires either (a) paying millions for Palantir, (b) assembling a bespoke stack from dozens of open-source tools with no integration layer, or (c) going without.

The Mission

Cestus is a free, open-source data operations platform that unifies data integration, semantic modeling, entity resolution, visual analysis, AI reasoning, and operational action into a single coherent system.

Cestus makes it possible for any organization -- investigative journalists, humanitarian NGOs, academic researchers, mid-sized enterprises, local governments, open-source intelligence analysts -- to turn fragmented data into an entity-relationship model of their domain, explore it visually, reason over it with AI, and take action.

Core Principles

Open Source, Always: Apache 2.0 or similar permissive license. No open-core bait-and-switch.
Ontology-First: The semantic model is the core abstraction. Everything else -- ingestion, visualization, AI, actions -- operates through the ontology.
Composable: Built as a set of well-defined services with clean APIs. Use the whole platform or individual components.
AI-Native: LLM and agent integration is not an afterthought -- it is a first-class capability grounded in the ontology.
Security by Design: Fine-grained access control (RBAC + ABAC), audit logging, and data provenance from day one.
Deploy Anywhere: Cloud, on-premise, air-gapped, edge. Kubernetes-native with support for disconnected environments.

5. Core Feature Set

5.1 Data Integration & Ingestion

Goal: Connect to any data source and bring data into the platform with minimal friction.

Capability	Description	Priority
Connectors	Pre-built connectors for databases (Postgres, MySQL, SQL Server, Oracle), cloud storage (S3, GCS, Azure Blob), APIs (REST, GraphQL), files (CSV, JSON, Parquet, XML), messaging (Kafka, RabbitMQ)	MVP
Custom Connectors	SDK for building custom connectors	v1.0
Stream Ingestion	Real-time data ingestion from streaming sources	v1.0
Data Transformation	Pipeline-based transformation with versioning	MVP
Incremental Sync	Change data capture and incremental updates	v1.0
Unstructured Ingestion	PDF, email, document, and image ingestion with AI-powered extraction	v1.5

5.2 Data Modeling & Ontology

Goal: Provide a semantic layer that maps raw data to real-world entities and relationships.

This is the heart of Cestus and the primary differentiator from "just another data tool."

Capability	Description	Priority
Object Types	Define entity types (Person, Organization, Vehicle, Transaction, Event, etc.) with typed properties	MVP
Link Types	Define relationship types between object types (employs, owns, communicated_with, located_at) with properties	MVP
Ontology Editor	Visual and code-based tools for defining and editing the ontology schema	MVP
Data Mapping	Map raw dataset columns to ontology object properties, with transformation rules	MVP
Entity Resolution	ML-assisted deduplication and entity resolution across data sources	MVP
Interface Types	Polymorphic interfaces (like Palantir's) for consistent modeling across object types that share common shapes	v1.0
Ontology Versioning	Schema versioning with migration support	v1.0
Derived Properties	Computed properties based on linked objects, aggregations, or functions	v1.0
Temporal Modeling	First-class support for time-varying properties and historical states	v1.5

Ontology Data Model (Conceptual):

ObjectType
  - id: UUID
  - name: string (e.g., "Person", "Organization")
  - properties: PropertyDefinition[]
  - interfaces: InterfaceType[]
  - datasource_mappings: DataSourceMapping[]

LinkType
  - id: UUID
  - name: string (e.g., "employed_by", "called")
  - source_type: ObjectType
  - target_type: ObjectType
  - properties: PropertyDefinition[]
  - cardinality: ONE_TO_ONE | ONE_TO_MANY | MANY_TO_MANY

Object (instance)
  - id: UUID
  - type: ObjectType
  - properties: { [key]: value }
  - provenance: DataSource[]  -- which sources contributed to this object
  - confidence: float          -- entity resolution confidence
  - timestamps: { created, modified, valid_from, valid_to }

Link (instance)
  - id: UUID
  - type: LinkType
  - source: Object
  - target: Object
  - properties: { [key]: value }
  - provenance: DataSource[]

5.3 Search & Discovery

Goal: Find any entity, relationship, or pattern across the entire ontology.

Capability	Description	Priority
Full-Text Search	Search across all object properties with relevance ranking	MVP
Faceted Search	Filter by object type, property values, date ranges, data source	MVP
Graph Traversal Search	"Find all entities within N hops of entity X"	MVP
Saved Searches	Save and share search queries	v1.0
Natural Language Search	AI-powered "ask a question in plain English" search	v1.0
Pattern Search	Find subgraph patterns (e.g., "person connected to organization through phone number")	v1.5

5.4 Visualization & Analytics

Goal: Multiple visual paradigms for exploring the ontology -- because different questions require different views.

Capability	Description	Priority
Graph View	Interactive link analysis canvas. Expand entities, explore connections, filter, cluster, style nodes/edges. The primary investigative interface.	MVP
Table View	Spreadsheet-like view of object collections with sorting, filtering, grouping	MVP
Map View	Geospatial visualization of entities with location properties. Layers, clustering, heatmaps.	MVP
Timeline View	Temporal visualization of events and entity activity	v1.0
Dashboard Builder	Drag-and-drop dashboard composition from charts, tables, maps, and graphs	v1.0
Object Profile	Detailed view of a single entity with all properties, linked entities, activity timeline, and source provenance	MVP
Histogram / Charts	Bar, line, pie, scatter, and other statistical visualizations over ontology data	v1.0
Notebook Integration	Jupyter-style notebook for ad-hoc analysis with access to ontology APIs	v1.5

5.5 Collaboration

Goal: Multiple users working together on investigations, analyses, and operational workflows.

Capability	Description	Priority
Workspaces	Shared project spaces for team collaboration	MVP
Annotations	Add notes, tags, and assessments to any entity or relationship	MVP
Canvas Sharing	Share and co-edit graph, map, and timeline canvases	v1.0
Comments	Threaded comments on any object, link, or analysis artifact	v1.0
Activity Feed	See what teammates have been exploring, annotating, or modifying	v1.0
Audit Trail	Full history of who viewed, modified, or exported what data	MVP

5.6 AI / ML Integration

Goal: AI agents that can reason over the ontology, answer questions, and take actions -- grounded in real data, not hallucinations.

Capability	Description	Priority
Ontology-Grounded RAG	LLM queries answered using ontology entities and relationships as context	MVP
Natural Language Query	"Show me all transactions over $10K between Company A and any entity flagged as high-risk"	v1.0
Entity Extraction (NER)	AI-powered extraction of entities and relationships from unstructured text	v1.0
AI Agent Framework	Agents with tool-calling that can search the ontology, traverse graphs, create annotations, and suggest actions	v1.0
Anomaly Detection	ML models that identify unusual patterns in entity behavior or relationships	v1.5
Agent Sandboxing	Fine-grained permissions for what data and actions agents can access	v1.0
Model Registry	Register and version ML models, connect outputs to ontology	v1.5
Bring Your Own LLM	Support for OpenAI, Anthropic, local models (Ollama, vLLM), or any OpenAI-compatible API	MVP

5.7 Actions & Operational Integration

Goal: Move from insight to action -- write changes back to source systems, trigger workflows, and automate operational responses.

Capability	Description	Priority
Action Types	Define typed actions (approve, escalate, flag, update, notify) with input/output schemas	v1.0
Action Execution	Execute actions that write back to source systems via connectors	v1.0
Workflow Engine	Multi-step workflows triggered by events, schedules, or human decisions	v1.5
Webhooks	Outbound webhooks on ontology events (entity created, link added, property changed)	v1.0
Notifications	Alerts and notifications based on ontology events or AI agent findings	v1.0

5.8 Access Control & Security

Goal: Enterprise-grade security that operates at the ontology level, not just the data level.

Capability	Description	Priority
Authentication	SSO (SAML, OIDC), local accounts, API keys	MVP
RBAC	Role-based access control (admin, analyst, viewer, etc.)	MVP
Object-Level Permissions	Control who can see/edit specific object types	MVP
Property-Level Masking	Mask sensitive properties (SSN, financial data) based on role	v1.0
Row-Level Security	Filter visible objects based on user attributes (department, clearance, geography)	v1.0
Marking/Classification	Apply classification markings to data and enforce handling rules	v1.5
Audit Logging	Immutable audit log of all data access, modifications, and exports	MVP
Data Provenance	Track which source contributed each property value, with lineage	MVP

6. Architecture Ideas

6.1 High-Level Architecture

+------------------------------------------------------------------+
|                        Cestus Platform                       |
+------------------------------------------------------------------+
|                                                                    |
|  +-------------------+  +-------------------+  +----------------+ |
|  |   Web UI (SPA)    |  |   CLI / SDK       |  |  REST/GraphQL  | |
|  |   React + D3/     |  |   Python / TS     |  |  API Gateway   | |
|  |   Sigma.js/Deck.gl|  |                   |  |                | |
|  +--------+----------+  +--------+----------+  +-------+--------+ |
|           |                       |                      |         |
|  +--------v-----------------------v----------------------v------+  |
|  |                    API Layer (Gateway)                        |  |
|  |             Authentication | Rate Limiting | Routing          |  |
|  +---+----------+----------+----------+----------+-----------+--+  |
|      |          |          |          |          |           |      |
|  +---v---+ +---v----+ +--v----+ +---v----+ +---v-----+ +--v---+  |
|  |Ontology| |Search  | |Visual | |  AI    | |Actions  | |Auth  |  |
|  |Service | |Service | |Service| |Service | |Service  | |Svc   |  |
|  +---+---+ +---+----+ +--+----+ +---+----+ +---+-----+ +--+---+  |
|      |         |          |          |          |           |      |
|  +---v---------v----------v----------v----------v-----------v---+  |
|  |              Ontology Storage Layer                           |  |
|  |  +-------------+  +------------+  +-----------------------+  |  |
|  |  | Graph DB     |  | Search     |  | Object/Relational    |  |  |
|  |  | (Neo4j/      |  | (Elastic/  |  | Store (Postgres)     |  |  |
|  |  |  JanusGraph)  |  |  Typesense)|  |                      |  |  |
|  |  +-------------+  +------------+  +-----------------------+  |  |
|  +--------------------------------------------------------------+  |
|                                                                    |
|  +--------------------------------------------------------------+  |
|  |              Data Integration Layer                           |  |
|  |  +----------+  +----------+  +----------+  +-------------+  |  |
|  |  | Connectors|  | Transform|  | Entity   |  | Pipeline    |  |  |
|  |  | (Airbyte) |  | (dbt)    |  | Resolver |  | Orchestrator|  |  |
|  |  |           |  |          |  | (Zingg/  |  | (Airflow/   |  |  |
|  |  |           |  |          |  |  Splink) |  |  Temporal)  |  |  |
|  |  +----------+  +----------+  +----------+  +-------------+  |  |
|  +--------------------------------------------------------------+  |
|                                                                    |
|  +--------------------------------------------------------------+  |
|  |              Infrastructure Layer                             |  |
|  |  Kubernetes | Helm Charts | Monitoring (Prometheus/Grafana)   |  |
|  |  Object Storage (MinIO/S3) | Message Queue (Kafka/NATS)      |  |
|  +--------------------------------------------------------------+  |
+------------------------------------------------------------------+

6.2 Key Architectural Decisions

Decision 1: Ontology Storage -- Polyglot Persistence

The ontology needs to be queryable in multiple ways simultaneously:

Graph traversals (follow relationships N hops) --> Graph database
Full-text search (find entities by keyword) --> Search engine
Aggregations and analytics (count, sum, group) --> Relational/columnar database
High-volume writes (ingest pipelines) --> Write-optimized store

Rather than picking one database, use a polyglot persistence approach with an Ontology Storage Layer that writes to multiple backends and keeps them in sync via an event bus:

PostgreSQL (with Apache AGE extension for graph queries): Primary store of record for object and link instances. Proven, reliable, rich ecosystem. AGE adds Cypher query support directly in Postgres.
Elasticsearch or Typesense: Full-text search index, kept in sync via change data capture.
Neo4j Community Edition or JanusGraph: Dedicated graph store for deep traversal queries. Neo4j CE for simpler deployments; JanusGraph for distributed scale.

Decision 2: Ontology Service -- The Central Nervous System

A dedicated Ontology Service acts as the single API for reading and writing ontology data. All other services (search, visualization, AI, actions) interact with the ontology through this service. This ensures:

Consistent schema enforcement
Centralized access control
Event emission for all mutations (enabling downstream sync and audit)

Decision 3: Event-Driven Architecture

All ontology mutations produce events on a message bus (Kafka or NATS):

object.created, object.updated, object.deleted
link.created, link.updated, link.deleted
action.executed

This enables:

Search index updates
Graph database sync
Audit logging
Webhook delivery
AI agent triggers
Real-time UI updates via WebSockets

Decision 4: Frontend -- React with Specialized Visualization Libraries

Core Framework: React (TypeScript)
Graph Visualization: Sigma.js (WebGL-based, handles large graphs) or Cytoscape.js
Geospatial: Deck.gl (WebGL, large-scale) or Leaflet (simpler)
Charts/Dashboards: Apache ECharts or Recharts
Timeline: vis-timeline or custom D3-based
Layout: A workspace model (like VS Code) where users can arrange panels

Decision 5: AI Integration -- Plugin Architecture

Rather than hardcoding LLM providers:

Define an LLM Provider Interface that abstracts model calls
Ship adapters for OpenAI, Anthropic, Ollama (local), vLLM
Use LangChain or LlamaIndex internally for RAG pipeline
Agents use a Tool abstraction that maps to ontology operations (search, traverse, annotate, execute action)

6.3 Proposed Tech Stack

Layer	Technology	License	Rationale
Frontend	React + TypeScript	MIT	Industry standard, massive ecosystem
Graph Viz	Sigma.js	MIT	WebGL performance for large graphs
Geo Viz	Deck.gl	MIT	High-performance geospatial
Charts	Apache ECharts	Apache 2.0	Rich chart types, good performance
API Gateway	Kong or Traefik	Apache 2.0	API management, auth, rate limiting
Backend Services	Python (FastAPI) or Go	MIT / BSD	FastAPI for rapid development; Go for performance-critical services
Primary DB	PostgreSQL + Apache AGE	PostgreSQL / Apache 2.0	Relational + graph in one, proven at scale
Search	Typesense or Elasticsearch	GPL-3 / SSPL	Typesense is simpler and truly open; Elasticsearch has larger ecosystem
Graph DB (optional)	Neo4j CE or JanusGraph	GPL-3 / Apache 2.0	Deep traversal queries; optional if AGE suffices
Message Bus	NATS or Apache Kafka	Apache 2.0	NATS for simplicity; Kafka for scale
Object Storage	MinIO	AGPL-3.0	S3-compatible, for documents and files
Pipeline Orchestration	Apache Airflow or Temporal	Apache 2.0 / MIT	Airflow for batch; Temporal for event-driven workflows
Data Connectors	Airbyte	MIT (Elv2 for some)	300+ pre-built connectors
Entity Resolution	Zingg or Splink	AGPL-3.0 / MIT	ML-based dedup and entity resolution
AI/RAG	LangChain + LlamaIndex	MIT	RAG pipeline and agent framework
Auth	Keycloak + Casbin	Apache 2.0	SSO + fine-grained policy engine
Deployment	Kubernetes + Helm	Apache 2.0	Standard cloud-native deployment
Monitoring	Prometheus + Grafana	Apache 2.0	Observability

7. What Makes Cestus Different

7.1 vs. Palantir

Dimension	Palantir	Cestus
Cost	Millions per year	Free (self-hosted)
Transparency	Proprietary black box	Full source code visibility
Vendor Lock-in	Extreme -- data model tied to platform	Open formats, standard APIs, portable ontology
Customization	Services engagement required	Fork it, extend it, contribute back
Community	Palantir employees only	Open contributor community
AI Models	Palantir-selected models	Bring your own -- local, cloud, or any provider
Deployment	Palantir-managed	Self-managed with Helm charts, or managed by community providers

7.2 vs. Other Open-Source Tools

Dimension	Typical OSS Stack	Cestus
Integration	Assemble 10+ tools yourself, build glue code	Integrated platform with shared ontology
Ontology	Each tool has its own data model	Single semantic ontology layer across all features
Entity Resolution	Run separately, reconcile manually	Built-in, continuous ER feeding the ontology
Visualization	Superset for charts, Gephi for graphs, Kepler for maps -- disconnected	Unified workspace with graph, map, timeline, charts sharing one ontology
AI Grounding	RAG over raw data	RAG over the ontology -- entities and relationships, not raw tables
Access Control	Bolt-on per tool	Ontology-level security that governs all views

7.3 The Core Differentiator

Cestus's differentiator is the ontology as the universal API. Every feature -- ingestion, search, visualization, AI, actions, security -- speaks the language of entities and relationships, not tables and columns. This is what makes Palantir powerful, and it is what no open-source project currently provides as a unified, integrated experience.

8. Potential Use Cases

8.1 Investigative Journalism

Journalists investigating financial crime, political corruption, or corporate misconduct need to connect entities (people, companies, addresses, bank accounts) from leaked documents, public records, and proprietary databases. Cestus would provide the graph analysis, entity resolution, and document ingestion to do this -- capabilities currently available only through expensive tools or manual effort.

Example: A newsroom integrates Panama Papers data, corporate registries, and political donation records. Cestus resolves entities across sources and reveals hidden ownership networks.

8.2 Humanitarian & NGO Operations

Organizations like the UNHCR, Red Cross, or Doctors Without Borders manage operations across fragmented data systems -- beneficiary registries, supply chain databases, field reports, geospatial data. Cestus could unify this into a coherent operational picture.

Example: An NGO integrates refugee registration data, supply depot inventories, and field incident reports to optimize resource allocation and identify underserved areas.

8.3 Open-Source Intelligence (OSINT)

OSINT analysts -- whether in journalism, civil society, or academic research -- need to collect, structure, and analyze publicly available information. Cestus would provide the entity resolution, link analysis, and geospatial tools that are currently locked in expensive proprietary platforms.

Example: Researchers tracking the spread of disinformation map social media accounts, websites, and funding sources to reveal coordinated influence networks.

8.4 Academic & Scientific Research

Researchers studying complex systems (epidemiology, climate, social networks, supply chains) need to integrate diverse datasets and explore relationships. Cestus's ontology and visualization tools would serve as a research platform.

Example: Epidemiologists integrate hospital records, genomic data, and mobility data to model disease transmission networks.

8.5 Small/Medium Enterprise Operations

Mid-sized companies that cannot afford Palantir but need to connect their CRM, ERP, supply chain, and financial systems into a coherent view. Cestus could serve as the "data operating system" for companies with 100-10,000 employees.

Example: A manufacturing company connects their ERP, IoT sensor data, supplier database, and quality control system to get end-to-end visibility into production issues.

8.6 Local Government & Public Sector

City and county governments managing public safety, infrastructure, permits, and social services across disconnected systems. Cestus could provide the unified view that large federal agencies get from Palantir.

Example: A city government integrates building permits, code violations, fire inspections, and 311 complaints to identify properties that pose safety risks.

8.7 Fraud Detection & Compliance

Financial institutions, insurance companies, and regulatory bodies that need to detect complex fraud patterns across entity networks.

Example: A credit union integrates transaction data, account records, and external watchlists. Cestus's entity resolution links related accounts, and graph analysis reveals suspicious transaction patterns.

9. Phased Roadmap

Phase 0: Foundation (Months 1-3)

Goal: Core infrastructure and ontology service that proves the concept.

Project scaffolding: monorepo, CI/CD, contribution guidelines, governance
Ontology Service: Core API for defining object types, link types, and properties
PostgreSQL + Apache AGE storage backend for objects, links, and graph queries
Basic data ingestion: CSV and JSON file upload, manual data entry
Basic entity resolution: Rule-based deduplication (exact match, fuzzy match on key fields)
REST API: Full CRUD on ontology schema and instances
Authentication: Basic auth and API keys (Keycloak integration in Phase 1)
Minimal Web UI: Object type browser, entity list view, single entity detail page

Deliverable: A working ontology service with API, basic ingestion, and a minimal UI that lets you define a schema, import data, resolve entities, and browse entities and their relationships.

Success Criteria:

Schema CRUD: Define ontology schema (>=3 object types, >=2 link types) via REST API. Minimum: less than 2s p99. Target: less than 500ms p99.
Ingest benchmark: Import entities from CSV. Minimum: 1K entities in less than 5 min. Target: 10K entities in less than 5 min on a 4-core/16GB machine.
Entity resolution: Rule-based dedup on reference dataset. Minimum: >=80% exact-match. Target: >=90% exact + >=70% fuzzy.
Session durability: Every turn writes v2-conformant turns.jsonl. Minimum: interrupted sessions lose <=2 turns. Target: <=1 turn loss (verified by kill-test).
Cross-runtime compat: Both Python and desktop can read each other's sessions. Minimum: >=80% of fixture matrix. Target: 100%.

Phase 1: Core Visualization & Search (Months 4-8)

Goal: The investigative experience -- graph, map, search, and collaboration.

Deliverable: An analyst can ingest data from multiple sources, search across entities, explore relationships in a graph canvas, view entities on a map, and collaborate with teammates in shared workspaces.

Success Criteria:

Graph performance: Expand node to 3 hops, filter 2+ types, >=500 entities. Minimum: >=15fps. Target: >=30fps.
Search recall: Top-5 results for test suite. Minimum: >=60% of 50 queries. Target: >=80%.
Provenance drill-down: Revelation cards resolve to evidence. Minimum: >=50% of cards have >=1 drill-down link. Target: 100%.
Connector SDK: Working connectors pass ingestion test. Minimum: 2 connectors. Target: 3+ (PostgreSQL, CSV, REST API).
Audit completeness: Minimum: >=90% coverage. Target: 100% of data operations logged.

Phase 2: AI & Advanced Analytics (Months 9-14)

Goal: AI-powered analysis and richer visualization.

Deliverable: An analyst can ask questions in natural language, have AI agents traverse the ontology to find answers, ingest and extract entities from documents, build dashboards, and set up automated data pipelines.

Success Criteria:

RAG accuracy: Ontology-grounded answers (human eval). Minimum: >=50% correct, <=10% hallucination. Target: >=70% correct, <=5% hallucination.
NL query translation: Valid ontology queries. Minimum: >=40% of 40-query suite. Target: >=60%.
Entity extraction: NER F1 on reference corpus. Minimum: >=60% F1. Target: >=75% F1.
Dashboard builder: Non-technical user creates 3-widget dashboard. Minimum: less than 20 min. Target: less than 10 min.
Pipeline reliability: 100 consecutive runs. Minimum: <=5% failure. Target: <=2% failure.

Phase 3: Actions, Agents & Operational Workflows (Months 15-20)

Goal: Transform from an analytical platform into an operational one.

Deliverable: The platform is operational -- insights lead to actions, workflows automate responses, AI agents work alongside human analysts, and the security model supports sensitive use cases.

Success Criteria:

Action round-trip: Flag entity, write-back reflected in ontology. Minimum: less than 60s. Target: less than 30s.
Agent sandboxing: Restricted agent cannot access out-of-scope entities. Minimum: 0 escapes in 10-op test. Target: 0 escapes in 20-op test.
Workflow reliability: 5-step event-triggered workflow. Minimum: >=90% over 50 runs. Target: >=95%.
Pattern search: Subgraph pattern on >=100K entities. Minimum: less than 10s. Target: less than 5s.

Phase 4: Scale, Deploy Anywhere & Ecosystem (Months 21+)

Goal: Production hardening, deployment flexibility, and community ecosystem.

Horizontal Scaling: Sharded ontology storage for very large datasets
Air-Gapped Deployment: Full functionality without internet access
Edge Deployment: Lightweight deployment for resource-constrained environments
Plugin Marketplace: Community-contributed connectors, visualizations, AI agents
Multi-Tenancy: Shared infrastructure with isolated ontologies
Federated Ontologies: Connect multiple Cestus instances while respecting access boundaries
Compliance Frameworks: FedRAMP, SOC 2, GDPR compliance tooling
Mobile UI: Responsive interface for field use
Real-Time Collaboration: Google Docs-style concurrent editing of canvases and annotations

Success Criteria:

Horizontal scale: 10M+ entities, 3-hop traversal. Minimum: less than 5s p95. Target: less than 2s p95.
Air-gapped: Full function without network. Minimum: 24h. Target: 72h+.
Multi-tenancy isolation: Cross-tenant query provably blocked. Minimum: verified by test suite.

9.5 Success Metrics Dashboard

The following operational metrics are tracked continuously across all phases:

Session contract conformance rate: % of sessions with valid v2 metadata
Provenance coverage: % of revelation cards with >=1 drill-down target
Cross-runtime compatibility score: % of test matrix passing
Mean time to evidence: Seconds from overview card click to evidence display

10. Existing Open-Source Building Blocks

One of Cestus's strategic advantages is that it does not need to build everything from scratch. The following projects can serve as foundations:

10.1 Data Integration & Pipeline

Project	What It Provides	How Cestus Uses It	License
Airbyte	300+ pre-built data connectors	Data ingestion from any source	Elv2 (core) / MIT
Apache Airflow	Workflow orchestration, DAGs, scheduling	Pipeline orchestration for batch ingestion	Apache 2.0
Temporal	Durable execution, event-driven workflows	Action execution and operational workflows	MIT
dbt	SQL-based data transformation	Transform raw data before ontology mapping	Apache 2.0
Apache Kafka / NATS	Event streaming, message bus	Internal event bus for ontology mutations	Apache 2.0
Apache NiFi	Data flow management, visual pipeline builder	Alternative/complement to Airflow for stream processing	Apache 2.0

10.2 Storage & Search

Project	What It Provides	How Cestus Uses It	License
PostgreSQL	Relational database	Primary store for ontology objects and metadata	PostgreSQL
Apache AGE	Graph query extension for PostgreSQL	Graph traversal queries within Postgres	Apache 2.0
Neo4j Community	Native graph database	Deep traversal queries (optional)	GPL-3.0
JanusGraph	Distributed graph database	Large-scale graph workloads (alternative to Neo4j)	Apache 2.0
Elasticsearch	Search engine	Full-text search and faceted filtering	SSPL
Typesense	Search engine (simpler, truly open)	Full-text search (alternative to ES)	GPL-3.0
MinIO	S3-compatible object storage	Document and file storage	AGPL-3.0

10.3 Entity Resolution & Knowledge Graphs

Project	What It Provides	How Cestus Uses It	License
Zingg	ML-based entity resolution at scale	Deduplication and entity matching across sources	AGPL-3.0
Splink	Probabilistic record linkage	Scalable entity resolution (Python, multiple backends)	MIT
Dedupe	Python entity resolution library	Lightweight ER for smaller datasets	MIT
WhyHow KG Studio	Knowledge graph construction with entity resolution	Reference architecture for ontology management	MIT

10.4 Visualization

Project	What It Provides	How Cestus Uses It	License
Sigma.js	WebGL graph rendering for the web	Link analysis / graph exploration canvas	MIT
Cytoscape.js	Graph theory library for visualization	Alternative graph renderer with rich layout algorithms	MIT
Deck.gl	WebGL-powered large-scale geospatial visualization	Map view for entities with location data	MIT
Leaflet	Lightweight interactive maps	Simpler geospatial view (alternative to Deck.gl)	BSD-2
Apache ECharts	Rich charting library	Dashboard charts and statistical visualizations	Apache 2.0
vis-timeline	Interactive timeline visualization	Timeline view for temporal entity data	MIT/Apache 2.0
Gephi	Desktop graph analysis tool (reference)	Architectural inspiration for graph analysis features	GPL
Apache Superset	Dashboard and visualization platform	Reference architecture; possible embed for dashboarding	Apache 2.0

10.5 AI & ML

Project	What It Provides	How Cestus Uses It	License
LangChain	LLM application framework	RAG pipeline, agent framework, tool integration	MIT
LlamaIndex	Data indexing and retrieval for LLMs	Ontology-aware indexing for AI queries	MIT
LangGraph	Graph-based agent workflows	Multi-step agent reasoning over ontology	MIT
Dify	LLM app development platform	Reference architecture for AI integration	Apache 2.0
Ollama	Local LLM serving	Run models locally for air-gapped deployments	MIT
vLLM	High-performance LLM serving	Production LLM inference	Apache 2.0

10.6 Security & Auth

Project	What It Provides	How Cestus Uses It	License
Keycloak	Identity and access management, SSO	Authentication (SAML, OIDC, LDAP)	Apache 2.0
Casbin	Authorization library (RBAC, ABAC)	Fine-grained policy enforcement	Apache 2.0
Permify	Google Zanzibar-inspired authorization	Relationship-based access control (alternative to Casbin)	Apache 2.0
Open Policy Agent (OPA)	Policy engine	Policy-as-code for complex authorization rules	Apache 2.0

10.7 Deployment & Operations

Project	What It Provides	How Cestus Uses It	License
Kubernetes	Container orchestration	Deployment platform	Apache 2.0
Helm	Kubernetes package manager	Deployment packaging	Apache 2.0
ArgoCD	GitOps continuous delivery	Automated deployment from Git	Apache 2.0
Prometheus + Grafana	Monitoring and observability	Platform health monitoring	Apache 2.0

10.8 Data Catalogs (Reference Architecture)

Project	What It Provides	Relevance
OpenMetadata	Unified metadata platform	Reference for metadata management and lineage
DataHub	Event-driven metadata management	Reference for real-time metadata sync
Apache Atlas	Metadata governance for Hadoop	Reference for classification and security integration

11. Research Sources

This document is a living artifact. It represents the initial vision for Cestus and should be revised as the project evolves, the community grows, and real-world usage reveals what matters most.

FilesExpand file tree

VISION.md

Latest commit

History

VISION.md

File metadata and controls

Cestus: Vision Document

Table of Contents

1. Executive Summary

2. Understanding the Landscape: What Palantir Does

2.1 Palantir Gotham (Intelligence & Defense)

2.2 Palantir Foundry (Commercial & Government Operations)

2.3 Palantir AIP (Artificial Intelligence Platform)

2.4 Palantir Apollo (Continuous Delivery & Operations)

2.5 The Palantir "Secret Sauce"

3. Existing Alternatives and Gaps

3.1 Commercial Alternatives

3.2 Open-Source Landscape

3.3 The Closest Attempts

4. Mission Statement

The Problem

The Mission

Core Principles

5. Core Feature Set

5.1 Data Integration & Ingestion

5.2 Data Modeling & Ontology

5.3 Search & Discovery

5.4 Visualization & Analytics

5.5 Collaboration

5.6 AI / ML Integration

5.7 Actions & Operational Integration

5.8 Access Control & Security

6. Architecture Ideas

6.1 High-Level Architecture

6.2 Key Architectural Decisions

6.3 Proposed Tech Stack

7. What Makes Cestus Different

7.1 vs. Palantir

7.2 vs. Other Open-Source Tools

7.3 The Core Differentiator

8. Potential Use Cases

8.1 Investigative Journalism

8.2 Humanitarian & NGO Operations

8.3 Open-Source Intelligence (OSINT)

8.4 Academic & Scientific Research

8.5 Small/Medium Enterprise Operations

8.6 Local Government & Public Sector

8.7 Fraud Detection & Compliance

9. Phased Roadmap

Phase 0: Foundation (Months 1-3)

Phase 1: Core Visualization & Search (Months 4-8)

Phase 2: AI & Advanced Analytics (Months 9-14)

Phase 3: Actions, Agents & Operational Workflows (Months 15-20)

Phase 4: Scale, Deploy Anywhere & Ecosystem (Months 21+)

9.5 Success Metrics Dashboard

10. Existing Open-Source Building Blocks

10.1 Data Integration & Pipeline

10.2 Storage & Search

10.3 Entity Resolution & Knowledge Graphs

10.4 Visualization

10.5 AI & ML

10.6 Security & Auth

10.7 Deployment & Operations

10.8 Data Catalogs (Reference Architecture)

11. Research Sources