- Neo4j Enterprise Operator (Kubebuilder/controller-runtime 0.21, Go 1.24) automates Neo4j Enterprise 5.26.x (last semver LTS) and 2025.x.x+ (CalVer) on Kubernetes.
- Status: alpha; expect churn. Follow `docs/` plus historical reports in `reports/`.
- Platform assumptions: Kubernetes ≥1.21, cert-manager ≥1.18 for TLS, and Kind-only for any cluster work.
- Kind only: use `make dev-cluster`, `test-cluster`, `operator-setup`; never minikube/k3s.
- Enterprise images only (`neo4j:<version>-enterprise`). Discovery uses the LIST resolver with static pod FQDNs (port 6000); 5.26.x requires an explicit `V2_ONLY` flag, while CalVer 2025.x+ (including 2026.x+) omits it. This is handled in `buildVersionSpecificDiscoveryConfig()` via `isCalverImage()`/`ParseVersion`.
- Operator must run in-cluster: never `make dev-run`/`hack/dev-run.sh` or host-mode runs.
- Server-based design: a single `{cluster}-server` StatefulSet with `topology.servers`; preserve the `topology.serverModeConstraint`/`serverRoles` hints and the centralized `{cluster}-backup` StatefulSet.
- Conflict-safe writes: wrap creates/updates in `retry.RetryOnConflict`; when checking existing resources, use UID (not ResourceVersion) for template comparison.
- Edition removed: the API assumes Enterprise; do not reintroduce the field or allow community images.
- Safety hooks: keep the split-brain detector (`internal/controller/splitbrain_detector.go`) wiring and the event reason `SplitBrainDetected`; `status.phase` drives readiness checks.
- Plugin rules: APOC via env vars; Bloom/GDS/GenAI and similar via ConfigMap with automatic security defaults and dependency resolution; respect cluster vs standalone naming/labeling in `plugin_controller`.
- Property sharding: `Neo4jShardedDatabase` and related tests stay opt-in; requires ≥5 servers with 4–8Gi each, so guard the resource gates.
- Database & TLS: `Neo4jDatabase` must work for cluster and standalone (ensure `NEO4J_AUTH` for standalone), respect `seedURI`/`seedConfig`, and automate TLS for `spec.tls.mode=cert-manager`. Discovery settings differ by version: always go through `buildVersionSpecificDiscoveryConfig()`, never hard-code K8S or V1 discovery settings.
- CRD scope separation: Cluster/Standalone manage infra/config; Database manages database lifecycle only, with no cross-CRD overrides.
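The version gate behind `buildVersionSpecificDiscoveryConfig()` can be sketched as follows. This is an illustrative stdlib-only approximation, not the operator's actual implementation: the real logic lives in `internal/neo4j` (`ParseVersion`/`isCalverImage`), and the exact Neo4j setting keys emitted by the operator may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isCalverTag reports whether an Enterprise image tag uses CalVer
// (2025.x and later) rather than semver (5.x). Expects tags of the
// form "neo4j:<version>-enterprise".
func isCalverTag(tag string) bool {
	version := strings.TrimSuffix(strings.SplitN(tag, ":", 2)[1], "-enterprise")
	major, err := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	return err == nil && major >= 2025
}

// discoveryConfig sketches the version gate: semver LTS (5.26.x) needs
// the explicit V2_ONLY discovery mode, CalVer images omit the flag.
func discoveryConfig(tag string) map[string]string {
	cfg := map[string]string{
		"dbms.cluster.discovery.resolver_type": "LIST",
	}
	if !isCalverTag(tag) {
		cfg["dbms.cluster.discovery.version"] = "V2_ONLY" // 5.26.x only
	}
	return cfg
}

func main() {
	fmt.Println(discoveryConfig("neo4j:5.26.3-enterprise")["dbms.cluster.discovery.version"])
	_, hasFlag := discoveryConfig("neo4j:2025.01.0-enterprise")["dbms.cluster.discovery.version"]
	fmt.Println(hasFlag)
}
```

The point of funneling every caller through one function is that 2026.x+ images keep working without new call-site changes: only the major-version comparison decides the flag.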
- CRDs (`api/v1alpha1`): Neo4jEnterpriseCluster, Neo4jEnterpriseStandalone, Neo4jDatabase, Neo4jPlugin, Neo4jBackup, Neo4jRestore, Neo4jShardedDatabase.
- Controllers (`internal/controller/`):
  - Cluster/Standalone reconcilers build ConfigMaps/Services/StatefulSets and manage status phases, TLS, auth, placement, and the cache/configmap managers.
  - Database controller auto-detects cluster vs standalone, uses the appropriate Bolt client, and supports wait/ifNotExists/topology/seed flows.
  - Plugin controller manages env-vs-config installs, dependencies, and pod readiness per deployment type.
  - Backup controller runs centralized `{cluster}-backup` pods with a request-file drop; the Restore controller handles PIT restores.
  - Sharded database controller enforces property-sharding readiness; the topology scheduler adds AZ spread/anti-affinity; the rolling upgrade orchestrator handles leader-aware upgrades and metrics.
  - Split-brain detector restarts orphans after a multi-pod view comparison.
- Validation (`internal/validation/`): auth, backup, cluster, database (cluster+standalone), image, memory, plugin, resource, security, shardeddatabase, storage, TLS, topology, and upgrade validators; the edition validator is stubbed out.
- Resources/Clients: `internal/resources` builds Kubernetes objects (TLS policies, discovery ports, memory sizing); `internal/neo4j` wraps Bolt, version parsing (5.x vs 2025.x), and upgrade safety checks.
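The split-brain detector's "multi-pod view comparison" can be sketched as a majority vote over the server list each pod reports (e.g. via `SHOW SERVERS`): pods whose view disagrees with the majority are the restart candidates. This is a stdlib-only illustration with hypothetical names; the real logic is in `internal/controller/splitbrain_detector.go`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// splitBrainSuspects compares the cluster view reported by each pod and
// returns the pods whose view differs from the majority view.
func splitBrainSuspects(views map[string][]string) []string {
	// Canonicalize a view so order does not matter.
	canon := func(servers []string) string {
		s := append([]string(nil), servers...)
		sort.Strings(s)
		return strings.Join(s, ",")
	}
	// Count how many pods share each canonical view.
	counts := map[string]int{}
	for _, v := range views {
		counts[canon(v)]++
	}
	majority := ""
	for key, n := range counts {
		if n > counts[majority] {
			majority = key
		}
	}
	// Pods outside the majority view are the split-brain suspects.
	var suspects []string
	for pod, v := range views {
		if canon(v) != majority {
			suspects = append(suspects, pod)
		}
	}
	sort.Strings(suspects)
	return suspects
}

func main() {
	views := map[string][]string{
		"server-0": {"server-0", "server-1", "server-2"},
		"server-1": {"server-0", "server-1", "server-2"},
		"server-2": {"server-2"}, // orphan sees only itself
	}
	fmt.Println(splitBrainSuspects(views)) // [server-2]
}
```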
| Area | Purpose |
|---|---|
| `cmd/main.go` | Manager entrypoint wiring controllers/webhooks. |
| `api/v1alpha1/` | CRD schemas listed above. |
| `internal/controller/` | Reconcilers, split-brain detector, topology scheduler, rolling upgrade, cache/configmap managers. |
| `internal/resources/` | Builders for StatefulSets/Services/ConfigMaps, memory sizing, TLS/discovery helpers. |
| `internal/neo4j/` | Bolt client + version helpers used by controllers/tests. |
| `internal/validation/` | Validation/recommendation logic per CRD. |
| `charts/neo4j-operator/` | Helm chart (README quick start). |
| `config/` | Kubebuilder manifests (CRDs, RBAC, samples, overlays). |
| `docs/` | User/developer guides, deployment/seed-URI/split-brain guides, quick reference, API reference. |
| `examples/` | Standalone, clusters, backup/restore, plugins, property sharding, E2E scenarios. |
| `test/` | Ginkgo integration suites, fixtures, helpers. |
| `scripts/` | Automation (`test-env.sh`, setup/cleanup, demos, RBAC helpers, verification). |
| `reports/` | Design history and implementation analyses. |
- Prereqs: Go 1.24+, Docker, kubectl, Kind, make, git; verify with `kind version`.
- Dev cluster: `make dev-cluster` (Kind `neo4j-operator-dev`, installs the cert-manager issuer). Test cluster via `make test-cluster`.
- Codegen/hygiene: `make manifests generate`; then `make fmt vet lint` (use `rg`/`goimports`; ASCII only).
- Build & load image: `make docker-build IMG=neo4j-operator:dev`, then `kind load docker-image neo4j-operator:dev --name neo4j-operator-dev` (or the test cluster). Never run the operator out-of-cluster.
- Deploy: `make deploy-dev`/`deploy-prod` (or `*-registry` for registry images); Helm chart under `charts/` if needed.
- Utilities: `make operator-setup`, `operator-status`, `operator-logs`; cleanup via `make dev-cluster-clean`, `dev-cluster-reset`, `dev-destroy` (similar `test-*` targets). Demos via `make demo`, `demo-fast`, `demo-setup`.
- Unit: `make test-unit` (envtest; skips integration dirs).
- Integration (`test/integration`, Ginkgo v2): `make test-integration` builds a local image, loads it into Kind `neo4j-operator-test`, deploys the operator, and runs the suite (timeouts ~5m; CI extends to 10–20m). CI-friendly subsets: `make test-integration-ci`, heavy `*-ci-full`, full workflow `make test-ci-local` (logs in `logs/`).
- Operator mode during integration tests: ALL paths now deploy in production mode. `make test-integration` uses `config/overlays/integration-test/kustomization.yaml` and deploys to `neo4j-operator-system` without `--mode=dev`. The suite finds the operator and waits for readiness before running specs. `.github/workflows/integration-tests.yml` (on-demand) does the same via its own `ci-temp` overlay. `make deploy-dev` is for manual debugging only (`neo4j-operator-dev`, `--mode=dev`); never use it as the pre-test deployment step.
- Cleanup rule: every spec must, in `AfterEach`, delete CRs, drop finalizers, and call `cleanupCustomResourcesInNamespace()`; do not rely on suite cleanup.
- Property sharding: opt-in only (`test/integration/property_sharding_test.go`); requires ≥5 servers and 4–8Gi memory each.
- Plugin tests: clusters expect env-var config, standalone expects ConfigMap content; keep the sources consistent.
- Scripts: `scripts/test-env.sh` manages Kind clusters; `scripts/run-tests-clean.sh` wraps `go test`.
- User docs: `docs/user_guide` (install/config, external access, topology placement, property sharding, backup/restore, security, performance, monitoring, upgrades, troubleshooting).
- Developer docs: `docs/developer_guide/` (architecture, development setup, testing instructions, Makefile reference).
- References: `docs/quick-reference/operator-modes-cheat-sheet.md`, `docs/api_reference/` for CRDs, `docs/user_guide/deployment.md`, `docs/user_guide/guides/seed-uri.md`, `docs/user_guide/troubleshooting/split-brain.md`.
- Examples: `examples/` for standalone/cluster minimal examples, plugins, backup/restore, property sharding, and E2E blueprints.
- Logs: `kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager -f` or `make operator-logs`.
- Events: all material state transitions emit structured Kubernetes events using constants from `internal/controller/events.go`. Monitor by reason:
  - `kubectl get events --field-selector reason=SplitBrainDetected -A`
  - `kubectl get events --field-selector reason=BackupFailed`
  - `kubectl get events --field-selector reason=ClusterFormationFailed`
- Live diagnostics: when `spec.monitoring.enabled=true` and the cluster is `Ready`, the operator collects `SHOW SERVERS`/`SHOW DATABASES` results into `status.diagnostics`. Two conditions, `ServersHealthy` and `DatabasesHealthy`, surface cluster health without `kubectl exec`.
- Prometheus metrics: custom metrics are exported under the `neo4j_operator_*` prefix:
  - `neo4j_operator_cluster_healthy` / `neo4j_operator_cluster_phase` / `neo4j_operator_cluster_replicas_total`
  - `neo4j_operator_server_health{server_name, server_address}`: per-server health from diagnostics
  - `neo4j_operator_backup_total` / `neo4j_operator_reconcile_duration_seconds`
  - `neo4j_operator_split_brain_detected_total`
- Inspect: `kubectl explain neo4jenterprisecluster.spec`, `kubectl describe neo4jenterprisecluster/<name>`; use `cypher-shell` in pods for cluster state.
- GitOps: ArgoCD health check scripts for all 7 CRDs in `docs/gitops/argocd-health-checks.yaml`. Flux uses the standard `Ready` condition automatically.
- Status conditions: all CRDs emit standardized conditions using helpers from `internal/controller/conditions.go`: `SetReadyCondition` (for `Ready`), `SetNamedCondition` (for `ServersHealthy`/`DatabasesHealthy`).
- Debug aids: see `CLAUDE.md` for enabling debug logging, OOM checks, and port-forward guidance.
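A condition helper of the `SetNamedCondition` kind typically upserts by `Type` and bumps the transition timestamp only when the status actually flips, which is what makes `LastTransitionTime` useful for GitOps health checks. The sketch below is a stdlib-only approximation of that contract with assumed names; the actual helpers live in `internal/controller/conditions.go` and operate on `metav1.Condition`.

```go
package main

import (
	"fmt"
	"time"
)

// Condition mirrors a subset of metav1.Condition's fields (sketch only).
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	Reason             string
	LastTransitionTime time.Time
}

// setNamedCondition upserts a condition by Type, preserving the existing
// transition time when the status is unchanged so that consumers can see
// how long the cluster has been in its current state.
func setNamedCondition(conds []Condition, c Condition) []Condition {
	for i, existing := range conds {
		if existing.Type == c.Type {
			if existing.Status == c.Status {
				c.LastTransitionTime = existing.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	conds := setNamedCondition(nil, Condition{Type: "ServersHealthy", Status: "True", Reason: "AllServersEnabled", LastTransitionTime: t0})
	// Re-asserting the same status keeps one entry and preserves t0.
	conds = setNamedCondition(conds, Condition{Type: "ServersHealthy", Status: "True", Reason: "AllServersEnabled", LastTransitionTime: t0.Add(time.Hour)})
	fmt.Println(len(conds), conds[0].LastTransitionTime.Equal(t0))
}
```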
- Follow `CONTRIBUTING.md`; run `make fmt lint test-unit` (plus integration when relevant) before PRs; Kind required.
- Keep CRDs regenerated with `make manifests`; run `make generate` if API types change.
- Prefer Makefile/scripts over ad-hoc commands; do not reintroduce the edition field or community images.
- Maintain docs/examples/API references when behavior changes; add design reports under `reports/` for architectural shifts.
- Preserve `AfterEach` cleanup in integration tests; respect the usage of `retry.RetryOnConflict` and the server-based topology helpers.
- Default to ASCII text; prefer `rg` for search. Avoid creating new docs unless explicitly requested.
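The `retry.RetryOnConflict` pattern the rules above require comes from `k8s.io/client-go/util/retry`: re-read the object and re-apply the mutation whenever the API server rejects a stale write. A stdlib-only sketch of that optimistic-concurrency loop (the real helper also applies exponential backoff between attempts):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes 409 Conflict (stale resourceVersion).
var errConflict = errors.New("conflict: resourceVersion changed")

// retryOnConflict mimics client-go's retry.RetryOnConflict: re-run fn while
// it reports a conflict, up to maxRetries attempts; any other error (or
// success) ends the loop immediately.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		if attempts < 3 {
			return errConflict // simulate two stale-write conflicts
		}
		return nil // update succeeds after a fresh read
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```

In the operator itself, `fn` must re-fetch the object before mutating it; retrying the same stale copy would conflict forever.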
- `README.md` for requirements and quick start.
- `docs/developer_guide/architecture.md` for design (server architecture, centralized backup, controllers).
- `internal/controller/neo4jenterprisecluster_controller.go`, `splitbrain_detector.go`, `rolling_upgrade.go`, `topology_scheduler.go` for reconciliation/safety flows.
- `test/integration/integration_suite_test.go` for harness assumptions and cleanup expectations.
- `CLAUDE.md` (and this file) for agent-specific guardrails before making changes.