AGENTS.md

Quick Mission

Neo4j Enterprise Operator (Kubebuilder/controller-runtime 0.21, Go 1.24) automates Neo4j Enterprise 5.26.x (last semver LTS) and 2025.x.x+ (CalVer) on Kubernetes.
Status: alpha; expect churn. Follow docs/ plus historical reports in reports/.
Platform assumptions: Kubernetes ≥1.21, cert-manager ≥1.18 for TLS, and Kind-only for any cluster work.

Non-Negotiables (read with `CLAUDE.md`)

Kind only: use make dev-cluster, test-cluster, operator-setup; never minikube/k3s.
Enterprise images only (neo4j:<version>-enterprise). Discovery uses LIST resolver with static pod FQDNs (port 6000); 5.26.x requires explicit V2_ONLY, CalVer 2025.x+ (including 2026.x+) omits the flag — handled in buildVersionSpecificDiscoveryConfig() via isCalverImage() / ParseVersion.
Operator must run in-cluster: never make dev-run/hack/dev-run.sh or host-mode runs.
Server-based design: single {cluster}-server StatefulSet with topology.servers; preserve topology.serverModeConstraint/serverRoles hints and centralized {cluster}-backup StatefulSet.
Conflict-safe writes: wrap creates/updates in retry.RetryOnConflict; when checking existing resources use UID (not ResourceVersion) for template comparison.
Edition removed: API assumes Enterprise; do not reintroduce the field or allow community images.
Safety hooks: keep split-brain detector (internal/controller/splitbrain_detector.go) wiring and event reason SplitBrainDetected; status .phase drives readiness checks.
Plugin rules: APOC via env vars; Bloom/GDS/GenAI and similar via ConfigMap with automatic security defaults and dependency resolution; respect cluster vs standalone naming/labeling in plugin_controller.
Property sharding: Neo4jShardedDatabase + related tests stay opt-in; requires ≥5 servers with 4–8Gi each, so keep the resource gates that enforce this in place.
Database & TLS: Neo4jDatabase must work for cluster and standalone (ensure NEO4J_AUTH for standalone), respect seedURI/seedConfig, and keep TLS automation working for spec.tls.mode=cert-manager. Discovery settings differ by version: always go through buildVersionSpecificDiscoveryConfig(), never hard-code K8S or V1 discovery settings.
CRD scope separation: Cluster/Standalone manage infra/config; Database manages database lifecycle only—no cross-CRD overrides.
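
The discovery rule above (LIST resolver, static pod FQDNs on port 6000, V2_ONLY only for 5.26.x) can be sketched roughly as follows. This is an illustration, not the operator's code: isCalVer and discoveryEndpoints are hypothetical helpers, the `<cluster>-discovery` headless service name is an assumption, and treating any major version of 2025 or higher as CalVer is a simplification of what isCalverImage() / ParseVersion do.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isCalVer reports whether an image tag such as "2025.01.0-enterprise" uses
// calendar versioning. Assumption: any major version >= 2025 is CalVer.
func isCalVer(tag string) (bool, error) {
	major, _, _ := strings.Cut(tag, ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return false, fmt.Errorf("unparseable version tag %q: %w", tag, err)
	}
	return n >= 2025, nil
}

// discoveryEndpoints lists one static pod FQDN per server on port 6000.
// Assumption: pods are named <cluster>-server-<i> behind a headless service
// called <cluster>-discovery; the real service name may differ.
func discoveryEndpoints(cluster, namespace string, servers int) []string {
	endpoints := make([]string, 0, servers)
	for i := 0; i < servers; i++ {
		endpoints = append(endpoints, fmt.Sprintf(
			"%s-server-%d.%s-discovery.%s.svc.cluster.local:6000",
			cluster, i, cluster, namespace))
	}
	return endpoints
}

func main() {
	for _, tag := range []string{"5.26.0-enterprise", "2025.01.0-enterprise"} {
		calver, err := isCalVer(tag)
		if err != nil {
			panic(err)
		}
		// 5.26.x needs the explicit V2_ONLY flag; CalVer releases omit it.
		fmt.Printf("%s needsV2Only=%t\n", tag, !calver)
	}
	fmt.Println(strings.Join(discoveryEndpoints("graph", "default", 3), ","))
}
```

In the real code path the decision stays inside buildVersionSpecificDiscoveryConfig(); the sketch only shows why a single version check is enough to pick the right settings.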

Architecture Anchors

CRDs (api/v1alpha1): Neo4jEnterpriseCluster, Neo4jEnterpriseStandalone, Neo4jDatabase, Neo4jPlugin, Neo4jBackup, Neo4jRestore, Neo4jShardedDatabase.
Controllers (internal/controller/):
- Cluster/Standalone reconcilers build ConfigMaps/Services/StatefulSets, manage status phases, TLS, auth, placement, and cache/configmap managers.
- Database controller auto-detects cluster vs standalone, uses appropriate Bolt client, and supports wait/ifNotExists/topology/seed flows.
- Plugin controller manages env-vs-config installs, dependencies, and pod readiness per deployment type.
- Backup controller runs centralized {cluster}-backup pods with request file drop; Restore controller handles point-in-time (PIT) restores.
- Sharded database controller enforces property-sharding readiness; Topology scheduler adds AZ spread/anti-affinity; Rolling upgrade orchestrator handles leader-aware upgrades and metrics.
- Split-brain detector restarts orphans after multi-pod view comparison.
Validation (internal/validation/): auth, backup, cluster, database (cluster+standalone), image, memory, plugin, resource, security, shardeddatabase, storage, TLS, topology, upgrade validators; edition validator stubbed out.
Resources/Clients: internal/resources builds Kubernetes objects (TLS policies, discovery ports, memory sizing); internal/neo4j wraps Bolt, version parsing (5.x vs 2025.x), and upgrade safety checks.
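
As a rough illustration of the "multi-pod view comparison" mentioned for the split-brain detector, the sketch below groups pods by the set of servers each one reports and flags the pods outside the largest group. It is a simplified model under stated assumptions: findOrphans is a hypothetical function, the input shape (pod name to visible server names) is invented for the example, and the real logic in splitbrain_detector.go may weigh views differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// findOrphans takes each pod's view of cluster membership and returns the
// pods whose view differs from the most common one. Ties between equally
// large groups are broken on the canonical view string for determinism.
func findOrphans(views map[string][]string) []string {
	groups := map[string][]string{} // canonical view -> pods reporting it
	for pod, members := range views {
		sorted := append([]string(nil), members...)
		sort.Strings(sorted)
		key := strings.Join(sorted, ",")
		groups[key] = append(groups[key], pod)
	}

	// Pick the largest group as the majority view.
	majority, first := "", true
	for key, pods := range groups {
		best := len(groups[majority])
		if first || len(pods) > best || (len(pods) == best && key < majority) {
			majority, first = key, false
		}
	}

	var orphans []string
	for key, pods := range groups {
		if key != majority {
			orphans = append(orphans, pods...)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	views := map[string][]string{
		"graph-server-0": {"graph-server-0", "graph-server-1", "graph-server-2"},
		"graph-server-1": {"graph-server-1", "graph-server-0", "graph-server-2"},
		"graph-server-2": {"graph-server-2"}, // sees only itself
	}
	fmt.Println(findOrphans(views))
}
```

Sorting each view before comparing makes the check independent of the order in which a pod lists its peers.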

Repository Atlas

| Area | Purpose |
| --- | --- |
| `cmd/main.go` | Manager entrypoint wiring controllers/webhooks. |
| `api/v1alpha1/` | CRD schemas listed above. |
| `internal/controller/` | Reconcilers, split-brain detector, topology scheduler, rolling upgrade, cache/configmap managers. |
| `internal/resources/` | Builders for StatefulSets/Services/ConfigMaps, memory sizing, TLS/discovery helpers. |
| `internal/neo4j/` | Bolt client + version helpers used by controllers/tests. |
| `internal/validation/` | Validation/recommendation logic per CRD. |
| `charts/neo4j-operator/` | Helm chart (README quick start). |
| `config/` | Kubebuilder manifests (CRDs, RBAC, samples, overlays). |
| `docs/` | User/developer guides, deployment/seed-URI/split-brain guides, quick reference, API reference. |
| `examples/` | Standalone, clusters, backup/restore, plugins, property sharding, E2E scenarios. |
| `test/` | Ginkgo integration suites, fixtures, helpers. |
| `scripts/` | Automation (`test-env.sh`, setup/cleanup, demos, RBAC helpers, verification). |
| `reports/` | Design history and implementation analyses. |

Development Workflow

Prereqs: Go 1.24+, Docker, kubectl, Kind, make, git; confirm Kind is installed by running kind version.
Dev cluster: make dev-cluster (Kind neo4j-operator-dev, installs cert-manager issuer). Test cluster via make test-cluster.
Codegen/hygiene: make manifests generate; then make fmt vet lint (use rg/goimports; ASCII only).
Build & load image: make docker-build IMG=neo4j-operator:dev then kind load docker-image neo4j-operator:dev --name neo4j-operator-dev (or test cluster). Never run operator out-of-cluster.
Deploy: make deploy-dev/deploy-prod (or *-registry for registry images); Helm chart under charts/ if needed.
Utilities: make operator-setup, operator-status, operator-logs; cleanup via make dev-cluster-clean, dev-cluster-reset, dev-destroy (similar test-* targets). Demos via make demo, demo-fast, demo-setup.

Testing Strategy

Unit: make test-unit (envtest; skips integration dirs).
Integration (test/integration, Ginkgo v2): make test-integration builds local image, loads into Kind neo4j-operator-test, deploys operator, runs suite (timeouts ~5m, CI extends to 10–20m). CI-friendly subsets: make test-integration-ci, heavy *-ci-full, full workflow make test-ci-local (logs in logs/).
Operator mode during integration tests — ALL paths now deploy in production mode:
- make test-integration uses config/overlays/integration-test/kustomization.yaml → deploys to neo4j-operator-system without --mode=dev. Suite finds the operator and waits for readiness before running specs.
- .github/workflows/integration-tests.yml (on-demand) does the same via its own ci-temp overlay.
- make deploy-dev is for manual debugging only (neo4j-operator-dev, --mode=dev); never use it as the pre-test deployment step.
Cleanup rule: every spec's AfterEach must delete its CRs, remove finalizers, and call cleanupCustomResourcesInNamespace(); do not rely on suite-level cleanup.
Property sharding: opt-in only (test/integration/property_sharding_test.go), requires ≥5 servers and 4–8Gi memory each.
Plugin tests: clusters expect env var config, standalone expects ConfigMap content; keep sources consistent.
Scripts: scripts/test-env.sh manages Kind clusters; scripts/run-tests-clean.sh wraps go test.
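
The property-sharding gate described above (at least 5 servers, 4–8Gi memory each) could be expressed as a small pre-check like the one below. This is a sketch only: checkShardingGate is a hypothetical helper, and it assumes that 4Gi is the hard minimum while 8Gi is merely the upper end of the recommended range.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	gib                = int64(1) << 30
	minShardingServers = 5       // from the ">=5 servers" rule
	minServerMemory    = 4 * gib // assumption: lower bound of "4-8Gi" is the hard minimum
)

// checkShardingGate returns nil when the topology is large enough to run
// property-sharding workloads, and a descriptive error otherwise.
func checkShardingGate(servers int, memoryBytesPerServer int64) error {
	if servers < minShardingServers {
		return fmt.Errorf("property sharding needs at least %d servers, got %d",
			minShardingServers, servers)
	}
	if memoryBytesPerServer < minServerMemory {
		return errors.New("property sharding needs at least 4Gi of memory per server")
	}
	return nil
}

func main() {
	fmt.Println(checkShardingGate(3, 8*gib)) // too few servers
	fmt.Println(checkShardingGate(5, 4*gib)) // passes the gate
}
```

An opt-in test could call a check like this first and skip the spec when it returns an error, rather than failing on an undersized Kind cluster.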

Documentation & Examples

User docs: docs/user_guide (install/config, external access, topology placement, property sharding, backup/restore, security, performance, monitoring, upgrades, troubleshooting).
Developer docs: docs/developer_guide/ (architecture, development setup, testing instructions, Makefile reference).
References: docs/quick-reference/operator-modes-cheat-sheet.md, docs/api_reference/ for CRDs, docs/user_guide/deployment.md, docs/user_guide/guides/seed-uri.md, docs/user_guide/troubleshooting/split-brain.md.
Examples: examples/ for minimal standalone/cluster setups, plugins, backup/restore, property sharding, E2E blueprints.

Observability & Troubleshooting

Logs: kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager -f (use -n neo4j-operator-dev for a make deploy-dev install) or make operator-logs
Events: All material state transitions emit structured Kubernetes events using constants from internal/controller/events.go. Monitor by reason:
- kubectl get events --field-selector reason=SplitBrainDetected -A
- kubectl get events --field-selector reason=BackupFailed
- kubectl get events --field-selector reason=ClusterFormationFailed
Live Diagnostics: When spec.monitoring.enabled=true and cluster is Ready, the operator collects SHOW SERVERS/SHOW DATABASES results into status.diagnostics. Two conditions — ServersHealthy and DatabasesHealthy — surface cluster health without kubectl exec.
Prometheus Metrics: Custom metrics exported under neo4j_operator_* prefix:
- neo4j_operator_cluster_healthy / neo4j_operator_cluster_phase / neo4j_operator_cluster_replicas_total
- neo4j_operator_server_health{server_name, server_address} — per-server health from diagnostics
- neo4j_operator_backup_total / neo4j_operator_reconcile_duration_seconds
- neo4j_operator_split_brain_detected_total
Inspect: kubectl explain neo4jenterprisecluster.spec, kubectl describe neo4jenterprisecluster/<name>; use cypher-shell in pods for cluster state.
GitOps: ArgoCD health check scripts for all 7 CRDs in docs/gitops/argocd-health-checks.yaml. Flux uses the standard Ready condition automatically.
Status Conditions: All CRDs emit standardized conditions using helpers from internal/controller/conditions.go: SetReadyCondition (for Ready), SetNamedCondition (for ServersHealthy/DatabasesHealthy).
Debug aids: see CLAUDE.md for enabling debug logging, OOM checks, and port-forward guidance.

Contribution Expectations

Follow CONTRIBUTING.md; run make fmt lint test-unit (plus integration when relevant) before PRs; Kind required.
Keep CRDs regenerated with make manifests; run make generate if API types change.
Prefer Makefile/scripts over ad-hoc commands; do not reintroduce edition field or community images.
Maintain docs/examples/API references when behavior changes; add design reports under reports/ for architectural shifts.
Preserve AfterEach cleanup in integration tests; respect usage of retry.RetryOnConflict and server-based topology helpers.
Default to ASCII text; prefer rg for search. Avoid creating new docs unless explicitly requested.

Where to Start

README.md for requirements and quick start.
docs/developer_guide/architecture.md for design (server architecture, centralized backup, controllers).
internal/controller/neo4jenterprisecluster_controller.go, splitbrain_detector.go, rolling_upgrade.go, topology_scheduler.go for reconciliation/safety flows.
test/integration/integration_suite_test.go for harness assumptions and cleanup expectations.
CLAUDE.md (and this file) for agent-specific guardrails before making changes.
