233 changes: 1 addition & 232 deletions ROADMAP.md
@@ -1,232 +1 @@
# KubeStellar Console Roadmap

This document outlines the planned direction for KubeStellar Console. It is a living document and will be updated as priorities evolve based on community feedback, user needs, and ecosystem changes.

## Completed Milestones

### v0.1 — Foundation (Q3 2025)
- Multi-cluster dashboard with real-time health monitoring
- Helm release tracking across clusters
- Pod, deployment, and event monitoring cards
- Demo mode with MSW mock data for offline usage
- GitHub OAuth authentication
- Dark/light theme support

### v0.2 — Intelligence Layer (Q4 2025)
- AI-powered missions system with Claude and kagent integration
- Community missions browser with console-kb knowledge base
- Contributor rewards system with leaderboard and coin economy
- 80+ dashboard cards covering CNCF ecosystem
- GPU monitoring cards (overview, inventory, utilization, reservations)
- OPA, Kyverno, Falco, and Trivy security cards
- ArgoCD application monitoring
- Drag-and-drop dashboard customization with card catalog

### v0.3 — Scale & Operations (Q1–Q2 2026)
- **Console Studio** — Visual dashboard builder with AI card generation
- **Mission Control** — Guided CNCF project deployment with Flight Plan blueprint, phased launch, and AI-assisted cluster assignment; dry-run mode and kind cluster E2E tests
- **Orbital Maintenance** — Automated cluster maintenance missions with scheduling
- **Benchmark streaming** — Real-time vLLM/llm-d performance data via Google Drive with hardware leaderboards
- **GPU namespace drill-down** — Per-GPU-type, per-node allocation views
- **Workload import dialog** — YAML, Helm, GitHub, and Kustomize import support
- **NPS survey system** — In-app Net Promoter Score feedback collection
- **VCluster and KubeVirt** cards for virtualized workloads
- **Marketplace** — Community card preset marketplace with 45+ CNCF project templates
- **OpenSSF Scorecard improvements** — Signed releases, SLSA provenance, scoped workflow permissions
- 160+ total dashboard cards
- Nightly and weekly automated releases with Helm OCI chart publishing
- Comprehensive Auto-QA workflows for code quality, governance, and UI consistency
- Contributor leaderboard with GitHub-synced rewards
- **AI Missions UX** — Message edit/resend, microphone input, scroll-to-bottom, draft click-to-open, history toggle panel, mission sort by activity, retry on failure, response cancellation
- **Auth hardening** — GA4 telemetry on auth failure paths (SSE 401, WS token missing, agent token failure, session refresh), agentFetch migration for all kc-agent calls, HS256-only JWT parsing (TAG-Security fix)
- **kc-agent API expansion** — `/nvidia-operators`, `/events/stream` SSE, `/federation/detect`, agent token bridging to frontend
- **Responsive container-query rollout** — Phase 3a/3b across 63 files: responsive skeleton grids, flex-wrap in CNCF status cards
- **Test infrastructure** — Coverage from 0% to 91%: 10,000+ unit tests, 12-shard parallel coverage, coverage regression guard with auto-issue, post-merge Playwright verification against production
- **Code quality automation** — UI/UX standards scanner with Storybook and Playwright visual regression, post-build vendor safety checks, MSW catch-all for unmocked routes
- **Backend refactoring** — Monolith splits: sqlite.go (3,321 → 8 files), server_http.go/server_ai.go/server_operations.go into domain handlers, CardWrapper.tsx into 4 sub-components; 609 fmt.Sprintf calls converted to structured slog fields
- **ArgoCD ApplicationSet** integration with security fixes
- **Saved Filter Sets** — Snapshot all filters into named presets; merged Project Selector and Filter Panel into single dropdown
- **Learn dropdown** — Auto-populated from YouTube playlist with video tutorials
- **Claude Code GitHub Action** — AI-assisted PR review and issue triage via Claude Opus 4.6

## v0.4 — AI-Native Observability (Target: Q3 2026)

This milestone crystallizes the near-term roadmap items into a cohesive theme: establishing KubeStellar Console as the canonical AI/ML workload visibility and operations layer for Kubernetes.

### Core Scope

- **llm-d stack monitoring** — First-class support for llm-d inference serving: EPP routing, model endpoint health, autoscaler status, disaggregated serving topology
- **Drasi reactive pipelines** — Real-time change-feed dashboard for Drasi continuous queries, sources, and reactions across deployment modes (drasi-server, drasi-platform, CRD-based)
- **kagent/kagenti integration** — Full agent lifecycle management through MCP-compatible interfaces

### Quality & Testing

- **Nightly E2E expansion** — Automated end-to-end testing across all 8 llm-d deployment guides on OpenShift
- **Marketplace v2** — Require live data hooks, unified controls, demo data, and install links for all card presets; community review process

### UX & Accessibility

- **i18n completeness** — Eliminate all hardcoded English strings; prepare for community localization contributions
- **Accessibility audit** — Replace remaining `window.confirm()` dialogs, add ARIA labels, keyboard navigation for all interactive elements
- **GA4 UX funnel** — Measure conversion from landing to agent install to first mission; identify and fix drop-off points
- **Component consistency** — Migrate remaining raw HTML elements to shared UI components (Button, Modal, Dialog); standardize modal visibility patterns

### Community Health

- **Adopters program** — Populate ADOPTERS.md with confirmed production users; define maturity tiers (install-mission vs. production deployment)
- **Contributor onboarding** — Establish PR triage SLA, define `ai-needs-human` escalation path, and publish contributor guide update; see `docs/plans/PR-TRIAGE-SLA.md`
- **Adoption metrics** — Replace all `TBD` fields in `docs/adoption-metrics.md` with real measurements before any CNCF application

### Tech Debt Unblocking Strategy

As the codebase grows beyond 160 dashboard cards and 10,000 unit tests, technical debt items that were previously deprioritized ("hold" status) have become scaling risks. This section defines how to address that accumulated debt before it slows delivery.

**Priority 1: Performance & Scalability**
- **Card render optimization** — Audit and fix cards with >500ms initial render time; establish performance budgets per card type
- **Cache eviction policy** — Implement LRU eviction for SQLite WASM cache to prevent unbounded growth; target <50MB cache size
- **Test parallelization** — Reduce CI test suite runtime from current baseline; investigate Jest worker memory limits
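
The cache eviction item above can be illustrated with a minimal sketch of size-bounded LRU eviction. Everything here is an assumption made for illustration: the class name `ByteBudgetLru`, the caller-supplied byte sizes, and the in-memory `Map` standing in for the SQLite WASM cache. It is not the Console's actual cache code.

```typescript
// Hypothetical sketch: an LRU cache bounded by total bytes rather than entry count.
// A Map iterates in insertion order, so the first key is the least recently used.
class ByteBudgetLru<V> {
  private entries = new Map<string, { value: V; bytes: number }>();
  private totalBytes = 0;
  private maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V, bytes: number): void {
    const previous = this.entries.get(key);
    if (previous !== undefined) {
      this.totalBytes -= previous.bytes;
      this.entries.delete(key);
    }
    // An entry larger than the whole budget can never fit; skip it.
    if (bytes > this.maxBytes) return;
    this.entries.set(key, { value, bytes });
    this.totalBytes += bytes;
    // Evict least recently used entries until the budget is respected.
    while (this.totalBytes > this.maxBytes) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      const evicted = this.entries.get(oldest);
      this.entries.delete(oldest);
      if (evicted !== undefined) this.totalBytes -= evicted.bytes;
    }
  }

  get usedBytes(): number {
    return this.totalBytes;
  }
}
```

With the roadmap's target, the budget would be constructed as `new ByteBudgetLru(50 * 1024 * 1024)`. How entry sizes are measured for the real cache is left open here.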

**Priority 2: Code Health**
- **TypeScript strict mode** — Enable `strict: true` incrementally, starting with new files; eliminate remaining `any` types in card components
- **Dependency updates** — Unblock Vite 6, React 19, and Tailwind 4 upgrades currently held due to breaking changes; allocate dedicated sprint
- **Bundle size** — Audit and tree-shake unused dependencies; target <2MB initial JS bundle (currently ~2.8MB)
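
As one possible pattern for removing `any` from card components, an untyped payload can be held as `unknown` and narrowed with a type guard. The `PodCardData` shape and the function names below are hypothetical; the roadmap does not show the real card types.

```typescript
// Hypothetical card payload shape, used only for illustration.
interface PodCardData {
  name: string;
  restarts: number;
}

// Type guard: narrows an unknown value to PodCardData without using `any`.
function isPodCardData(value: unknown): value is PodCardData {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.name === "string" && typeof record.restarts === "number";
}

// Parses a JSON string and returns typed data, or undefined if the shape is wrong.
// JSON.parse still throws on malformed JSON; callers handle that separately.
function parsePodCard(raw: string): PodCardData | undefined {
  const parsed: unknown = JSON.parse(raw);
  return isPodCardData(parsed) ? parsed : undefined;
}
```

This style compiles under `strict: true`, so files converted this way can be moved to strict mode one at a time.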

**Priority 3: Developer Experience**
- **Storybook coverage** — Achieve 80% component coverage in Storybook (currently ~40%); prioritize cards with complex state
- **E2E test stability** — Fix flaky Playwright tests in `nightly-e2e` workflow; define retry/timeout standards
- **Documentation debt** — Update outdated API docs in `pkg/api/`, particularly for Stellar subsystem endpoints

**Execution Model**
- Allocate 20% of each sprint cycle to tech debt work (approximately 1 issue per developer per 2-week sprint)
- Tag tech debt issues with `tech-debt` label and priority tier (`p1-perf`, `p2-health`, `p3-dx`)
- Track tech debt ratio (tech debt issues / total issues) as a key health metric; target <15%
- Block new feature work if tech debt ratio exceeds 25% or any P1 item is open >30 days
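
The thresholds in the execution model can be written down as a small check. This is a sketch under stated assumptions: the `DebtSnapshot` shape and function names are hypothetical, and the roadmap does not say whether "total issues" means open issues or all issues, so the caller decides what to count.

```typescript
// Hypothetical input: counts gathered from the issue tracker by the caller.
interface DebtSnapshot {
  techDebtIssues: number;
  totalIssues: number;
  // Age in days of the oldest open P1 item, or null when no P1 item is open.
  oldestOpenP1AgeDays: number | null;
}

const TARGET_RATIO = 0.15; // health target from the roadmap
const BLOCK_RATIO = 0.25; // feature work is blocked above this ratio
const MAX_P1_AGE_DAYS = 30; // feature work is blocked if a P1 is open longer

// Tech debt ratio = tech debt issues / total issues (0 when there are no issues).
function techDebtRatio(s: DebtSnapshot): number {
  return s.totalIssues === 0 ? 0 : s.techDebtIssues / s.totalIssues;
}

// True when the ratio is within the <15% health target.
function meetsTarget(s: DebtSnapshot): boolean {
  return techDebtRatio(s) < TARGET_RATIO;
}

// True when either blocking condition from the execution model holds.
function shouldBlockFeatureWork(s: DebtSnapshot): boolean {
  const staleP1 = s.oldestOpenP1AgeDays !== null && s.oldestOpenP1AgeDays > MAX_P1_AGE_DAYS;
  return techDebtRatio(s) > BLOCK_RATIO || staleP1;
}
```

For example, 20 tech debt issues out of 100 gives a ratio of 0.20: above the 15% target but below the 25% blocking threshold, so feature work continues unless a P1 item has been open for more than 30 days.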

## Near-Term (Q2–Q3 2026)

See **v0.4 — AI-Native Observability** milestone above for the full near-term feature scope, quality gates, and community health targets.

**Branch Stability Covenant (effective immediately):** Main branch must remain green at all times. A post-merge integration smoke gate (combining TS build, auth smoke, and workflow startup checks) is required before new feature PRs are merged. See issue [#17756](https://github.com/kubestellar/console/issues/17756) for tracking.

## Mid-Term (Q3–Q4 2026)

- **Stellar subsystem GA** — Graduate the Stellar persistent AI runtime from alpha to GA: finalize CRD versioning (v1 stability), complete Mission Operator test coverage, publish upgrade path documentation, and achieve at least one confirmed non-demo deployment. GA criteria tracked in [#17757](https://github.com/kubestellar/console/issues/17757). Stellar GA is the strategic milestone that moves Console from a dashboard to a production AI operations runtime.
- **GitOps integration milestone** — First-class Flux + Argo CD support with observability parity, declarative Console configuration, and Mission Control deep links; see `docs/plans/GITOPS-INTEGRATION-RFC.md`
- **Multi-tenant RBAC** — Role-based access control for teams sharing a Console instance, with namespace-scoped permissions
- **Plugin architecture** — Extensible card and mission system allowing third-party developers to build custom dashboard components; see `docs/plans/PLUGIN-ARCHITECTURE-RFC.md` (RFC to be authored — tracked in [#17760](https://github.com/kubestellar/console/issues/17760))
- **Helm operator** — Kubernetes operator for fleet-wide Console deployment and lifecycle management
- **Enhanced AI missions** — AI-assisted troubleshooting missions that diagnose cluster issues and suggest remediation steps
- **Offline/air-gapped mode** — Full Console functionality without internet connectivity for restricted environments
- **CNCF incubation preparation** — Governance documentation, adopters program, and community growth metrics; target Q4 2026 TOC application
- **Third-party security audit (Q3 2026)** — Engage CNCF-sponsored auditors (ADA Logics or CNCF Security Audit program) for a formal code security audit; required gate for CNCF incubation. **Owner:** clubanderson. **Timeline:** Open a CNCF Security Audit request at https://github.com/cncf/toc/issues in Q2 2026; schedule audit completion for Q3 2026. This positions the project for a Q4 2026 incubation application with security due diligence completed.
- **Multi-model AI backend** — Support for multiple LLM providers (OpenAI, Ollama, vLLM) behind a unified mission interface, reducing vendor lock-in
- **Webhook-driven card updates** — Push-based card refresh via Kubernetes webhooks instead of polling, reducing API server load on large clusters
- **Custom alert rules** — User-defined threshold alerts on any card metric, with notification channels (Slack, email, PagerDuty)

## Long-Term (2027+)

- **Policy engine** — Built-in policy authoring, testing, and enforcement with OPA/Gatekeeper integration
- **AI-assisted operations** — Proactive anomaly detection, capacity planning, and automated incident response via MCP
- **Federation** — Console-to-Console federation for organizations managing multiple Console instances across regions
- **Compliance dashboards** — Automated compliance reporting against CIS benchmarks, SOC 2, and HIPAA requirements
- **Collaborative dashboards** — Real-time multi-user dashboard editing with presence indicators and conflict resolution
- **Workflow automation** — Visual workflow builder for multi-step cluster operations (rolling upgrades, canary deployments, disaster recovery runbooks)
- **Embedded terminal** — In-browser kubectl/helm terminal with context-aware autocomplete, scoped to the user's RBAC permissions

## Non-Goals

KubeStellar Console intentionally does **not** aim to:

- **Replace kubectl** — Console is a visual companion, not a CLI replacement. Power users should continue using kubectl, helm, and other CLI tools directly.
- **Be a general-purpose IDE** — While Console includes AI-powered features, it is not a code editor or development environment.
- **Manage non-Kubernetes workloads** — Console focuses exclusively on Kubernetes clusters and cloud-native workloads.
- **Provide its own container runtime** — Console observes and manages existing clusters; it does not provision infrastructure.
- **Compete with commercial APM tools** — Console provides operational visibility, not deep application performance monitoring. Use Datadog, New Relic, or Grafana for APM.

## How to Influence the Roadmap

We welcome community input on priorities:

- **GitHub Issues** — Open an issue on [kubestellar/console](https://github.com/kubestellar/console/issues) with the `enhancement` label
- **Discussions** — Join [#kubestellar-dev on Slack](https://cloud-native.slack.com/channels/kubestellar-dev)
- **Mailing List** — Email [kubestellar-dev@googlegroups.com](mailto:kubestellar-dev@googlegroups.com)

---

## Strategic Health — June 2026

> Status snapshot filed by the strategist agent (ACMM L6). Updated when material risks to roadmap delivery are identified.
> **Last updated:** 2026-06-20 (11:06 AM EDT, pass 16)

### Current Risk Register

| Risk | Severity | Issue | Status |
|------|----------|-------|--------|
| v0.4 T-11 days: FINAL WARNING — zero feature PRs across all 7 scope items, Q3 starts July 1 | 🔴 Critical | #19307 #19257 | Scope decision + feature captain required immediately |
| Coverage suite 100% collapse — all 12 shards failing 4+ days; v0.3 "91% coverage" claim unsupportable | 🔴 Critical | #19158 | CI fix attempts in progress (#19292 #19293 merged) but coverage still down |
| GitHub branch protection still absent — policy files advisory-only, merges unblocked | 🔴 Critical | #18355 | 5-minute fix: GitHub Settings → Branch protection |
| CNCF security audit Q2 action overdue — 40 days past deadline | 🔴 Critical | #18207 | Requires @clubanderson action at github.com/cncf/toc/issues |
| Nightly CI: partial recovery in progress — regressions fixed, coverage still broken | 🟠 High | #19005 | #19292 (Playwright) + #19293 (nightly regressions) merged Jun 20 |
| PR hygiene crisis: 67% DCO failures + 37% WIP zombies | 🟠 High | #19007 | Structural Copilot DCO gap; needs process fix |
| ADOPTERS.md self-referential — no external adopters listed | 🟠 High | — | Ongoing |
| PR triage SLA absent — ai-needs-human PRs still lack escalation (#18598 #18599 open) | 🟡 Medium | #18037 | @Jayant-kernel contributed Auto-QA SLA doc (#19291) — partial progress |
| Tech-debt arch refactors: #17124, #17576, #17882, #17883 still open | 🟡 Medium | #17883 | Architect in progress |
| Stellar subsystem — no GA milestone or alpha exit criteria | 🟡 Medium | #17757 | Tracked |
| CNCF incubation tracker on `hold` | 🟡 Medium | #4072 | Blocked pending security audit + adopters |
| ~~Auto-QA triage backlog: 4 issues stuck in ai-needs-human~~ | ~~🟠 High~~ | ~~#19256~~ | ✅ Closed as completed Jun 20 — @Jayant-kernel added Auto-QA SLA doc (#19291) |
| ~~SSRF: IsBlockedIP missing IsMulticast~~ | ~~🟠 High~~ | ~~#18372~~ | ✅ Fixed |
| ~~Community PRs prow-gated: @bmvinay7 + @AdeshDeshmukh~~ | ~~🟠 High~~ | ~~#18305~~ | ✅ All three June 13 wave PRs merged |
| ~~Coverage suite: 39% run failure rate~~ | ~~🟠 High~~ | ~~#18533~~ | ⬆️ Escalated to Critical: 100% collapse |
| ~~Auth smoke test regression~~ | ~~🔴 Critical~~ | ~~#18354~~ | ✅ Fixed |
| ~~CSP `unsafe-eval` default~~ | ~~🟠 High~~ | ~~#18326~~ | ✅ Fixed |
| ~~Playwright Firefox nightly failing~~ | ~~🟠 High~~ | ~~#18304~~ | ✅ Fixed |
| ~~Nightly CI trifecta~~ | ~~🔴 Critical~~ | ~~#18299-18301~~ | ✅ Resolved |
| ~~Coverage suite: 67 failures~~ | ~~🟠 High~~ | ~~#18226~~ | ✅ Fixed |
| ~~Merge gate disabled~~ | ~~🔴 Critical~~ | ~~#17852~~ | ✅ Closed |
| ~~DCO sign-off failures~~ | ~~🔴 Critical~~ | ~~#17966~~ | ✅ Closed |

### Community Momentum 🌱

**@Jayant-kernel: most active external contributor** — 3 PRs merged in 2 days, including governance:
- PR #19251 `docs: promote console marketplace in README` ✅ Merged 2026-06-19
- PR #19291 `docs: add Auto-QA triage SLA` ✅ Merged 2026-06-20 (directly addressed strategist gap #19256)
- Strategic retention opportunity filed: #19306

**@ashnaaseth2325-oss: returning contributor:**
- PR #19225 `fix: incorrect rollback actions after Helm release rollback` ✅ Merged 2026-06-19 (2nd PR)

**Earlier June wave — all merged:**
- **@bmvinay7** — PR #18264 ✅ Merged 2026-06-16
- **@ashnaaseth2325-oss** — PR #18377 ✅ Merged 2026-06-16
- **@AdeshDeshmukh** — PR #18373 ✅ Merged 2026-06-17

**Human contributor ratio: ~15%** (3 distinct external contributors in June). Exceeds 10% target for the first time; community health signal is positive.

### v0.4 Delivery Prerequisites

Before v0.4 ("AI-Native Observability") can ship on schedule (Q3 2026), the following must be resolved, ordered by urgency:

1. **Scope decision + feature captain** (#18974, #19307) — Q3 starts July 1 (T-11 days). No feature work has started for any of the 7 scope items. Designate a feature captain and choose whether to reduce scope to Tier 1 (llm-d monitoring only) or slip the deadline to Q4.
2. **Fix coverage suite** (#19158) — 4+ days of 100% failure; CI recovery PRs (#19292, #19293) merged but coverage still down. Assign targeted fix.
3. **Enable GitHub branch protection on `main`** (#18355) — [Configure here](https://github.com/kubestellar/console/settings/branch_protection_rules). 5-minute task.
4. **File CNCF security audit** (#18207) — Q2 deadline passed 40 days ago. File at `github.com/cncf/toc/issues`.
5. **Engage @Jayant-kernel for v0.4 feature work** (#19306) — 3 PRs merged, governance contribution. Strategic window to channel toward llm-d or Drasi issues (#18031, #18032).
6. **Tag ≥20 issues `good-first-issue`** (#18785) — Hacktoberfest 2026 signups begin in September.
7. **External adopter recruitment** — ADOPTERS.md needs ≥3 external organizations before CNCF application.

### Adoption Readiness

| Signal | Target | Current |
|--------|--------|---------|
| Main branch build stability | Green ≥14 consecutive days | 🔴 Coverage suite 100% collapse (#19158); branch protection absent (#18355) |
| Test infrastructure | All CI shards passing | 🔴 All 12 coverage shards failing 4+ days; Playwright + nightly regressions fixed Jun 20 |
| Feature velocity | ≥1 v0.4 feature PR merged | 🔴 Zero v0.4 feature PRs; Q3 starts July 1 |
| External adopters in ADOPTERS.md | ≥3 confirmed orgs | ❌ 0 external (KubeStellar self-listed only) |
| Human contributor ratio | ≥10% of merged PRs | ✅ ~15% (3 external contributors this month); first time target met |
| Community contributors active | ≥2 distinct contributors/month | ✅ @Jayant-kernel (3 PRs), @ashnaaseth2325-oss (2 PRs), @AdeshDeshmukh (1 PR) |
| Community PR merge time | ≤7 days for first-time contributors | ✅ June wave: all merged within 4 days |
| `good-first-issue` label coverage | ≥20 issues tagged | ❌ 0 issues tagged (#18785 open) |
| Security posture | No active sec-check findings | ✅ All security findings fixed; 1 resilience fix in progress (#19304) |
| Auto-QA SLA defined | Documented in `docs/plans/` | ✅ Auto-QA triage SLA doc added by @Jayant-kernel (#19291) |
| CNCF security audit | Filed | ❌ Q2 deadline passed 40 days ago (#18207) |
| CNCF incubation application | Filed | ⏸ On hold (#4072) |
/tmp/roadmap-p17.md