Enhance Drasi Server with Web UI, solutions, and dynamic loading of pluggins.#84
Conversation
- React 19 + TypeScript + Vite + Tailwind gaming-inspired management UI - Flow canvas with React Flow showing Sources → Queries → Reactions topology - Full CRUD: create/start/stop/delete for sources, queries, reactions, instances - Draft-based editing (local state until explicit save) - Per-kind config forms for all source types (mock, HTTP, gRPC, Postgres, platform), query config, and all reaction types (log, HTTP, gRPC, SSE, profiler) - Instance selector with create dialog for multi-instance management - Inspector panel with status, config details, connections, and action buttons - Real-time updates via SSE instead of polling (/api/v1/events endpoint) - Internal components (__introspection__, __attach_*) filtered from UI - Fix: Query status now reflects actual state instead of hardcoded 'Running' - Server: Added global component events SSE endpoint for reactive UI updates - Server: Added ComponentStatus Added/Removed variants to observability DTOs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… persistence Backend: - Add POST /api/v1/sources/:id/push proxy endpoint to forward data to HTTP/gRPC source listening ports, bypassing browser CORS restrictions - Add instance-specific and default-instance route handlers for the new push endpoint in v1/handlers.rs and v1/routes.rs - Add reqwest 0.12 dependency for outbound HTTP client calls - Fix clippy uninlined_format_args lint errors in shared/handlers.rs UI Canvas: - Add NodeShell component as a shared wrapper for all node types with expand/collapse animation (framer-motion), per-node locking, and consistent handle rendering - Refactor SourceNode, QueryNode, and ReactionNode to use NodeShell, exposing detailed config (query text, source lists, reaction properties) in the expanded view - Add SourcePushPanel component for inline test data submission to HTTP/gRPC sources via the new push proxy endpoint - Add useAutoLayout hook with collision-aware node clamping to prevent expanded nodes from overlapping neighbors - Add useCanvasPersistence hook to save/restore node positions, expanded state, lock state, and viewport to localStorage per instance - Add canvas-level lock toggle, multi-select node deletion with confirmation, and Delete/Backspace keyboard shortcut support - Update CSS: edges render above nodes (z-index), smooth position transitions for displaced nodes, refined node-card transition props UI Instance Management: - Support ?instance= URL search param for deep-linking to instances - Persist selected instance to localStorage across sessions - Show 'instance not found' banner with option to create the missing instance, pre-filling the ID in CreateInstanceDialog UI Data Model: - Add properties field to SourceStatusResponse and ReactionStatusResponse - Add query and queryLanguage fields to pipeline QueryInfo - Export layout constants (COLUMN_X, NODE_SPACING_Y, NODE_START_Y) from graph utils for use by auto-layout Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Create Instance dialog (and other API calls) would hang for a long time
because the browser's HTTP/1.1 connection pool was saturated by SSE streams.
Root cause: useSources, useQueries, and useReactions each opened their own
EventSource to the same /events endpoint, consuming 3 of the browser's 6
concurrent connections per domain. With inspector panels adding per-component
SSE streams on top, new API requests (POST, GET) would queue indefinitely
behind the persistent SSE connections.
Changes:
1. Shared EventSource singleton (ui/src/hooks/useApi.ts):
- All hooks now share a single EventSource per instance via a module-level
Map<instanceId, SharedES> that multiplexes events to multiple listeners
- When the last listener unsubscribes, the EventSource is closed and removed
- Reduces SSE connections from 3 per page to 1 per instance
2. Create Instance dialog loading state (CreateInstanceDialog.tsx):
- handleSave is now async and awaits the onSave callback
- Added saving state with 'Creating...' button text and disabled state
- onSave prop typed as Promise<void> for proper async handling
- Errors during creation are caught and shown inline
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Nodes would mount at a default width then animate to their persisted size,
causing a jarring resize flash when the UI first loaded.
Root cause: buildFlowGraph() created nodes without expanded/locked/position
state. useCanvasPersistence restored this state in a useEffect AFTER mount,
triggering a Framer Motion width animation from default → persisted size.
Fix:
1. FlowCanvas now reads persisted state from localStorage synchronously
during useMemo (before mount) and pre-applies positions, expanded, and
locked flags to the initial nodes.
2. NodeShell sets initial={{ width: targetWidth }} on the motion.div so
Framer Motion starts at the correct width — no animation on mount.
Subsequent expand/collapse interactions still animate normally.
3. Exported loadPersistedState() from useCanvasPersistence for reuse.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Server' - Create DrasiLogo.tsx component with the official Drasi chevron icon SVG (lime #c3fb3b + green #48e263 arrow polygons) and full wordmark variant - Replace placeholder gradient 'D' square in AppLayout header with DrasiLogo - Rename 'DRASI CONTROL' to 'DRASI SERVER' in the UI header - Update HTML page title from 'Drasi Control' to 'Drasi Server' - Update server.rs log messages from 'Drasi Control UI' to 'Drasi Server Admin UI' Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move global canvas lock from top-right panel into bottom-left React Flow Controls toolbar as a ControlButton for consistent placement - Hide React Flow's built-in interactivity toggle (showInteractive=false) to avoid duplicate lock icons in the controls panel - Add selection-aware lock/unlock button in top-right toolbar that appears alongside delete when nodes are selected (locks all selected if any are unlocked, unlocks all if all are locked) - Top-right panel now only renders when canvas is unlocked and nodes are selected, keeping the canvas clean when not in multi-select mode Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icon - Replace EventBar (fixed bottom bar) with EventPanel (left slide-out panel) that shows all events as a scrollable list with colored status dots - Wire the Activity icon in the header as a toggle button to open/close the panel, with a badge showing unread event count - Add Clear and Close buttons to the panel header - Remove .event-bar CSS class (no longer needed) - Rename EventBar.tsx to EventPanel.tsx to match new component name - Replace placeholder favicon with Drasi chevron logo (lime + green arrows on dark background), properly centered in 32x32 viewBox Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce a complete theming system that allows users to switch between light and dark modes, with their preference remembered across sessions. CSS variable-based theming: - Define light theme variables on :root and dark theme variables on .dark - Replace all hardcoded hex colors in tailwind.config.js with var(--drasi-*) - Enable Tailwind darkMode: "class" strategy - Add --drasi-minimap-mask variable for theme-aware minimap overlay useTheme hook (new file): - Reads/writes theme preference to localStorage under drasi-theme key - Toggles .dark class on <html> element to switch CSS variable sets - Defaults to dark theme when no preference is stored Flash-of-unstyled-content prevention: - Inline script in index.html applies saved theme before first paint - Removes hardcoded class="dark" from <html>, letting the script decide Header theme toggle: - Sun icon (in dark mode) / Moon icon (in light mode) in AppLayout header - Theme state and toggle callback threaded from App.tsx through props Theme-aware component updates: - DrasiLogo wordmark fill changed from #fff to currentColor - FlowCanvas Background and MiniMap use CSS variables instead of hex - NodeShell hover states use drasi-text-secondary/10 instead of white/10 - Edge strokes in graph.ts use var(--drasi-border) for inactive edges - colors.ts adds getTheme() that reads computed CSS variables at runtime Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply vertical displacement in the same setNodes call as the expand toggle so both CSS transitions start in the same paint frame. Add height target locking in useAutoLayout to prevent intermediate ResizeObserver measurements from restarting CSS transitions. Switch to useLayoutEffect for displacement to commit before paint. Slow height expansion from 400ms to 405ms to prevent the expanding node from visually outrunning displaced neighbors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eployment
Features:
- Solution templates system with YAML-based templates in /solutions directory
- REST API endpoints: GET/POST /api/v1/catalog/solutions for browsing,
POST /api/v1/instances/{id}/solutions for deployment
- Variable substitution with ${VAR:-default} syntax and YAML comment extraction
- UI integration: Browse Catalog tab in Add dialog, deploy dialog with
instance creation, variable configuration showing descriptions and usedBy
Deployment improvements:
- Validation-first: parse and validate ALL components before creating any
- Returns all validation errors at once so users can fix them together
- Creation order: sources → queries → reactions (all stopped initially)
- Start phase only after all creations succeed
UI/UX improvements:
- Redesigned Add dialog with tabbed interface (Add Component / Browse Catalog)
- Redesigned deploy dialog with visual component breakdown
- Error messages displayed on nodes and inspector panel
- Instance selector with inline create option in deploy dialog
- Fixed crypto.randomUUID fallback for older browsers
- Smarter error handling (only clear lists on 404, not transient errors)
Other:
- Added dev-build and clean-dev-build Makefile targets
- Updated OpenAPI documentation with solution endpoints
- DtoMapper now supports variable overrides for template instantiation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add collapsedMinHeight prop to NodeShell component and set it to 85px for all node types (Source, Query, Reaction) to ensure uniform sizing when collapsed on the canvas. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, source→query edges animated based on query status only, and query→reaction edges animated based on reaction status only. Now edges only animate (and show green) when BOTH endpoints are running. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
UI changes: - Add useQueryResults hook for fetching and streaming query results via SSE - Display live results table in expanded QueryNode with streaming indicator - Pass instanceId through component graph for proper API routing Solution template fix: - Fix iot-temperature-monitor.yaml query label: Sensor -> sensorReading (mock source generates SensorReading nodes, not Sensor) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 47 new tests for solution template functionality: - solution_catalog_test.rs (15 tests): Template listing, details, variable extraction, and catalog API validation - solution_deployment_test.rs (22 tests): Solution deployment via API, variable substitution, validation errors, multi-instance support, scriptfile bootstrap, and query result validation - solution_e2e_data_flow_test.rs (10 tests): Complete pipeline validation using wiremock to capture HTTP reaction output and verify data flows correctly from Source → Query → Reaction with payload validation Test infrastructure: - tests/test_support/solution_helpers.rs: Template generators, router setup, wait helpers for query results, JSONL file creation utilities - tests/fixtures/solutions/: Test fixture templates Also fixes flaky log stream tests in api_integration_test.rs by using unique instance IDs per test to avoid global log registry interference. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Light theme UI improvements: - Darken canvas background (#f8fafc → #e2e8f0) so white cards stand out - Use pure white (#ffffff) for node cards with subtle drop shadow - Increase border contrast (#cbd5e1 → #94a3b8) for visible card edges - Add --drasi-edge variable (#64748b) for darker connection lines - Improve secondary text contrast (#64748b → #475569) Fix iot-temperature-monitor.yaml: - Change query label from 'sensorReading' to 'SensorReading' (PascalCase) to match the labels generated by MockSource Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move lock/start-stop/expand controls to bottom toolbar on right - Change lock icon to pin with angled rotation for unpinned state - Reduce node card padding for tighter layout (p-3 → px-2 py-1.5) - Increase toolbar icon size to 14px for better visibility - Keep expand button always visible (greyed out when disabled) - Add canvas click handler to close inspector panel - Add slide-out animation for inspector panel using Framer Motion - Add useApi hook for direct start/stop API calls from nodes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add specialized inspector panels for Source, Query, and Reaction components - Each panel shows component-specific config and Data Flow section - Data Flow shows INPUT sources and OUTPUT reactions/queries - Redesign node toolbar with crossfade animation between collapsed/expanded states - Pin and expand controls in bottom toolbar (collapsed) or top-right (expanded) - Controls disabled (not hidden) when canvas is locked - Add delete confirmation dialog with dependency checking - Sources cannot be deleted if queries depend on them - Queries cannot be deleted if reactions depend on them - Inspector panel slides out with animation on close - Expanded nodes show only runtime info (results/activity), config in inspector only - Canvas controls sized to 32x32 with 18px icons Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Standardize colors across all UI components: - Sources: Green (#22c55e) - Queries: Blue (#3b82f6) - Reactions: Purple (#8b5cf6) Centralize color definitions in tailwind.config.js and utils/colors.ts. Replace all hardcoded color values with semantic Tailwind classes (text-drasi-source, text-drasi-query, text-drasi-reaction, etc.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SourceInspectorPanel: Use kind-based icon map (Database, Globe, Radio, etc.) - QueryInspectorPanel: Use Search icon (consistent with QueryNode) - ReactionInspectorPanel: Use kind-based icon map (FileText, Globe, Rss, etc.) Icons now match between node cards and their inspector panels. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Files were already tracked before adding to .gitignore. Use 'git rm --cached' to untrack while keeping files on disk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add removeNodeFromPersistedState() to clean up localStorage when a node is deleted - Add removeInstancePersistedState() for future instance deletion support - Update useSources/useQueries/useReactions remove() to clean up persisted state Node positions are already partitioned by instance ID via storage key prefix. Auto-save on node add is handled by the fingerprint change detection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Display 'GQL Query' or 'Cypher Query' instead of generic 'Continuous Query' based on the queryLanguage property. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New button arranges nodes in columns by type: - Sources on left (x=50) - Queries in center (x=400) - Reactions on right (x=750) Uses LayoutGrid icon from lucide-react. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Increase dropdown width from 256px to 384px (w-64 → w-96) - Increase display name truncation from 16 to 32 characters GUIDs are 36 characters, so most instance IDs will now display fully. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Auto-layout now centers and zooms to show all nodes after rearranging. Uses React Flow's fitView with 20% padding and 300ms animation. Refactored AutoLayout component to expose canvas API (collision + fitView) via ref pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…export template)
Add three new instance management capabilities to the Drasi Server UI:
1. Create Instance with Solution Template
- Add solution template dropdown to CreateInstanceDialog
- Auto-deploy selected template after instance creation
- New instance becomes active after completion
2. Clone Instance
- Add CloneInstanceDialog with progress tracking
- Clone menu item in InstanceSelector
- Copies all sources, queries, reactions with autoStart=false
- Client-side orchestration using existing CRUD endpoints
3. Create Solution Template from Instance
- Add CreateSolutionTemplateDialog with component selection
- Users can select which sources/queries/reactions to include
- New server endpoint: POST /api/v1/instances/{id}/catalog/solutions
- Exports selected components as reusable YAML template
Backend changes:
- Add CreateSolutionTemplateRequest/Response types
- Implement create_solution_template in solutions.rs
- Add handler and route for new endpoint
- Update OpenAPI documentation with new endpoint and schemas
- Add unit tests for request validation and response types
UI changes:
- New CloneInstanceDialog.tsx component
- New CreateSolutionTemplateDialog.tsx component
- Update InstanceSelector with clone/template menu items
- Update CreateInstanceDialog with template selector
- Add createSolutionTemplate to API client
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add missing instanceId to reaction pipeline data in App.tsx
This fixes the reaction node toolbar start/stop buttons not working
because instanceId was undefined when calling the API
- Fix clippy errors in solutions.rs by using inline format string variables
(e.g., {source_id} instead of "{}", source_id)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove redundant Quick Stats section (sources/lines/reactions boxes) since this info is already shown in the Data Flow section - Increase query definition box max height from 256px to 384px - Use explicit vertical scrollbar (overflow-y-auto) for long queries Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add start/stop buttons and clickable navigation to connected components in all three inspector panels (Source, Query, Reaction). New features: - ConnectedComponentItem: Reusable component for data flow items - Clickable component names navigate to that component's inspector - Start/Stop buttons for each connected component based on status - Spinning indicator shown during Starting/Stopping transitions Changes: - Create ConnectedComponentItem.tsx shared component - Update SourceInspectorPanel with onNavigate, onStartQuery, onStopQuery - Update QueryInspectorPanel with onNavigate, onStart/Stop for sources and reactions - Update ReactionInspectorPanel with onNavigate, onStartQuery, onStopQuery - Wire up all handlers in App.tsx Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
C15: Add read-only mode checks to all 5 plugin mutation endpoints (load, install, upgrade, retire, promote) in plugin_handlers.rs. Each returns 403 with CONFIG_READ_ONLY when the server config is read-only, matching the pattern used by source/query/reaction handlers. C16: Add read-only mode checks to create_solution_template and deploy_solution in solution_handlers.rs. Also clean up a redundant fully-qualified error_codes path. OpenAPI annotations updated with 403 responses for all 7 handlers. PR review plan updated with progress (11 of 19 resolved). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
C4: Update "Running all cargo tests" to "Running unit and integration tests (including ignored/E2E)" since `cargo test --tests` does not run doctests. Doctests are covered by the separate `cargo test --doc` line. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
serenakeyitan
left a comment
There was a problem hiding this comment.
Thanks for the follow-up fixes. I re-checked the latest head and I’m still holding approval on two blocking issues:
read_onlystill is not enforced on the new mutation endpoints.retire_plugin,load_plugin,install_plugin,upgrade_plugin, andpromote_plugininsrc/api/v1/plugin_handlers.rs, pluscreate_solution_templateanddeploy_solutioninsrc/api/v1/handlers/solution_handlers.rs, can all mutate runtime state or persisted files even when the server is started in read-only mode.- There are still multiple
plugin_registryread guards held across.await, for example insrc/server.rs:431,src/server.rs:446,src/plugin_orchestrator.rs:534, andsrc/api/shared/handlers/instance_handlers.rs:268. That can block lifecycle mutations and hot-reload work behind component construction, which matches the deadlock/starvation concern already called out earlier.
Happy to take another pass once those are addressed.
This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.
agentofreality
left a comment
There was a problem hiding this comment.
Resolved — Rather than threading a shared Mutex through two separate top-level components, the implementation composes PluginOperations inside PluginOrchestrator (as plugin_ops: Option<PluginOperations>) and adds a dir_mutex: Mutex<()> on the orchestrator that serializes all directory-mutating operations:
install_and_load()acquiresdir_mutexbefore callingops.install_from_registry()+load_plugin_inner()load_plugin_locked()acquiresdir_mutexbefore loading from diskupgrade_plugin()callsload_plugin_locked(), inheriting the lock
All API handlers (load_plugin, install_plugin, upgrade_plugin) route through these locked methods. Read-only operations (search_registry, list_plugins) appropriately bypass the mutex. The remove_plugin_file() method on PluginOperations has an explicit doc comment noting that runtime callers must go through the orchestrator's locked paths.
This composition pattern is cleaner than a shared-mutex approach: only PluginOrchestrator manages concurrency, PluginOperations stays a stateless utility, and CLI/init single-threaded paths don't carry unnecessary mutex overhead.
Add dir_mutex to PluginOrchestrator to serialize plugin directory mutations (install, load, remove), preventing races between API calls, hot-reload watcher, and startup loading. PluginOperations is now composed inside PluginOrchestrator rather than operating independently. This resolves PR drasi-project#84 review comment C18 (Daniel Gerlag). Also fix trading demo start-demo.sh failure by adding verifyPlugins: false to trading-sources-only.yaml — locally-built plugins have no OCI signatures, so signature verification must be disabled for the demo. Other changes in this commit: - Plugin orchestrator: with_ops() constructor, install_and_load(), load_plugin_locked() with verification support - CLI: rename --verify-plugins to --skip-verification (opt-out default) - Update CLAUDE.md docs for verification flag change - Update pr-84-review-plan.md: mark C18 resolved (13/19 complete) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add create_source_locked() and create_reaction_locked() factory functions that acquire the RwLock<PluginRegistry>, clone Arc descriptors, drop the lock, then perform async component creation. This eliminates the anti-pattern of holding a tokio RwLock read guard across .await points, which blocks writers and risks deadlock. Updated call sites across all code paths: - instance_handlers.rs: clone_instance source/reaction creation - source_handlers.rs: create, upsert (update + create paths) - reaction_handlers.rs: create, upsert (update + create paths) - solution_handlers.rs: deploy_solution no longer holds lock - solutions.rs: accepts &RwLock<PluginRegistry> instead of &PluginRegistry - server.rs: startup source/reaction creation loops - plugin_orchestrator.rs: migrate_component source/reaction recreation Also fixes pre-existing clippy uninlined_format_args lint. Addresses PR drasi-project#84 review comments C17 and C2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The runtime upgrade-migration path on PluginOrchestrator was incomplete and contained several unresolved correctness issues raised in PR drasi-project#84 review (C13: query-recovery semantics during drain-then-retire, and C14: no rollback if migration failed after the old component was removed). Rather than patching a partial design, this strips the entire half-implemented surface so it can be redesigned cleanly in a follow-up PR. Removed: - PluginOrchestrator: upgrade_plugin, retire_plugin, promote_plugin, migrate_component, library_generation tracking - REST API: POST /plugins/{id}/upgrade, /retire, /promote and their DTOs - UI: PluginManagementPanel and PluginsPanel upgrade/retire controls - Config & init prompts: hot-reload-related upgrade options that only made sense alongside live replacement - Tests covering the removed surface (hot_reload_test, init_output_test, openapi_test, plugin_handlers tests) Behavior change: replacing a loaded plugin now requires a server restart. Attempting to load a plugin whose id is already registered returns an explicit "already loaded — restart to replace" error. README and CLAUDE.md are updated to document this constraint. Net diff: 19 files, +62 / -1149. Refs PR drasi-project#84 review comments C13, C14. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both items were resolved by commit 2d03ef2, which removed the partial plugin upgrade/migration surface entirely instead of patching the incomplete drain-then-retire path. Updates the plan to reflect the strip-down resolution and bumps progress 15 → 17 of 19. Also folds in the previously-unstaged updates noting that C17 and C2 are complete. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
C3 (resolver tests): Serialize tests that mutate process-wide
environment variables. std::env::set_var/remove_var race with each
other under cargo test's parallel runner. Introduce a static
ENV_TEST_LOCK Mutex<()> in the resolver test module and have every
env-touching test acquire it (poisoning-safe via
unwrap_or_else(|e| e.into_inner())) for the duration of the test.
C9 (plugin error HTTP status mappings): Map plugin-related error
codes to more accurate HTTP statuses instead of falling through to
500:
- PLUGIN_NO_DIRECTORY -> 503 Service Unavailable (server has no
plugins directory configured, so the operation is unavailable)
- PLUGIN_INSTALL_FAILED, PLUGIN_SEARCH_FAILED -> 502 Bad Gateway
(these flow from upstream OCI registry calls)
- PLUGIN_LOAD_FAILED stays at 500 with an inline comment
explaining why (server-side cdylib failure, not client error)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All 19 review comments from PR drasi-project#84 have been resolved (the final two — C3 and C9 — landed in a5261c3). The plan was a working artifact for tracking responses and is no longer needed in-tree. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-server#84) (#388) * feat: add Added and Removed variants to ComponentStatus Restore Added and Removed as first-class variants of ComponentStatus, replacing the brittle pattern of encoding add/remove semantics in message suffix strings (e.g., msg.ends_with("added")). Previously, ComponentGraph emitted ComponentStatus::Stopped with a message like "Source added" or "Source removed", forcing downstream consumers to parse message strings to distinguish add/remove events from actual stops. This eliminated compiler-enforced exhaustive matching and created invisible coupling between producer format strings and consumer suffix checks. Changes: - Add Added and Removed variants to ComponentStatus enum - Register methods now create components with status Added (was Stopped) - remove_component() now emits Removed status (was Stopped) - ComponentGraphSource dispatches on status enum instead of message suffix parsing - Update state machine: Added can transition to Starting, Stopped, or Reconfiguring - Update FfiComponentStatus and all FFI mappings (plugin-sdk, host-sdk) - Update status_str(), lifecycle docs, and state diagram - Update all affected tests across 12 test files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor: make JSON conversion infallible, subscribe_events read-only, and add defensive guards Three groups of improvements: 1. Make convert_json_to_element_value and convert_json_to_element_properties infallible (return T instead of Result<T>). Every JSON type maps cleanly to an ElementValue, so the Result wrapper was unnecessary. Updates all callers across bootstrappers, sources, and tests. 2. Change ComponentGraph::subscribe_events from &mut self to &self by adding try_subscribe() which returns Option instead of auto-creating channels. This allows managers (query, reaction, source) to use read() locks instead of write() locks when subscribing to events. 3. Add double-start guard to ComponentGraphSource::start(), clarify subscription label-to-source mapping comment, and fix doc comment (HasDependents -> Validation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor(host-sdk): remove plugin upgrade/retire/side-by-side machinery Drops the partial and broken plugin lifecycle features (generation counters, side-by-side versions, drain-then-retire orchestration) that were never fully wired through. The remaining surface keeps single-version plugin load/register/status semantics: - lifecycle: drop generation_counter, LoadedPluginState.generation, and retire_plugin; load path no longer assigns generations - plugin_registry: remove register_*_versioned, deregister_*, promote_*, and the per-plugin generation parameter from register_*_with_metadata - plugin_types: remove PluginStatus::{Draining,Retired} and PluginEvent::{LoadedSideBySide,Upgraded,UpgradePartialFailure,Draining, Promoted,Retired} - watcher: doc-only cleanup - lib: scrub pluginGeneration references from builder and *_ops doc comments Also fixes the drasi-bootstrap-scriptfile doctest, which mixed the re-exported ScriptFileBootstrapConfig with the drasi_lib::bootstrap path and tripped E0308 ("two different versions of crate drasi_lib"). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bumped to the matched 0.6.x family release of the Drasi crate set: drasi-lib 0.4.2 -> 0.6.0 drasi-host-sdk 0.4.3 -> 0.6.0 drasi-plugin-sdk 0.4.3 -> 0.6.0 drasi-core 0.4.1 -> 0.4.2 drasi-bootstrap-noop 0.1.10 -> 0.1.12 drasi-bootstrap-application 0.1.11 -> 0.1.13 drasi-reaction-application 0.2.10 -> 0.2.11 drasi-state-store-redb 0.1.8 -> 0.1.9 drasi-index-rocksdb 0.3.1 -> 0.3.2 drasi-index-garnet 0.1.8 -> 0.1.9 API changes required by 0.6: - QueryConfig gained a `recovery_policy: Option<RecoveryPolicy>` field. Default to `None` in the DTO->QueryConfig mapper so the global default policy applies. - SubscriptionResponse gained a `position_handle: Option<Arc<AtomicU64>>` field used by replay-capable sources for durable position tracking. All four in-tree mock subscribe() implementations (persistence, api/joins_tests, tests/test_support/mock_components, and tests/bootstrap_results_test) now return `position_handle: None`. `cargo check --all-targets` is clean and all 287 lib tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| /// components (sources, queries, reactions) in the target instance. | ||
| /// All cloned components are created in the stopped state. | ||
| /// On failure, already-created components are rolled back. | ||
| pub async fn clone_instance( |
There was a problem hiding this comment.
Rollback on mid-operation failure seems to be best-effort here.
What happens if the rollback itself errors?
There was a problem hiding this comment.
Will it leave orphaned Phase-1 components?
There was a problem hiding this comment.
Fixed in e604bdc.
After auditing DrasiLib::remove_* (drasi-lib 0.6.0), the realistic failure surface in this call site is very narrow — every cloned component is forced to auto_start = false, rollback removes in dependent-first order (reactions → queries → sources), teardown of a never-started component with cleanup = false is essentially a runtime-map removal, and graph deregister failures are absorbed inside remove_*. The remaining failure modes are concurrent-mutation races (no instance-wide lock) and bugs in drasi-lib internals; neither is improved by refusing to roll back or retrying.
So rather than restructure the rollback flow, this commit makes the existing best-effort behavior observable:
- Doc block above
rollback_sources/queries/reactionsdocuments why best-effort is safe at this call site and what the realistic failure modes are. - Rollback helpers now return
Vec<String>of human-readable failure descriptions (e.g."source 'pg1': <err>") instead of(). - Promoted rollback log lines from
warn!toerror!so they appear at default log level. - New
clone_errorhelper builds theErrorResponsewith collected rollback failures threaded intoErrorDetail::technical_details(per project convention — never embedded inmessage). All 5 phase-failure paths inclone_instanceuse it. HTTP status, error codes, and primarymessageare unchanged. An operator who hits the rare race now gets a structured list of any orphans for manual cleanup instead of a200-success-shaped void.
No new tests — the failure paths can't be exercised reliably without monkey-patching drasi-lib internals.
|
|
||
| /// Helper function to persist configuration after a successful operation. | ||
| /// Logs errors but does not fail the request - persistence failures are non-fatal. | ||
| pub async fn persist_after_operation( |
There was a problem hiding this comment.
This logs persistence errors and returns OK. Is this being surfaced to the caller of the operation?
There was a problem hiding this comment.
Fixed in e2952e9.
persist_after_operation now returns Result<(), ErrorResponse> carrying a new PERSISTENCE_FAILED error code (HTTP 500). All 12 call sites across source/query/reaction/instance/clone handlers ?-propagate the result, so a YAML write failure is now surfaced to the API caller instead of being silently logged. The high-level message documents that the in-memory change was applied but not persisted (recommend retry/restart); the underlying error is placed in ErrorDetail::technical_details per the project convention (no raw errors embedded in message, now also documented in CLAUDE.md).
Unit tests in src/persistence.rs cover both the failure path (chmod the config dir to 0o555 mid-flight, assert correct code/message/technical_details) and the None no-op path. cargo clippy --all-targets is clean and cargo test passes (289 lib + integration suites).
| // Parse the schema map | ||
| let schema_map: serde_json::Map<String, Value> = match serde_json::from_str(schema_map_json) { | ||
| Ok(Value::Object(map)) => map, | ||
| Ok(_) | Err(_) => { |
There was a problem hiding this comment.
Will we warn and return empty error vector for invalid json?
There was a problem hiding this comment.
Fixed in 48d154c.
The three meta-level failure paths in validate_against_schema are now distinguishable via a new schema_error_codes module (SCHEMA_UNPARSEABLE, SCHEMA_ENTRY_MISSING, SCHEMA_VALIDATOR_BUILD_FAILED) exposed through a new optional code: Option<&'static str> field on FieldError (#[serde(skip_serializing_if)] keeps existing JSON unchanged for jsonschema-derived errors). Mirrors the error_codes pattern already used in api::shared::error.
Behavior:
- Invalid schema JSON / non-object root — fatal
FieldErrorwithSCHEMA_UNPARSEABLE(was already fatal; now also carries the code). - Entry name missing from the parsed schema map — was previously a silent
Vec::new()(this was the worst silent path: a typo in a plugin'sconfig_schema_name()would let any config validate as OK). Now a fatalFieldErrorwithSCHEMA_ENTRY_MISSING. jsonschema::validator_forbuild failure — log promoted fromwarn!toerror!for visibility, but still returns an empty Vec. The widespread cause today is incomplete$defsin upstream drasi-core plugin schemas (refs toQueryConfigDto,ConfigValue,ConfigValueStringthat aren't published alongside the root schema). Making this fatal would block validation of every example config that uses those plugins. TheSCHEMA_VALIDATOR_BUILD_FAILEDconstant is reserved and the code path has a doc comment so this can be promoted to fatal once upstream schemas are fixed.
Tests in schema_validation.rs cover all three paths (16 total, all green); full cargo test suite passes.
| PluginFileEvent::Added(path) | PluginFileEvent::Changed(path) => { | ||
| info!("Plugin file change detected: {}", path.display()); | ||
| match orchestrator_for_watcher | ||
| .load_plugin_locked(&path, None) |
There was a problem hiding this comment.
Hot reload path doesn't seem to have verification on the config. So will anyone with write access to plugins dir be able to swap verified binary for malicious one after start up?
Not a blocker for this PR, but if the suspicion is true here, then we should open a tracking issue for this.
There was a problem hiding this comment.
Acknowledged, but flagging this as out of scope for this PR.
The entire Drasi Server API is currently unauthenticated — anyone with network access to the API port can already create/delete/start/stop sources, queries, and reactions, install plugins from arbitrary OCI registries, and trigger arbitrary configuration. Filesystem write access to the plugins directory is a strictly weaker capability than what the API already grants unauthenticated.
Signature verification on the load path is a useful supply-chain control against tampered registry artifacts, but it isn't a meaningful access control against a local attacker, and tightening just the hot-reload path while the API remains open would be a misleading hardening.
AuthN/authZ for the API, threat-modeling local-vs-remote attackers, and deciding whether/how to enforce signature verification across all load paths (startup, API install, hot-reload, lockfile-cached) belong together in a dedicated security design effort. Tracking this comment as input to that work rather than patching it in isolation here.
| // The lockfile-based check above covers the install-then-load path; | ||
| // for direct load-from-disk, we trust the file if it passes metadata | ||
| // validation during load. | ||
| warn!( |
There was a problem hiding this comment.
If the lockfile is stale or missing, then will unverified plugins load?
There was a problem hiding this comment.
Acknowledged, but flagging this as out of scope for this PR — same framing as the hot-reload comment.
The entire Drasi Server API is currently unauthenticated, so any caller can already install plugins, mutate configuration, and start/stop components without credentials. A stale-or-missing lockfile bypassing verification is a real concern for supply-chain integrity, but in the current threat model it's strictly weaker than the unauthenticated API surface that already exists; hardening the lockfile path in isolation would give a misleading sense of defense.
The correct fix is a dedicated security design pass covering: API authN/authZ, end-to-end signature verification policy across all load paths (startup, API install, hot-reload, lockfile-cached, no-lockfile-fallback), and the trust model for the lockfile itself (who writes it, how is it validated, what happens on mismatch). Tracking this as input to that work rather than patching it in isolation here.
There was a problem hiding this comment.
Given the scope of this change, I've focused my review on spot checks across the main areas rather than a line-by-line pass. At a high level the structure looks reasonable.
Planning to approve later today so the team isn't held up. We can follow up on specific areas in separate PRs later if anything comes up.
Add scripts/run-end-to-end.sh that brings up Postgres, builds and launches drasi-server, and exercises the configured sources, queries, and a CDC INSERT against the REST API as a self-checking smoke test suitable for local verification of the getting-started tutorial. Notes: - Schema apply and replication-slot creation are issued in separate psql sessions because pg_create_logical_replication_slot() called via DO/PERFORM in the same session as CREATE PUBLICATION silently rolls back the publication. - The slot is dropped and recreated alongside the publication on every run; a logical slot's catalog snapshot is captured at creation time, so a reused slot cannot see a recreated publication and replication fails with "publication does not exist". - Auto-detects a sibling ../drasi-core/target/release/plugins build for platforms (e.g. darwin-arm64) where the OCI registry does not yet publish plugins compatible with the SDK version on this branch; falls back to the OCI registry otherwise. LOCAL_PLUGINS_DIR=skip forces the registry path; BUILD_LOCAL_PLUGINS=1 builds first. - README documents the script, env-var overrides, and plugin source resolution order. - .gitignore excludes server.log and the generated server-config.local.yaml. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ct#84 C2) Mutating handlers (create/delete source/query/reaction, instance create, clone) previously called `persist_after_operation` which logged YAML write failures and returned `()`, so the API responded `200 OK` even when the on-disk configuration was out of sync with the in-memory state. Changes: - Add `PERSISTENCE_FAILED` to `error_codes` (HTTP 500 by default). - `persist_after_operation` now returns `Result<(), ErrorResponse>`. The high-level `message` documents that the change was applied in memory but not persisted; the underlying error string is placed in `ErrorDetail::technical_details` per the project convention (no raw errors embedded in `message`). - Update all 12 call sites to `?`-propagate the result. - Document the new code and the "no raw errors in message" rule in CLAUDE.md. - Add unit tests in `src/persistence.rs`: * `test_persist_after_operation_surfaces_failure` (Unix) — chmods the config dir to 0o555 to trigger the temp-file write failure, then asserts the returned `ErrorResponse` has the right code, the expected high-level message, populated `technical_details`, and that the raw error is NOT embedded in `message`. * `test_persist_after_operation_none_is_noop`. Addresses review comment C2 on PR drasi-project#84. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rasi-project#84 C3) validate_against_schema previously had three meta-level failure paths that returned an empty Vec<FieldError> (or a single string-only entry), causing components whose plugin schemas couldn't be evaluated to be reported as "OK" with no signal to the caller. Changes: - Add schema_error_codes module to plugin_validation with three stable string constants (SCHEMA_UNPARSEABLE, SCHEMA_ENTRY_MISSING, SCHEMA_VALIDATOR_BUILD_FAILED). Mirrors the error_codes pattern in api::shared::error (string consts, no enum). - Extend FieldError with code: Option<&'static str> (#[serde(skip_serializing_if = "Option::is_none")] keeps existing JSON output unchanged for jsonschema-derived field errors). - SCHEMA_UNPARSEABLE: now carries the code; message documents that the component config could not be validated. - SCHEMA_ENTRY_MISSING: was previously a silent Vec::new(). Now returns a fatal FieldError with the code populated -- this was the worst silent path (a typo in config_schema_name() would pass validation with zero errors). - SCHEMA_VALIDATOR_BUILD_FAILED: typically caused by incomplete $defs in upstream plugin schemas (e.g. refs to QueryConfigDto, ConfigValue, ConfigValueString). Promote log from warn! to error! so it's visible, but keep returning an empty Vec -- making this fatal would block validation of every example config that uses affected plugins. Doc comment reserves the constant for future tightening once upstream schemas are fixed. - Tests: keep all existing schema-validation tests passing; rewrite test_missing_entry_point to assert the new fatal+coded behavior; add test_field_errors_have_no_code and test_validator_build_failure_is_logged_not_fatal. 16 tests total in this module, all pass. Addresses review comment C3 on PR drasi-project#84. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-project#84 C1) clone_instance rollback helpers (rollback_sources/queries/reactions) were silently best-effort: any error from DrasiLib::remove_* was warn-logged and discarded, and the resulting ErrorResponse carried no indication that an orphan might exist. Audit of DrasiLib::remove_* (drasi-lib 0.6.0) confirmed best-effort is the correct semantic at this call site: - All cloned components are forced to auto_start = false, so rollback removes stopped components. - Removal happens in dependent-first order (reactions -> queries -> sources), so can_remove dependent checks should not reject. - Teardown of a never-started component with cleanup = false is essentially a runtime-map removal. - Graph deregister failures are absorbed inside remove_*. The realistic failure surface is therefore concurrent-mutation races (narrow given no instance-wide lock) or bugs/panics in drasi-lib internals. In either case, refusing to roll back or retrying is not safer than logging and continuing. Changes: - Add a doc block above the three rollback helpers documenting why best-effort is safe and what the realistic failure modes are. - Helpers now return Vec<String> of human-readable failure descriptions ("source 'pg1': <err>") instead of (). - Promote rollback log lines from warn! to error! so they appear at default log level. - New clone_error helper builds ErrorResponse with collected rollback failures threaded into ErrorDetail::technical_details (per project convention -- never embedded in message). All 5 phase-failure paths in clone_instance use it. HTTP status, error codes, and primary message are unchanged. Operators who hit the rare race get a structured list of any orphans for manual cleanup. No new tests: the failure paths cannot be exercised reliably without monkey-patching drasi-lib internals. cargo clippy --all-targets clean, full cargo test suite passes. Addresses review comment C1 on PR drasi-project#84. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
nit: I had to run make build-ui before running ./start-demo.sh for the drasi server UI to show up. This is not documented in the README
There was a problem hiding this comment.
improved scripts and docuementation
| .await; | ||
|
|
||
| // Start plugin hot-reload watcher if configured | ||
| let watcher_handle = if config.hot_reload_plugins { |
There was a problem hiding this comment.
At startup (L90–L177), plugins are verified against the OCI registry using cosign before loading. However, the hot-reload path (L292–L298) calls load_plugin_locked directly without any verification. Should we gate hot-reload behind the same signature check used at startup?
There was a problem hiding this comment.
Good catch. We are undertaking an all up review of plugin loading and component lifecycles, so will close this hole as part of that work.
Source-build users had to manually run `make build-ui` before the Web UI would appear at /ui (raised in PR drasi-project#84 review). The Makefile already wires build-ui into make build/build-release/run, but the README Quick Start and example scripts called `cargo build --release` directly, bypassing it. - examples: switch trading/start-demo.sh, playground/start.sh, and getting-started/scripts/start-server.sh from `cargo build --release` to `make build-release`, and add a `ui/dist` existence check that invokes `make build-ui` when the binary is already present. - README: add Node.js/npm prerequisite and recommend `make build-release` in Quick Start "Option B" and Building from Source "Option 2"; add a new Prerequisites subsection to the Web UI Guide explaining the build-from-source vs Docker distinction. - src/server.rs: upgrade the runtime "ui/dist not found" warning to include the resolved path checked and the concrete remediation commands (`make build-ui` / `--disable-ui`). No functional code changes; no build-system changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cargo fmt cleanup of the missing-ui_dist warning added in the previous commit. No behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request introduces significant enhancements to the Drasi Server, focusing on advanced plugin lifecycle management, expanded API endpoints, improved documentation, and dependency updates. It also removes several VSCode workspace configuration files and tasks, and updates the Dockerfile to build the web UI as part of the Docker image. The changes improve developer workflows, clarify plugin development, and enable dynamic plugin operations at runtime.
Major enhancements and new features:
Plugin lifecycle management:
Arc<RwLock<PluginRegistry>>. ThePluginOperationsmodule is introduced as the single source of truth for plugin management, shared across CLI, API, and startup. [1] [2] [3] [4]API and configuration improvements:
Dependency and build system updates:
drasi-lib,drasi-bootstrap-*,drasi-reaction-application,drasi-index-*,drasi-state-store-redb,drasi-plugin-sdk, anddrasi-host-sdk. Adds new dependencies for schema validation and HTTP requests. [1] [2] [3] [4]Developer tooling and documentation:
Documentation improvements:
CLAUDE.mdwith detailed guides on plugin development, server architecture, API endpoints, error handling, and state management. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10].github/skills/commit-staged/SKILL.mdwith step-by-step instructions for committing staged files.VSCode workspace cleanup:
References:
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22]