Skip to content

Synchronize gateway controller replicas based on DB backed eventing Mechanism.#1354

Merged
Krishanx92 merged 18 commits intowso2:mainfrom
VirajSalaka:sync-fix-v3
Mar 17, 2026
Merged

Synchronize gateway controller replicas based on DB backed eventing Mechanism.#1354
Krishanx92 merged 18 commits intowso2:mainfrom
VirajSalaka:sync-fix-v3

Conversation

@VirajSalaka
Copy link
Copy Markdown
Contributor

@VirajSalaka VirajSalaka commented Mar 15, 2026

Purpose

Explain why this feature or fix is required. Describe the underlying problems, issues, or needs driving this feature/fix and include links to related issues in the following format: Resolves issue1, issue2, etc.

Goals

Describe what solutions this feature or fix introduces to address the problems outlined above.

Approach

Describe how you are implementing the solutions. Include an animated GIF or screenshot if the change affects the UI. Include a link to a Markdown file or Google doc if the feature write-up is too long to paste here.

User stories

Summary of user stories addressed by this change>

Documentation

Link(s) to product documentation that addresses the changes of this PR. If no doc impact, enter “N/A” plus brief explanation of why there’s no doc impact

Automation tests

  • Unit tests

    Code coverage information

  • Integration tests

    Details about the test cases and coverage

Security checks

Samples

Provide high-level details about the samples related to this feature

Related PRs

List any other related PRs

Test environment

List all JDK versions, operating systems, databases, and browser/versions on which this feature/fix was tested

Summary by CodeRabbit

  • New Features
    • Event-driven synchronization: gateways can publish and consume API and API key lifecycle events for multi-replica coordination; services support event-driven or inline modes.
  • Infrastructure
    • Persistent, SQL-backed Event Hub with subscription, replay, retry, cleanup, and graceful shutdown; database schema bumped to v2 to support gateway state and events.
  • Tests
    • Large suite of unit and integration tests validating publishing, subscription, idempotency, cleanup, polling, and event processing.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a pluggable EventHub subsystem with public event types and interface, a SQL-backed backend (polling, replay, cleanup), gateway registry, DB schema for events/gateway_states, EventListener and processors to reconcile state and drive xDS/policy updates, controller/service wiring to publish events, and extensive tests.

Changes

Cohort / File(s) Summary
Event Hub Core
common/eventhub/types.go, common/eventhub/backend.go, common/eventhub/eventhub.go, common/eventhub/sqlbackend.go, common/eventhub/topic.go, common/eventhub/apikey_event.go
Defines Event/ GatewayState types, EventHub interface and SQL backend implementation (initialize, register, publish, subscribe, unsubscribe, cleanup, close), gateway registry, and API key entity ID helpers.
Event Hub Tests
common/eventhub/eventhub_test.go, common/eventhub/apikey_event_test.go
Adds unit and integration-style tests covering registration, publish/subscribe delivery, idempotency, cleanup/polling behavior, and API key entity ID parsing/validation.
Controller wiring & services
gateway/gateway-controller/cmd/controller/main.go, gateway/gateway-controller/pkg/api/handlers/handlers.go, gateway/gateway-controller/pkg/service/restapi/service.go, gateway/gateway-controller/pkg/utils/api_deployment.go, gateway/gateway-controller/pkg/utils/api_key.go
Wires EventHub into startup/shutdown, passes EventHub into Client/APIServer/RestAPIService/APIDeployment/APIKeyService, adds SetEventHub/publish helpers, and switches several flows to event-driven publishing instead of inline in-memory/xDS updates.
EventListener & processors
gateway/gateway-controller/pkg/eventlistener/listener.go, .../api_processor.go, .../apikey_processor.go
Implements EventListener subscribing to EventHub, dispatching API and API_KEY events, reconciling DB/memory state, triggering async xDS/policy updates, and robust error/panic recovery.
EventListener & controller tests
gateway/gateway-controller/pkg/eventlistener/*_test.go, gateway/gateway-controller/pkg/api/handlers/handlers_test.go, gateway/gateway-controller/pkg/controlplane/*, .../controlplane/*_test.go
Adds extensive tests and mocks for EventListener, API/API key processing, policy derivation, XDS interactions, and integrates mock EventHub into controller/client tests to assert published events and DB state.
Storage & schema
gateway/gateway-controller/pkg/storage/gateway-controller-db.sql, ...-postgres.sql, gateway/gateway-controller/pkg/storage/interface.go, gateway/gateway-controller/pkg/storage/sql_store.go, gateway/gateway-controller/pkg/storage/sqlite.go, gateway/gateway-controller/pkg/storage/sqlite_test.go, gateway/gateway-controller/tests/integration/schema_test.go
Adds gateway_states and events tables and index, bumps schema version to 2, and exposes Storage.GetDB()/sqlStore.GetDB() to provide *sql.DB for EventHub usage; updates schema tests.
API key flows & tests
gateway/gateway-controller/pkg/utils/api_key.go, gateway/gateway-controller/pkg/utils/api_key_test.go, gateway/gateway-controller/pkg/eventlistener/apikey_processor_test.go
Refactors API key lifecycle to support event-driven mode (SetEventHub, publishAPIKeyEvent), adds DB-fallback config lookup, and expands tests for create/update/revoke flows and XDS interactions.
Misc test/integration helpers
gateway/gateway-controller/pkg/controlplane/*, gateway/gateway-controller/pkg/controlplane/client_integration_test.go, numerous test files
Adds integration helpers, mock EventHub implementations, test client constructors, and propagates EventHub through NewClient/NewAPIServer for tests.

Sequence Diagram(s)

mermaid
sequenceDiagram
autonumber
participant Controller as Controller
participant SQL as SQLBackend (events/gateway_states)
participant Listener as EventListener
participant Store as In‑Memory ConfigStore / xDS
Controller->>SQL: Publish(event: API / API_KEY)
Note right of SQL: persist event row; update gateway_state
SQL->>Listener: deliver event via subscriber channel
Listener->>Store: load config (DB or memory)
Listener->>Store: update in-memory state / remove API keys
Listener->>Store: trigger async xDS snapshot update

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

"🐰
I stitched events from DB to ear,
Hopped keys and APIs with cheer,
Controllers publish, listeners sweep,
Replicas wake from synced deep,
A joyful hop — the cluster's clear!"

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is entirely a template with no actual content filled in. All required sections remain as placeholders with explanatory text in angle brackets, making it impossible to understand the purpose, goals, approach, or implementation details of this substantial changeset. Complete all sections of the PR description template with concrete details: fill in Purpose (issues being resolved), Goals (solutions introduced), Approach (implementation methodology), test coverage results, security check confirmations, and test environment details. Provide actual information instead of template placeholders.
Docstring Coverage ⚠️ Warning Docstring coverage is 28.21% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and concisely describes the main change: implementing a database-backed eventing mechanism to synchronize multiple gateway controller replicas.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

🧹 Nitpick comments (4)
gateway/gateway-controller/tests/integration/schema_test.go (1)

103-108: Assert the new v2 tables here too.

This PR bumps user_version because schema v2 adds the eventing tables, but the test still only validates the legacy schema. A build that forgets to create events or gateway_states would still pass this check, so please extend the existence/index assertions in this test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/tests/integration/schema_test.go` around lines 103
- 108, In the SchemaVersion test (t.Run "SchemaVersion") after checking PRAGMA
user_version via rawDB, add assertions that the new v2 tables 'events' and
'gateway_states' exist and that their expected indexes are present by querying
sqlite_master (or PRAGMA index_list/PRAGMA index_info) using
rawDB.QueryRow/Query and asserting no error and non-empty results; reference the
test's rawDB variable and the t.Run "SchemaVersion" block so the checks run
alongside the version assertion.
gateway/gateway-controller/pkg/storage/interface.go (1)

273-276: Keep Storage backend-agnostic by moving GetDB() to a narrower interface.

Lines 273-276 introduce a DB-specific type into the core abstraction, which conflicts with the interface’s own design guidance (Lines 46-47). Prefer a separate optional interface for DB handle access.

♻️ Suggested refactor
 type Storage interface {
@@
-    // GetDB returns the underlying *sql.DB for direct access.
-    // Used by EventHub for event synchronization.
-    // Returns nil for non-SQL backends.
-    GetDB() *sql.DB
@@
     Close() error
 }
+
+// DBHandleProvider is an optional extension for SQL-backed implementations.
+type DBHandleProvider interface {
+    GetDB() *sql.DB
+}
// Call site example:
if dbProvider, ok := storageInstance.(storage.DBHandleProvider); ok {
    db := dbProvider.GetDB()
    // wire EventHub SQL backend
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/storage/interface.go` around lines 273 - 276,
Remove the SQL-specific method from the general Storage abstraction: delete
GetDB() from the Storage interface and introduce a narrower interface named
DBHandleProvider that exposes GetDB() *sql.DB; update any consumers that
currently call Storage.GetDB() (notably EventHub wiring code) to first
type-assert the storage instance to storage.DBHandleProvider and only then call
GetDB(), leaving non-SQL backends unaffected.
gateway/gateway-controller/pkg/controlplane/controlplane_test.go (1)

41-63: Make mock event publication thread-safe to avoid racey tests.

mockControlPlaneEventHub.PublishEvent (Lines 54-63) appends to publishedEvents without locking. If publish calls happen from goroutines, this can race and make tests flaky under -race.

🧪 Test mock hardening
 type mockControlPlaneEventHub struct {
+    mu              sync.Mutex
     publishedEvents []publishedControlPlaneEvent
     publishErr      error
 }
@@
 func (m *mockControlPlaneEventHub) PublishEvent(gatewayID string, event eventhub.Event) error {
     if m.publishErr != nil {
         return m.publishErr
     }
+    m.mu.Lock()
+    defer m.mu.Unlock()
     m.publishedEvents = append(m.publishedEvents, publishedControlPlaneEvent{
         gatewayID: gatewayID,
         event:     event,
     })
     return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/controlplane/controlplane_test.go` around
lines 41 - 63, The mockControlPlaneEventHub's PublishEvent appends to
publishedEvents without synchronization, causing races; make the mock
thread-safe by adding a sync.Mutex (or sync.RWMutex) field to the
mockControlPlaneEventHub struct and use it to protect access to publishedEvents
(lock at start of PublishEvent, defer unlock around the check/append/return).
Ensure any other accesses to publishedEvents in tests or helpers also use the
same mutex if present.
gateway/gateway-controller/pkg/storage/gateway-controller-db.sql (1)

237-250: EventID uniqueness is guaranteed by UUID generation; composite key is an optional design refinement.

The PRIMARY KEY (event_id) is safe in production because EventID is either generated via uuid.New() or sourced from a high-entropy correlationID passed through the request chain. The Publish() method in common/eventhub/sqlbackend.go handles duplicate EventID collisions idempotently—it checks eventExists() and logs an info message rather than failing, so no events are silently dropped.

If the design intent is to scope events per gateway (allowing the same event_id across different gateways), consider the composite key suggestion below for clarity:

Optional: Composite key for gateway-scoped events
 CREATE TABLE IF NOT EXISTS events (
    gateway_id TEXT NOT NULL,
    processed_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    originated_timestamp TIMESTAMP NOT NULL,
    entity_type TEXT NOT NULL,
    action TEXT NOT NULL CHECK(action IN ('CREATE', 'UPDATE', 'DELETE')),
    entity_id TEXT NOT NULL,
    event_id TEXT NOT NULL,
    event_data TEXT NOT NULL,
-   PRIMARY KEY (event_id),
+   PRIMARY KEY (gateway_id, event_id),
    FOREIGN KEY (gateway_id) REFERENCES gateway_states(gateway_id) ON DELETE CASCADE
 );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/storage/gateway-controller-db.sql` around
lines 237 - 250, The table currently enforces global uniqueness with PRIMARY KEY
(event_id); if you intend event IDs to be scoped per gateway, change the PK to a
composite PRIMARY KEY (gateway_id, event_id) in the events DDL and adjust the
idx_events_gateway_id_processed_timestamp if needed; then update related code
paths—specifically Publish() and eventExists() in
common/eventhub/sqlbackend.go—to query/insert using both gateway_id and event_id
(and any upsert/duplicate-handling logic) so DB constraints and application
logic remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/eventhub/apikey_event.go`:
- Around line 26-40: The composite ID format is ambiguous when apiID or keyID
contain the separator; update BuildAPIKeyEntityID and ParseAPIKeyEntityID to use
an unambiguous format such as length-prefixed API ID: e.g. BuildAPIKeyEntityID
should produce a string like "<apiIDLen>:<apiID>:<keyID>" (use
strconv.Itoa(len(apiID)) and fmt.Sprintf to build it) and ParseAPIKeyEntityID
should parse that format by reading the leading length, validating it,
extracting the exact apiID bytes and returning the remaining keyID, returning an
error on malformed input; update any uses of apiKeyEntityIDSeparator logic
accordingly (references: BuildAPIKeyEntityID, ParseAPIKeyEntityID,
apiKeyEntityIDSeparator).

In `@common/eventhub/sqlbackend.go`:
- Around line 562-595: The loop advances gateway state even when gw.subscribers
is empty; change the logic in the event delivery block (around gw.subscribers,
subscriberChannelsAvailable, deliveryBlocked, latestDeliveredTimestamp,
deliveredCount) to check if len(subscribers) == 0 immediately after reading
gw.subscribers under b.registry.mu.RLock() and return/skip delivery early so you
do not update gw.knownVersion or gw.lastPolled; ensure you release the RLock
before returning and do not acquire the subsequent b.registry.mu.Lock() that
updates the cursor/version when there are no subscribers.
- Around line 282-315: The insert can fail when processed_timestamp (time.Now())
collides for the same gateway; modify the Publish path around
tx.Stmt(b.insertEventStmt).Exec to detect unique-constraint errors on the
(gateway_id, processed_timestamp) key and retry the insert with a fresh
timestamp (or a monotonic sequence) a few times before giving up; specifically,
wrap the Exec call in a small retry loop that updates the processed_timestamp
value and re-executes the prepared statement, still performing proper rollback
on persistent failures, and keep the existing duplicate-event handling
(b.eventExists) for non-timestamp-violation failures.

In `@common/eventhub/topic.go`:
- Around line 71-79: The get method on gatewayRegistry currently returns a live
*gateway pointer (gatewayRegistry.get -> gateway) which lets callers access
mutable fields (subscribers, knownVersion, lastPolled) after the registry lock
is released; change the API to avoid leaking the internal gateway: implement and
use snapshot/update helpers on gatewayRegistry (e.g.,
SnapshotSubscribers(gatewayID) to return a copy of the subscribers slice and
SnapshotState(gatewayID) to return copies of knownVersion/lastPolled) and update
methods (e.g., UpdateLastPolled(gatewayID, ts),
IncrementKnownVersion(gatewayID), RemoveSubscriber(gatewayID, subscriberID))
that take the registry lock internally; update all call sites that used get (and
the similar code at lines ~137-145) to call these new helpers so no caller
iterates or mutates the internal gateway fields without holding the registry
mutex.

In `@gateway/gateway-controller/cmd/controller/main.go`:
- Around line 145-168: Trim and canonicalize cfg.Controller.Server.GatewayID
before EventHub setup and use that trimmed ID everywhere
(registration/publishing/subscribing) instead of the raw value; call
strings.TrimSpace on cfg.Controller.Server.GatewayID, pass the trimmedID into
eventHubInstance.RegisterGateway, check and handle its return/error (fail fast
and log error via log.Error and os.Exit(1) if RegisterGateway returns an error),
and use the same trimmedID in the Info log so the registered value matches what
EventListener.Start() will use.

In `@gateway/gateway-controller/pkg/api/handlers/handlers.go`:
- Around line 732-742: The handler currently treats s.eventHub.PublishEvent(...)
failures as non-fatal after DB mutations (using eventhub.Event with
EventType:eventhub.EventTypeAPI, Action:"UPDATE", EntityID:existing.UUID,
EventID:correlationID), which leaves replicas stale; change the control flow so
PublishEvent errors are part of the request success criteria: if PublishEvent
returns an error, do not return success—either persist the event to the same
transactional outbox/store (e.g., call your outbox save function) before
acknowledging success or return an error response (propagate a 5xx/appropriate
error) so the caller sees failure; apply the same change to the delete branch
around the s.eventHub.PublishEvent call at the other location (lines ~872-882)
so both update and delete require successful event persistence/publish before
replying OK.

In `@gateway/gateway-controller/pkg/controlplane/client.go`:
- Around line 1073-1083: The event-driven branches (code using
c.eventHub.PublishEvent with eventhub.Event in functions that perform
undeploy/delete) currently mutate persistent state first then only log
PublishEvent failures; change this to atomic publish-or-fail semantics by
ensuring PublishEvent succeeds before committing the DB mutation or by staging
the event in a transactional outbox and committing both together. Concretely,
for the helpers that call PublishEvent (identify by c.eventHub, PublishEvent,
gatewayID, and the undeploy/delete helpers and undeployedEvent.CorrelationID),
either: 1) publish the event first and abort/return an error if PublishEvent
fails so the caller falls back to the inline path, or 2) implement an
outbox/transaction pattern where you persist the pending event alongside the
state mutation in one transaction and have a reliable forwarder send the event;
if using option 1 also ensure any partial DB changes are rolled back on
PublishEvent failure and surface the error instead of just logging it.

In `@gateway/gateway-controller/pkg/eventlistener/api_processor.go`:
- Around line 127-162: The DELETE path can skip xDS cleanup when existingConfig
is nil; modify the delete flow so RemoveAPIKeysByAPI on l.apiKeyXDSManager is
invoked even if existingConfig == nil by using tombstone/event payload data or
by extracting apiName/apiVersion from the incoming event before deleting from
l.store (e.g., ensure apiName/apiVersion are read from event and passed to
l.apiKeyXDSManager.RemoveAPIKeysByAPI(entityID, apiName, apiVersion,
event.EventID)); keep the existing l.store.Delete and l.store.RemoveAPIKeysByAPI
calls but add a fallback that calls l.apiKeyXDSManager.RemoveAPIKeysByAPI using
event-provided apiName/apiVersion when extractAPINameVersion(existingConfig)
would return empty.

In `@gateway/gateway-controller/pkg/eventlistener/apikey_processor.go`:
- Around line 155-205: The revoke path currently skips xDS revoke when
apiKeyName is empty (derived only from l.store.GetAPIKeyByID), which breaks
replayed DELETEs; update apikey_processor.go to extract the API key name from
the incoming event payload (event.EventData) before falling back to the
in-memory store so DELETE events are self-contained: parse event.EventData for
the key name into apiKeyName (use the same struct/field used when emitting
deletes), then continue to call l.store.RemoveAPIKeyByID(apiID, keyID) and
RevokeAPIKey (or the existing revoke flow) even if the store lookup failed,
using the name from event.EventData; keep the existing error/logging behavior
for GetAPIKeyByID and RemoveAPIKeyByID but do not skip xDS revoke solely because
apiKeyName is empty if the name is present in event.EventData.

In `@gateway/gateway-controller/pkg/eventlistener/listener.go`:
- Around line 117-121: The Stop() method on EventListener currently only calls
l.cancel() and logs, but does not deregister the channel from the EventHub,
leaving a buffered channel registered that can fill and stall delivery; modify
EventListener.Stop() to call the appropriate unsubscribe method on the EventHub
(e.g., EventHub.Unsubscribe or UnsubscribeAll) for the listener's
subscription(s) before or immediately after l.cancel(), referencing the
EventListener struct's hub/subscription fields to locate the registration and
ensure the backend (common/eventhub/sqlbackend.go) no longer holds the channel
before returning.

In `@gateway/gateway-controller/pkg/storage/sqlite.go`:
- Line 84: The code currently hardcodes currentSchemaVersion = 2 and initSchema
rejects DBs at user_version 1; add an idempotent migration path so existing v1
databases are upgraded instead of rejected. Modify initSchema to read PRAGMA
user_version, and if it's 1 run a migration routine (e.g., migrateV1toV2) that
executes the DDL needed for v2 (create any new tables/indexes, ALTERs if
necessary) inside a transaction, then set PRAGMA user_version = 2; ensure the
migration is safe to re-run (use IF NOT EXISTS for CREATE, check for existing
indexes/tables) and log errors via the same logger used by initSchema.

In `@gateway/gateway-controller/pkg/utils/api_deployment.go`:
- Around line 106-124: The publishEvent function currently swallows PublishEvent
errors which can cause the DB commit to succeed while replicas never receive the
update; change APIDeploymentService.publishEvent to return an error (instead of
void) and stop swallowing the error from s.eventHub.PublishEvent(s.gatewayID,
event), returning it to the caller; update all callers of publishEvent to treat
the call as part of the write transaction (check the returned error and
abort/rollback the DB commit or snapshot update if non-nil) so that failures in
eventHub.PublishEvent cause the overall write to fail rather than silently
succeed.
- Around line 355-375: DeployAPIConfiguration currently bases create-vs-update
and name/handle conflict checks on the local cache (s.store) when s.eventHub !=
nil, causing the async path to send incorrect CREATE/UPDATE events; change the
logic so the existence and conflict checks are performed against the
authoritative config store/DB (the persistent ConfigStore methods you already
have) rather than the in-memory s.store before deciding isUpdate or calling
SaveConfig/publishEvent; specifically, in DeployAPIConfiguration and the
corresponding async branch where you call
s.publishEvent/publishEvent(eventhub.EventTypeAPI, action, ...) and in the other
similar block (lines ~461-476), replace uses of s.store for existence/conflict
determination with a call to the persistent store (e.g.,
s.configStore.Exists/GetAPI or equivalent) and derive action from that result so
the event reflects DB state, keeping the memory-only
snapshotManager.UpdateSnapshot branch unchanged except for using the same
DB-derived decision.

In `@gateway/gateway-controller/pkg/utils/api_key.go`:
- Around line 160-178: publishAPIKeyEvent currently swallows EventHub publish
errors (in APIKeyService.publishAPIKeyEvent), causing DB changes to succeed
without notifying replicas; change publishAPIKeyEvent to return an error and
propagate that error from callers so failures are visible to the HTTP handler.
Specifically, modify publishAPIKeyEvent (and any callers such as methods that
create/update/delete keys) to return error instead of void, check the error
returned from s.eventHub.PublishEvent(...) and return it upward instead of only
logging, and update calling functions to handle/return the error so the API
returns non-2xx when EventHub publish fails.
- Around line 288-324: The async-only branch (when s.eventHub != nil and
publishAPIKeyEvent is called) leaves s.store stale causing
UpdateAPIKey/RegenerateAPIKey/ListAPIKeys to miss recent writes; ensure the same
in-memory/config updates happen in the event-driven path: after
publishAPIKeyEvent, call s.store.StoreAPIKey(apiKey) (and
s.xdsManager.StoreAPIKey(...) if applicable) and handle errors/rollback
similarly to the memory-only branch (including using s.db.RemoveAPIKeyAPIAndName
on failure), or alternatively make the read paths
(UpdateAPIKey/RegenerateAPIKey/ListAPIKeys) fall back to the DB when s.store
lookup returns not-found; update publishAPIKeyEvent, s.store usage, and the read
paths (UpdateAPIKey, RegenerateAPIKey, ListAPIKeys) to keep local state
consistent with DB/async events.
- Around line 817-836: Replace the current delete-on-failure rollback with a
restore of the prior API key record: capture the original API key object before
calling db.UpdateAPIKey, and if s.store.StoreAPIKey(regeneratedKey) fails, call
s.db.UpdateAPIKey(originalKey) (or the equivalent restore method) instead of
s.db.RemoveAPIKeyAPIAndName; keep the same error logging pattern, handle nil
s.db, and surface any restore error in the logger and returned error.

---

Nitpick comments:
In `@gateway/gateway-controller/pkg/controlplane/controlplane_test.go`:
- Around line 41-63: The mockControlPlaneEventHub's PublishEvent appends to
publishedEvents without synchronization, causing races; make the mock
thread-safe by adding a sync.Mutex (or sync.RWMutex) field to the
mockControlPlaneEventHub struct and use it to protect access to publishedEvents
(lock at start of PublishEvent, defer unlock around the check/append/return).
Ensure any other accesses to publishedEvents in tests or helpers also use the
same mutex if present.

In `@gateway/gateway-controller/pkg/storage/gateway-controller-db.sql`:
- Around line 237-250: The table currently enforces global uniqueness with
PRIMARY KEY (event_id); if you intend event IDs to be scoped per gateway, change
the PK to a composite PRIMARY KEY (gateway_id, event_id) in the events DDL and
adjust the idx_events_gateway_id_processed_timestamp if needed; then update
related code paths—specifically Publish() and eventExists() in
common/eventhub/sqlbackend.go—to query/insert using both gateway_id and event_id
(and any upsert/duplicate-handling logic) so DB constraints and application
logic remain consistent.

In `@gateway/gateway-controller/pkg/storage/interface.go`:
- Around line 273-276: Remove the SQL-specific method from the general Storage
abstraction: delete GetDB() from the Storage interface and introduce a narrower
interface named DBHandleProvider that exposes GetDB() *sql.DB; update any
consumers that currently call Storage.GetDB() (notably EventHub wiring code) to
first type-assert the storage instance to storage.DBHandleProvider and only then
call GetDB(), leaving non-SQL backends unaffected.

In `@gateway/gateway-controller/tests/integration/schema_test.go`:
- Around line 103-108: In the SchemaVersion test (t.Run "SchemaVersion") after
checking PRAGMA user_version via rawDB, add assertions that the new v2 tables
'events' and 'gateway_states' exist and that their expected indexes are present
by querying sqlite_master (or PRAGMA index_list/PRAGMA index_info) using
rawDB.QueryRow/Query and asserting no error and non-empty results; reference the
test's rawDB variable and the t.Run "SchemaVersion" block so the checks run
alongside the version assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ddc0df8-4b2a-4e60-a124-6979a3161fe5

📥 Commits

Reviewing files that changed from the base of the PR and between c621181 and d21bbd3.

📒 Files selected for processing (31)
  • common/eventhub/apikey_event.go
  • common/eventhub/apikey_event_test.go
  • common/eventhub/backend.go
  • common/eventhub/eventhub.go
  • common/eventhub/eventhub_test.go
  • common/eventhub/sqlbackend.go
  • common/eventhub/topic.go
  • common/eventhub/types.go
  • gateway/gateway-controller/cmd/controller/main.go
  • gateway/gateway-controller/pkg/api/handlers/handlers.go
  • gateway/gateway-controller/pkg/api/handlers/handlers_test.go
  • gateway/gateway-controller/pkg/controlplane/api_deleted_test.go
  • gateway/gateway-controller/pkg/controlplane/client.go
  • gateway/gateway-controller/pkg/controlplane/client_integration_test.go
  • gateway/gateway-controller/pkg/controlplane/controlplane_test.go
  • gateway/gateway-controller/pkg/eventlistener/api_processor.go
  • gateway/gateway-controller/pkg/eventlistener/api_processor_test.go
  • gateway/gateway-controller/pkg/eventlistener/apikey_processor.go
  • gateway/gateway-controller/pkg/eventlistener/apikey_processor_test.go
  • gateway/gateway-controller/pkg/eventlistener/branches_test.go
  • gateway/gateway-controller/pkg/eventlistener/listener.go
  • gateway/gateway-controller/pkg/eventlistener/listener_test.go
  • gateway/gateway-controller/pkg/storage/gateway-controller-db.postgres.sql
  • gateway/gateway-controller/pkg/storage/gateway-controller-db.sql
  • gateway/gateway-controller/pkg/storage/interface.go
  • gateway/gateway-controller/pkg/storage/sql_store.go
  • gateway/gateway-controller/pkg/storage/sqlite.go
  • gateway/gateway-controller/pkg/storage/sqlite_test.go
  • gateway/gateway-controller/pkg/utils/api_deployment.go
  • gateway/gateway-controller/pkg/utils/api_key.go
  • gateway/gateway-controller/tests/integration/schema_test.go

Comment thread common/eventhub/apikey_event.go Outdated
Comment thread common/eventhub/sqlbackend.go
Comment thread common/eventhub/sqlbackend.go
Comment thread common/eventhub/topic.go
Comment on lines +145 to +168
// Initialize EventHub for multi-replica sync (requires persistent storage)
var eventHubInstance eventhub.EventHub
var eventHubStorage storage.Storage
if cfg.IsPersistentMode() {
// Create separate storage connection for EventHub (avoids SQLite lock contention)
eventHubStorage, err = storage.NewStorage(toBackendConfig(cfg), log)
if err != nil {
log.Error("Failed to initialize EventHub storage", slog.Any("error", err))
os.Exit(1)
}
eventHubDB := eventHubStorage.GetDB()
if eventHubDB != nil {
eventHubInstance = eventhub.New(eventHubDB, log, eventhub.DefaultConfig())
if err := eventHubInstance.Initialize(); err != nil {
log.Error("Failed to initialize EventHub", slog.Any("error", err))
os.Exit(1)
}
eventHubInstance.RegisterGateway(cfg.Controller.Server.GatewayID)
log.Info("EventHub initialized for multi-replica sync",
slog.String("gateway_id", cfg.Controller.Server.GatewayID))
} else {
log.Warn("EventHub storage returned nil DB, skipping EventHub initialization")
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Canonicalize and verify gateway_id before EventHub setup.

EventListener.Start() trims controller.server.gateway_id before subscribing, but this path registers the raw value and ignores the registration result. Use one trimmed ID for registration/publishing/subscribing and fail fast on RegisterGateway errors; otherwise a value with surrounding spaces can break replica delivery silently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/cmd/controller/main.go` around lines 145 - 168,
Trim and canonicalize cfg.Controller.Server.GatewayID before EventHub setup and
use that trimmed ID everywhere (registration/publishing/subscribing) instead of
the raw value; call strings.TrimSpace on cfg.Controller.Server.GatewayID, pass
the trimmedID into eventHubInstance.RegisterGateway, check and handle its
return/error (fail fast and log error via log.Error and os.Exit(1) if
RegisterGateway returns an error), and use the same trimmedID in the Info log so
the registered value matches what EventListener.Start() will use.

Comment thread gateway/gateway-controller/pkg/utils/api_deployment.go
Comment thread gateway/gateway-controller/pkg/utils/api_deployment.go
Comment thread gateway/gateway-controller/pkg/utils/api_key.go
Comment thread gateway/gateway-controller/pkg/utils/api_key.go
Comment thread gateway/gateway-controller/pkg/utils/api_key.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
gateway/gateway-controller/pkg/utils/api_key.go (3)

1396-1413: ⚠️ Potential issue | 🟡 Minor

ExternalRefId not preserved during update.

The updateAPIKeyFromRequest method creates a new APIKey struct but doesn't copy existingKey.ExternalRefId. Based on learnings, APIKey.ExternalRefId should remain immutable after creation.

🔧 Suggested fix
 	updatedKey := &models.APIKey{
 		UUID:         existingKey.UUID,
 		Name:         existingKey.Name,
 		DisplayName:  displayName,
 		APIKey:       hashedAPIKeyValue,
 		MaskedAPIKey: maskedAPIKeyValue,
 		ArtifactUUID: existingKey.ArtifactUUID,
 		Operations:   operations,
 		Status:       models.APIKeyStatusActive,
 		CreatedAt:    existingKey.CreatedAt,
 		CreatedBy:    existingKey.CreatedBy,
 		UpdatedAt:    now,
 		ExpiresAt:    expiresAt,
 		Unit:         unit,
 		Duration:     duration,
 		Source:       existingKey.Source,
+		ExternalRefId: existingKey.ExternalRefId,
 	}

Based on learnings: "In gateway/gateway-controller/pkg/utils/api_key.go, ensure APIKey.Issuer and APIKey.ExternalRefId remain immutable after creation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 1396 - 1413,
The updateAPIKeyFromRequest function builds a new models.APIKey (updatedKey) but
fails to preserve immutable fields; ensure you copy existingKey.ExternalRefId
and existingKey.Issuer into updatedKey (e.g., set ExternalRefId:
existingKey.ExternalRefId and Issuer: existingKey.Issuer) so those values remain
unchanged when regenerating/updating the key, while keeping other updated fields
(APIKey, MaskedAPIKey, UpdatedAt, etc.) as-is.

422-428: ⚠️ Potential issue | 🟠 Major

RevokeAPIKey uses store without DB fallback.

Unlike CreateAPIKey, UpdateAPIKey, and RegenerateAPIKey which use getAPIConfigByHandle (with DB fallback), RevokeAPIKey directly uses s.store.GetByHandle. In event-driven mode, this can fail to find recently deployed APIs that haven't yet been processed by the EventListener.

🔧 Suggested fix
 	// Validate that API exists
-	config, err := s.store.GetByHandle(params.Handle)
+	config, err := s.getAPIConfigByHandle(params.Handle, logger)
 	if err != nil {
-		logger.Warn("API configuration not found for API key revocation",
-			slog.Any("error", err))
-		return nil, fmt.Errorf("API configuration handle '%s' not found", params.Handle)
+		if storage.IsNotFoundError(err) {
+			logger.Warn("API configuration not found for API key revocation",
+				slog.Any("error", err))
+			return nil, fmt.Errorf("API configuration handle '%s' not found", params.Handle)
+		}
+		logger.Error("Failed to retrieve API configuration for API key revocation",
+			slog.Any("error", err))
+		return nil, fmt.Errorf("failed to retrieve API configuration for handle '%s': %w", params.Handle, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 422 - 428,
RevokeAPIKey currently calls s.store.GetByHandle which skips the DB fallback and
can miss newly deployed APIs; change it to call getAPIConfigByHandle(ctx, s,
params.Handle) (the same helper used by
CreateAPIKey/UpdateAPIKey/RegenerateAPIKey) and handle its returned config and
error the same way (log via logger.Warn/slog.Any and return a formatted error if
not found) so event-driven mode will fall back to the DB when needed.

951-958: ⚠️ Potential issue | 🟠 Major

ListAPIKeys uses store without DB fallback.

Similar to RevokeAPIKey, this method uses s.store.GetByHandle directly instead of getAPIConfigByHandle. This creates inconsistent behavior across the API key operations.

🔧 Suggested fix
 	// Validate that API exists
-	config, err := s.store.GetByHandle(params.Handle)
+	config, err := s.getAPIConfigByHandle(params.Handle, logger)
 	if err != nil {
-		logger.Warn("API configuration not found for API keys listing",
-			slog.String("handle", params.Handle),
-			slog.String("correlation_id", params.CorrelationID))
-		return nil, fmt.Errorf("API configuration handle '%s' not found", params.Handle)
+		if storage.IsNotFoundError(err) {
+			logger.Warn("API configuration not found for API keys listing",
+				slog.String("handle", params.Handle),
+				slog.String("correlation_id", params.CorrelationID))
+			return nil, fmt.Errorf("API configuration handle '%s' not found", params.Handle)
+		}
+		logger.Error("Failed to retrieve API configuration for API keys listing",
+			slog.Any("error", err))
+		return nil, fmt.Errorf("failed to retrieve API configuration for handle '%s': %w", params.Handle, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 951 - 958, In
ListAPIKeys, replace the direct call to s.store.GetByHandle with the helper
getAPIConfigByHandle to ensure DB fallback and consistent behavior (locate the
code around the ListAPIKeys function where s.store.GetByHandle is used); use the
returned config and error from getAPIConfigByHandle and preserve existing
logger/error returns so ListAPIKeys mirrors RevokeAPIKey's behavior when a
config is not found or fallback is required.
gateway/gateway-controller/pkg/utils/api_deployment.go (1)

206-225: ⚠️ Potential issue | 🟠 Major

Conflict checks use in-memory store, not authoritative DB.

In event-driven mode, the in-memory s.store may be stale. The conflict checks for name/version (line 210) and handle (line 217) still query s.store, which can miss recently created APIs or allow false conflicts.

This is related to the past review comment about ConfigStore not being authoritative in async mode. While saveOrUpdateConfig was updated to check the DB for existence (line 439), the conflict detection in DeployAPIConfiguration still relies on the potentially stale in-memory store.

🔧 Suggested approach

When s.eventHub != nil, the conflict checks should also query the database:

if s.eventHub != nil && s.db != nil {
    // Check DB for name/version conflicts
    if conflicting, err := s.db.GetConfigByNameVersion(apiName, apiVersion); err == nil {
        if !isUpdate || conflicting.UUID != apiID {
            return nil, fmt.Errorf("%w: configuration with name '%s' and version '%s' already exists", storage.ErrConflict, apiName, apiVersion)
        }
    }
    // Similar for handle conflicts
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_deployment.go` around lines 206 -
225, The conflict checks in DeployAPIConfiguration currently query only the
in-memory s.store (using GetByNameVersion and GetAll) which can be stale in
event-driven mode; update the logic so when s.eventHub != nil (async mode) and
s.db != nil you also query the authoritative DB (e.g., call
s.db.GetConfigByNameVersion for name/version and the DB equivalent for handle)
and apply the same isUpdate/UUID check as with s.store, ensuring both s.store
and s.db are consulted before returning storage.ErrConflict; this mirrors the
approach used in saveOrUpdateConfig.
🧹 Nitpick comments (3)
gateway/gateway-controller/pkg/utils/api_key_test.go (1)

881-893: Consider adding cleanup for SQLite storage.

The newTestSQLiteStorage helper creates a SQLite database but doesn't register a cleanup function to close it. While the temporary directory will be removed, explicitly closing the database connection ensures proper resource cleanup.

♻️ Suggested improvement
 func newTestSQLiteStorage(t *testing.T, logger *slog.Logger) storage.Storage {
 	t.Helper()

 	dbPath := filepath.Join(t.TempDir(), "apikey.db")
 	db, err := storage.NewStorage(storage.BackendConfig{
 		Type:       "sqlite",
 		SQLitePath: dbPath,
 	}, logger)
 	if err != nil {
 		t.Fatalf("failed to create sqlite storage: %v", err)
 	}
+	t.Cleanup(func() {
+		if closer, ok := db.(interface{ Close() error }); ok {
+			closer.Close()
+		}
+	})
 	return db
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key_test.go` around lines 881 - 893,
The helper newTestSQLiteStorage currently creates a storage.Storage via
storage.NewStorage but doesn’t close it; register a t.Cleanup callback that
closes the returned storage (call the storage’s Close method) so the DB
connection is closed when the test ends. Update newTestSQLiteStorage to call
t.Cleanup(func(){ _ = db.Close() }) (or handle Close error as appropriate) after
db is created, referencing the newTestSQLiteStorage function and the
storage.NewStorage / storage.Storage Close method.
common/eventhub/eventhub_test.go (1)

34-62: Duplicate test setup functions with identical implementations.

setupTestDB and setupTestDBAllowTimestampOverlap have identical implementations. Consider consolidating into a single function or documenting why two variants exist.

♻️ Suggested consolidation
-func setupTestDBAllowTimestampOverlap(t *testing.T) *sql.DB {
-	t.Helper()
-	db, err := sql.Open("sqlite3", ":memory:?_journal_mode=WAL&_busy_timeout=5000&_foreign_keys=ON")
-	require.NoError(t, err)
-
-	_, err = db.Exec(`
-		CREATE TABLE IF NOT EXISTS gateway_states (
-			gateway_id TEXT PRIMARY KEY,
-			version_id TEXT NOT NULL DEFAULT '',
-			updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
-		);
-		CREATE TABLE IF NOT EXISTS events (
-			gateway_id TEXT NOT NULL,
-			processed_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
-			originated_timestamp TIMESTAMP NOT NULL,
-			entity_type TEXT NOT NULL,
-			action TEXT NOT NULL CHECK(action IN ('CREATE', 'UPDATE', 'DELETE')),
-			entity_id TEXT NOT NULL,
-			event_id TEXT NOT NULL,
-			event_data TEXT NOT NULL,
-			PRIMARY KEY (event_id)
-		);
-	`)
-	require.NoError(t, err)
-
-	t.Cleanup(func() { db.Close() })
-	return db
-}
+// setupTestDBAllowTimestampOverlap is an alias for setupTestDB.
+// Retained for test readability in boundary-overlap scenarios.
+var setupTestDBAllowTimestampOverlap = setupTestDB

Also applies to: 64-91

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/eventhub/eventhub_test.go` around lines 34 - 62, The two test helpers
setupTestDB and setupTestDBAllowTimestampOverlap are identical; consolidate them
by removing the duplicate and either rename setupTestDBAllowTimestampOverlap to
setupTestDB (or replace both with a single setupTestDB that accepts an options
flag like allowTimestampOverlap bool), update all test callers to use the single
helper (or pass the option), and ensure the shared behavior (table creation and
cleanup in setupTestDB) is preserved; adjust any tests that relied on the
separate name to use the unified helper to avoid duplicate code.
common/eventhub/sqlbackend.go (1)

672-694: Potential race in Close during channel cleanup.

The Close method iterates b.registry.getAll() outside of the lock, then acquires b.registry.mu.Lock() per gateway. A concurrent operation could modify the registry between these steps. Consider holding the lock for the entire cleanup loop.

♻️ Suggested fix
 // Close gracefully shuts down the backend
 func (b *SQLBackend) Close() error {
 	b.cancel()
 	b.wg.Wait()

 	// Close all subscriber channels
+	b.registry.mu.Lock()
 	for _, gw := range b.registry.getAll() {
-		b.registry.mu.Lock()
 		for _, ch := range gw.subscribers {
 			close(ch)
 		}
 		gw.subscribers = nil
-		b.registry.mu.Unlock()
 	}
+	b.registry.mu.Unlock()

 	// Close prepared statements
 	b.stmtMu.Lock()
 	defer b.stmtMu.Unlock()
 	b.closeStatements()

 	b.logger.Info("SQL event hub backend closed")
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/eventhub/sqlbackend.go` around lines 672 - 694, The Close method
currently calls b.registry.getAll() without holding b.registry.mu, then locks
per gateway which allows the registry to change concurrently; to fix, acquire
b.registry.mu once before retrieving/iterating the gateways (or modify getAll to
return a deep copy) so the cleanup loop runs under the same lock, then close
each gw.subscribers and set gw.subscribers = nil while still holding the lock;
reference SQLBackend.Close, b.registry.getAll, registry.mu and gw.subscribers
when locating the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/eventhub/sqlbackend.go`:
- Around line 44-68: Add a compile-time interface satisfaction assertion for
SQLBackend by adding a package-level var like var _ EventhubImpl =
(*SQLBackend)(nil) (use the exact types EventhubImpl and SQLBackend) immediately
after the SQLBackend type declaration so the compiler will fail if SQLBackend no
longer implements EventhubImpl.

---

Outside diff comments:
In `@gateway/gateway-controller/pkg/utils/api_deployment.go`:
- Around line 206-225: The conflict checks in DeployAPIConfiguration currently
query only the in-memory s.store (using GetByNameVersion and GetAll) which can
be stale in event-driven mode; update the logic so when s.eventHub != nil (async
mode) and s.db != nil you also query the authoritative DB (e.g., call
s.db.GetConfigByNameVersion for name/version and the DB equivalent for handle)
and apply the same isUpdate/UUID check as with s.store, ensuring both s.store
and s.db are consulted before returning storage.ErrConflict; this mirrors the
approach used in saveOrUpdateConfig.

In `@gateway/gateway-controller/pkg/utils/api_key.go`:
- Around line 1396-1413: The updateAPIKeyFromRequest function builds a new
models.APIKey (updatedKey) but fails to preserve immutable fields; ensure you
copy existingKey.ExternalRefId and existingKey.Issuer into updatedKey (e.g., set
ExternalRefId: existingKey.ExternalRefId and Issuer: existingKey.Issuer) so
those values remain unchanged when regenerating/updating the key, while keeping
other updated fields (APIKey, MaskedAPIKey, UpdatedAt, etc.) as-is.
- Around line 422-428: RevokeAPIKey currently calls s.store.GetByHandle which
skips the DB fallback and can miss newly deployed APIs; change it to call
getAPIConfigByHandle(ctx, s, params.Handle) (the same helper used by
CreateAPIKey/UpdateAPIKey/RegenerateAPIKey) and handle its returned config and
error the same way (log via logger.Warn/slog.Any and return a formatted error if
not found) so event-driven mode will fall back to the DB when needed.
- Around line 951-958: In ListAPIKeys, replace the direct call to
s.store.GetByHandle with the helper getAPIConfigByHandle to ensure DB fallback
and consistent behavior (locate the code around the ListAPIKeys function where
s.store.GetByHandle is used); use the returned config and error from
getAPIConfigByHandle and preserve existing logger/error returns so ListAPIKeys
mirrors RevokeAPIKey's behavior when a config is not found or fallback is
required.

---

Nitpick comments:
In `@common/eventhub/eventhub_test.go`:
- Around line 34-62: The two test helpers setupTestDB and
setupTestDBAllowTimestampOverlap are identical; consolidate them by removing the
duplicate and either rename setupTestDBAllowTimestampOverlap to setupTestDB (or
replace both with a single setupTestDB that accepts an options flag like
allowTimestampOverlap bool), update all test callers to use the single helper
(or pass the option), and ensure the shared behavior (table creation and cleanup
in setupTestDB) is preserved; adjust any tests that relied on the separate name
to use the unified helper to avoid duplicate code.

In `@common/eventhub/sqlbackend.go`:
- Around line 672-694: The Close method currently calls b.registry.getAll()
without holding b.registry.mu, then locks per gateway which allows the registry
to change concurrently; to fix, acquire b.registry.mu once before
retrieving/iterating the gateways (or modify getAll to return a deep copy) so
the cleanup loop runs under the same lock, then close each gw.subscribers and
set gw.subscribers = nil while still holding the lock; reference
SQLBackend.Close, b.registry.getAll, registry.mu and gw.subscribers when
locating the change.

In `@gateway/gateway-controller/pkg/utils/api_key_test.go`:
- Around line 881-893: The helper newTestSQLiteStorage currently creates a
storage.Storage via storage.NewStorage but doesn’t close it; register a
t.Cleanup callback that closes the returned storage (call the storage’s Close
method) so the DB connection is closed when the test ends. Update
newTestSQLiteStorage to call t.Cleanup(func(){ _ = db.Close() }) (or handle
Close error as appropriate) after db is created, referencing the
newTestSQLiteStorage function and the storage.NewStorage / storage.Storage Close
method.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2d375b96-256a-4ebf-a569-8a3f8436c924

📥 Commits

Reviewing files that changed from the base of the PR and between d21bbd3 and 6d9060b.

📒 Files selected for processing (5)
  • common/eventhub/eventhub_test.go
  • common/eventhub/sqlbackend.go
  • gateway/gateway-controller/pkg/utils/api_deployment.go
  • gateway/gateway-controller/pkg/utils/api_key.go
  • gateway/gateway-controller/pkg/utils/api_key_test.go

Comment thread common/eventhub/sqlbackend.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (5)
gateway/gateway-controller/pkg/utils/api_deployment.go (1)

106-124: ⚠️ Potential issue | 🔴 Critical

Treat EventHub publish as part of the deployment write.

In async mode this is the only trigger that tells replicas to rebuild snapshots. Logging and continuing here lets the DB save succeed while no gateway ever applies the config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_deployment.go` around lines 106 -
124, publishEvent currently swallows publish failures which allows DB writes to
succeed without notifying replicas; change publishEvent (in
APIDeploymentService) to return error and propagate that error to its callers so
the deployment write is aborted/rolled back when PublishEvent fails.
Specifically, update func publishEvent(...) to return error (and include the
eventHub.PublishEvent error) and modify the APIDeploymentService methods that
call publishEvent (e.g., Create/Update/Delete deployment handlers) to check the
returned error and cease/rollback the DB transaction and return failure to the
caller when publishing fails.
gateway/gateway-controller/pkg/utils/api_key.go (3)

351-354: ⚠️ Potential issue | 🟠 Major

Async API-key writes still leave this replica stale.

These branches stop updating s.store inline, but the follow-up read paths still rely on the local cache first. A write can therefore be invisible on the same replica until the listener catches up, causing false 404s or the create-on-update path to run unexpectedly.

Also applies to: 685-688, 894-897

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 351 - 354, The
async branch that calls s.publishAPIKeyEvent("CREATE", apiId, apiKey.UUID,
params.CorrelationID, logger) does not update the local cache s.store inline,
leaving the replica stale and causing immediate reads to miss the change; before
publishing the event, write the new API key into s.store (use the same
store-upsert/save method used by the synchronous branch) so the local cache
reflects the create immediately, and apply the same fix to the other
event-publishing branches that omit s.store updates (the analogous blocks around
the other publish calls).

218-236: ⚠️ Potential issue | 🔴 Critical

Treat EventHub publish as part of the API-key write.

In EventHub mode this is the only signal that causes replicas to apply the create/update/delete. Logging and continuing here lets the DB mutation succeed while no gateway ever observes it, so callers can get a success response for a key that never becomes usable or revoked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 218 - 236,
publishAPIKeyEvent currently swallows publish errors which lets DB writes
succeed without informing replicas; change publishAPIKeyEvent to return error
(func (s *APIKeyService) publishAPIKeyEvent(...) error) and stop silently
logging-and-continuing: call s.eventHub.PublishEvent and on error log and return
that error; update all callers (e.g., CreateAPIKey, UpdateAPIKey, DeleteAPIKey)
to call the new publishAPIKeyEvent, propagate the error up to the API write
path, and ensure the write either rolls back or returns a failure to the client
when PublishEvent fails so the EventHub publish is treated as part of the
API-key write.

899-913: ⚠️ Potential issue | 🔴 Critical

Rollback regeneration to the previous key, not a delete.

At this point the database already contains regeneratedKey. If StoreAPIKey fails, deleting by API/name removes the key entirely instead of restoring the prior credential.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 899 - 913,
Before calling s.store.StoreAPIKey with regeneratedKey, capture the existing API
key (e.g., previousKey) so you can restore it if storage fails; replace the
current rollback that calls
s.db.RemoveAPIKeyAPIAndName(regeneratedKey.ArtifactUUID, regeneratedKey.Name)
with logic that re-inserts or upserts the previousKey into the DB (use/implement
a DB method such as s.db.StoreAPIKey / s.db.UpsertAPIKey or s.db.InsertAPIKey)
so the original credential is restored on failure, and keep the existing error
logging using params.CorrelationID and params.Handle for context.
common/eventhub/sqlbackend.go (1)

562-597: ⚠️ Potential issue | 🔴 Critical

Don't advance gateway state when there are no subscribers.

With an empty subscriber set this loop still updates latestDeliveredTimestamp, deliveredCount, and then knownVersion/lastPolled. Any event published before a listener subscribes, or after all subscribers disconnect, will be skipped permanently once a subscriber appears.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/eventhub/sqlbackend.go` around lines 562 - 597, The loop currently
advances gateway state even when there are no subscribers; update the code so
that if gw.subscribers is empty you do not modify latestDeliveredTimestamp,
deliveredCount, gw.knownVersion or gw.lastPolled. Specifically, before entering
the per-event delivery logic (and before updating
latestDeliveredTimestamp/deliveredCount) check len(subscribers) == 0 (or
subscriberChannelsAvailable(subscribers) && len(subscribers) > 0) and skip
delivery and all state updates for that gateway when empty; keep the existing
behavior for channel-full (deliveryBlocked) cases but ensure
gw.knownVersion/gw.lastPolled are only advanced when an actual delivery to at
least one subscriber occurred (use deliveredSubscriberCount or a boolean
deliveryMade to gate the final b.registry.mu.Lock() updates).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/eventhub/eventhub_test.go`:
- Around line 40-57: The test DDL in eventhub_test.go omits the production
constraints: update the CREATE TABLE for events used in the test (the DDL string
in the test helper in eventhub_test.go) to match production by adding a FOREIGN
KEY on events.gateway_id referencing gateway_states(gateway_id) and creating the
(gateway_id, processed_timestamp) index (and do the same for the second DDL
block around lines 69-86); modify the SQL strings used in the test helpers so
the events table includes "FOREIGN KEY (gateway_id) REFERENCES
gateway_states(gateway_id)" and add "CREATE INDEX IF NOT EXISTS
idx_events_gateway_processed ON events (gateway_id, processed_timestamp);"
immediately after the table creation to mirror the production schema.

In `@gateway/gateway-controller/pkg/utils/api_key.go`:
- Around line 499-505: The DB row is being deleted immediately after calling
RemoveAPIKeyAPIAndName which makes revocation unrecoverable if downstream
propagation (EventHub/xDS) fails; instead, update the flow in the revoke
function to (1) mark the API key as revoked in the DB (keep the row), (2)
perform propagation to EventHub/xDS and wait for success, and only on successful
propagation call s.db.RemoveAPIKeyAPIAndName to delete the row, and (3) if
propagation fails, either leave the row marked revoked or attempt to restore the
previous state and return an error. Locate the revoke logic that touches
s.db.RemoveAPIKeyAPIAndName and the surrounding calls that publish to
EventHub/xDS and change the order/transactionality so deletion happens last and
rollback/restore is attempted on failure.
- Around line 1398-1403: When reconstructing a models.APIKey from an existingKey
(the helpers that build the struct using UUID, Name, DisplayName,
APIKey/MaskedAPIKey, ArtifactUUID and variables like displayName,
hashedAPIKeyValue, maskedAPIKeyValue), also copy immutable metadata fields from
existingKey by setting Issuer = existingKey.Issuer and ExternalRefId =
existingKey.ExternalRefId so those values are preserved on update/regeneration;
apply this change to both places where the APIKey struct is rebuilt (the block
around UUID..ArtifactUUID and the similar block at 1532-1536).

---

Duplicate comments:
In `@common/eventhub/sqlbackend.go`:
- Around line 562-597: The loop currently advances gateway state even when there
are no subscribers; update the code so that if gw.subscribers is empty you do
not modify latestDeliveredTimestamp, deliveredCount, gw.knownVersion or
gw.lastPolled. Specifically, before entering the per-event delivery logic (and
before updating latestDeliveredTimestamp/deliveredCount) check len(subscribers)
== 0 (or subscriberChannelsAvailable(subscribers) && len(subscribers) > 0) and
skip delivery and all state updates for that gateway when empty; keep the
existing behavior for channel-full (deliveryBlocked) cases but ensure
gw.knownVersion/gw.lastPolled are only advanced when an actual delivery to at
least one subscriber occurred (use deliveredSubscriberCount or a boolean
deliveryMade to gate the final b.registry.mu.Lock() updates).

In `@gateway/gateway-controller/pkg/utils/api_deployment.go`:
- Around line 106-124: publishEvent currently swallows publish failures which
allows DB writes to succeed without notifying replicas; change publishEvent (in
APIDeploymentService) to return error and propagate that error to its callers so
the deployment write is aborted/rolled back when PublishEvent fails.
Specifically, update func publishEvent(...) to return error (and include the
eventHub.PublishEvent error) and modify the APIDeploymentService methods that
call publishEvent (e.g., Create/Update/Delete deployment handlers) to check the
returned error and cease/rollback the DB transaction and return failure to the
caller when publishing fails.

In `@gateway/gateway-controller/pkg/utils/api_key.go`:
- Around line 351-354: The async branch that calls
s.publishAPIKeyEvent("CREATE", apiId, apiKey.UUID, params.CorrelationID, logger)
does not update the local cache s.store inline, leaving the replica stale and
causing immediate reads to miss the change; before publishing the event, write
the new API key into s.store (use the same store-upsert/save method used by the
synchronous branch) so the local cache reflects the create immediately, and
apply the same fix to the other event-publishing branches that omit s.store
updates (the analogous blocks around the other publish calls).
- Around line 218-236: publishAPIKeyEvent currently swallows publish errors
which lets DB writes succeed without informing replicas; change
publishAPIKeyEvent to return error (func (s *APIKeyService)
publishAPIKeyEvent(...) error) and stop silently logging-and-continuing: call
s.eventHub.PublishEvent and on error log and return that error; update all
callers (e.g., CreateAPIKey, UpdateAPIKey, DeleteAPIKey) to call the new
publishAPIKeyEvent, propagate the error up to the API write path, and ensure the
write either rolls back or returns a failure to the client when PublishEvent
fails so the EventHub publish is treated as part of the API-key write.
- Around line 899-913: Before calling s.store.StoreAPIKey with regeneratedKey,
capture the existing API key (e.g., previousKey) so you can restore it if
storage fails; replace the current rollback that calls
s.db.RemoveAPIKeyAPIAndName(regeneratedKey.ArtifactUUID, regeneratedKey.Name)
with logic that re-inserts or upserts the previousKey into the DB (use/implement
a DB method such as s.db.StoreAPIKey / s.db.UpsertAPIKey or s.db.InsertAPIKey)
so the original credential is restored on failure, and keep the existing error
logging using params.CorrelationID and params.Handle for context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c4c736bc-d050-4d35-a3a0-23cbaf037647

📥 Commits

Reviewing files that changed from the base of the PR and between 6d9060b and bf2a4de.

📒 Files selected for processing (5)
  • common/eventhub/eventhub_test.go
  • common/eventhub/sqlbackend.go
  • gateway/gateway-controller/pkg/utils/api_deployment.go
  • gateway/gateway-controller/pkg/utils/api_key.go
  • gateway/gateway-controller/pkg/utils/api_key_test.go

Comment thread common/eventhub/eventhub_test.go
Comment thread gateway/gateway-controller/pkg/utils/api_key.go Outdated
Comment thread gateway/gateway-controller/pkg/utils/api_key.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (8)
common/eventhub/sqlbackend.go (1)

562-597: ⚠️ Potential issue | 🔴 Critical

Don't advance the cursor when nobody is subscribed.

With len(subscribers) == 0, this loop still counts events as delivered and updates knownVersion/lastPolled, so events published before the listener subscribes can be skipped permanently.

🛠️ Suggested change
 	b.registry.mu.RLock()
 	subscribers := gw.subscribers
+	if len(subscribers) == 0 {
+		b.registry.mu.RUnlock()
+		return nil
+	}
 	for _, evt := range events {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/eventhub/sqlbackend.go` around lines 562 - 597, The current loop
treats events as delivered even when gw.subscribers is empty, causing
gw.knownVersion and gw.lastPolled to advance and skip events; change the logic
in the delivery loop to detect zero subscribers (e.g., check
len(gw.subscribers)==0 or use subscriberChannelsAvailable(subscribers) before
counting deliveries) and avoid updating deliveredCount,
latestDeliveredTimestamp, gw.knownVersion or gw.lastPolled when there are no
subscribers; preserve setting of deliveryBlocked and only update
gw.lastPolled/knownVersion inside the b.registry.mu.Lock block when
deliveredCount>0 (or latestDeliveredTimestamp non-zero and subscribers existed)
so cursor advances only when actual deliveries to gw.subscribers occurred.
common/eventhub/eventhub_test.go (1)

40-57: ⚠️ Potential issue | 🟠 Major

Keep the test DDL aligned with production.

These helpers omit the events.gateway_id foreign key and the (gateway_id, processed_timestamp) index, so the tests are exercising a schema the real backend does not run with.

🛠️ Suggested change
 		CREATE TABLE IF NOT EXISTS events (
 			gateway_id TEXT NOT NULL,
 			processed_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
 			originated_timestamp TIMESTAMP NOT NULL,
 			entity_type TEXT NOT NULL,
 			action TEXT NOT NULL CHECK(action IN ('CREATE', 'UPDATE', 'DELETE')),
 			entity_id TEXT NOT NULL,
 			event_id TEXT NOT NULL,
 			event_data TEXT NOT NULL,
-			PRIMARY KEY (event_id)
+			PRIMARY KEY (event_id),
+			FOREIGN KEY (gateway_id) REFERENCES gateway_states(gateway_id)
 		);
+		CREATE INDEX IF NOT EXISTS idx_events_gateway_processed
+			ON events (gateway_id, processed_timestamp);

Also applies to: 69-86

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/eventhub/eventhub_test.go` around lines 40 - 57, The test DDL in
eventhub_test.go defines gateway_states and events but omits the production
constraints — add a foreign key on events.gateway_id referencing
gateway_states(gateway_id) and create the (gateway_id, processed_timestamp)
index so tests use the real schema; specifically update the CREATE TABLE for
events to include "FOREIGN KEY(gateway_id) REFERENCES
gateway_states(gateway_id)" and after the tables run "CREATE INDEX IF NOT EXISTS
idx_events_gateway_processed ON events(gateway_id, processed_timestamp)"; make
the same changes in the second helper DDL block as well so both occurrences
match production.
gateway/gateway-controller/pkg/service/restapi/service.go (1)

494-533: ⚠️ Potential issue | 🔴 Critical

Don't make event publication best-effort in the only sync path.

Update and Delete now stop after the DB mutation and rely on this helper for replica convergence. If eventHub is nil, gatewayID is blank, or PublishEvent fails, the method still returns success and this replica never updates its in-memory/xDS state. Either fail the request or fall back to the old inline sync path.

🛠️ Suggested direction
-func (s *RestAPIService) publishEvent(eventType eventhub.EventType, action, entityID, correlationID string, logger *slog.Logger) {
+func (s *RestAPIService) publishEvent(eventType eventhub.EventType, action, entityID, correlationID string, logger *slog.Logger) error {
 	if s.eventHub == nil {
-		return
+		return fmt.Errorf("event hub not configured")
 	}

 	gatewayID := strings.TrimSpace(s.systemConfig.Controller.Server.GatewayID)
 	if gatewayID == "" {
-		logger.Warn("Skipping event hub publish because gateway ID is not configured",
-			slog.String("event_type", string(eventType)),
-			slog.String("action", action),
-			slog.String("entity_id", entityID))
-		return
+		return fmt.Errorf("gateway ID is not configured")
 	}
@@
-	if err := s.eventHub.PublishEvent(gatewayID, event); err != nil {
-		logger.Warn("Failed to publish event to event hub",
-			slog.String("gateway_id", gatewayID),
-			slog.String("event_type", string(eventType)),
-			slog.String("action", action),
-			slog.String("entity_id", entityID),
-			slog.Any("error", err))
-	} else {
-		logger.Debug("Published event to event hub",
-			slog.String("gateway_id", gatewayID),
-			slog.String("event_type", string(eventType)),
-			slog.String("action", action),
-			slog.String("entity_id", entityID))
-	}
+	if err := s.eventHub.PublishEvent(gatewayID, event); err != nil {
+		return fmt.Errorf("publish event: %w", err)
+	}
+	logger.Debug("Published event to event hub",
+		slog.String("gateway_id", gatewayID),
+		slog.String("event_type", string(eventType)),
+		slog.String("action", action),
+		slog.String("entity_id", entityID))
+	return nil
 }

Then make Update/Delete either return that error or execute the legacy inline path when EventHub-driven sync is unavailable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/service/restapi/service.go` around lines 494 -
533, The publishEvent helper (publishEvent) currently treats missing eventHub,
blank GatewayID (systemConfig.Controller.Server.GatewayID), or PublishEvent
errors as best-effort; change this so Update and Delete do not silently succeed
when event-driven sync is unavailable: have publishEvent return an error instead
of void and surface failures (s.eventHub == nil, gatewayID == "", or
s.eventHub.PublishEvent(...) returning err) to callers; then update the Update
and Delete handlers to either return that error to the client or, if you need
backwards compatibility, invoke the legacy inline sync path (the previous inline
xDS/memory update logic) as a fallback when publishEvent fails. Ensure
references to eventhub.PublishEvent, eventhub.EmptyEventData, and the
publishEvent call sites in Update/Delete are updated accordingly.
gateway/gateway-controller/pkg/utils/api_key.go (4)

457-472: ⚠️ Potential issue | 🔴 Critical

Don’t delete the DB record before revoke propagation succeeds.

Line 460 removes the row before Line 472 publishes DELETE. If propagation fails, recoverability is reduced and retries may not have authoritative state to reconcile from.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 457 - 472, The
code currently calls s.db.RemoveAPIKeyAPIAndName(...) before publishing the
DELETE event (publishAPIKeyEvent), which risks losing authoritative state if
event propagation fails; change the flow so the DB deletion is performed only
after a successful publish (check the return/error from publishAPIKeyEvent or
the eventHub publish path on s.eventHub), and if publish fails, log the failure
and abort deletion (or enqueue a retry/backoff) so the record remains available
for reconciliation; specifically relocate or conditionally execute the call to
RemoveAPIKeyAPIAndName around the publishAPIKeyEvent/s.eventHub logic and handle
publish errors explicitly.

177-195: ⚠️ Potential issue | 🔴 Critical

Surface EventHub publish failures to callers.

Line 188 only logs and continues. In EventHub mode this is the propagation path; swallowing failures can return success while replicas never receive the key change.

Proposed fix
-func (s *APIKeyService) publishAPIKeyEvent(action, apiID, keyID, correlationID string, logger *slog.Logger) {
+func (s *APIKeyService) publishAPIKeyEvent(action, apiID, keyID, correlationID string, logger *slog.Logger) error {
 	if s.eventHub == nil {
-		return
+		return nil
 	}
 	event := eventhub.Event{
 		EventType: eventhub.EventTypeAPIKey,
 		Action:    action,
 		EntityID:  eventhub.BuildAPIKeyEntityID(apiID, keyID),
 		EventID:   correlationID,
 		EventData: eventhub.EmptyEventData,
 	}
 	if err := s.eventHub.PublishEvent(s.gatewayID, event); err != nil {
 		logger.Error("Failed to publish API key event",
 			slog.String("action", action),
 			slog.String("api_id", apiID),
 			slog.String("key_id", keyID),
 			slog.Any("error", err))
+		return err
 	}
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 177 - 195, The
publishAPIKeyEvent function currently swallows eventHub.PublishEvent failures
(in APIKeyService.publishAPIKeyEvent) which can mask replication problems;
change publishAPIKeyEvent to return an error (instead of only logging) and
propagate the error from eventHub.PublishEvent back to callers (e.g., the
functions that call publishAPIKeyEvent such as
CreateAPIKey/UpdateAPIKey/DeleteAPIKey in APIKeyService) so callers can abort
and return failure to the client; keep the existing log but also return the err
from eventHub.PublishEvent (include gatewayID, action, apiID, keyID in the log)
and update all call sites to handle and return that error.

863-870: ⚠️ Potential issue | 🔴 Critical

Rollback regeneration by restoring the previous key, not deleting it.

After DB update, deleting on rollback (Line 865) drops the record entirely. Restore existingKey in DB instead of RemoveAPIKeyAPIAndName.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 863 - 870, The
rollback currently deletes the regenerated record (calling
RemoveAPIKeyAPIAndName with regeneratedKey), which drops the API key instead of
restoring the prior state; change the rollback to re-insert or update the DB
with the prior existingKey (use your DB restore/upsert API such as
SaveAPIKey/UpsertAPIKey/InsertAPIKey on s.db) passing the existingKey object (or
its fields) so the previous key is restored rather than removed, and keep the
error logging (logger.Error / slog.Any) if that restore call fails; replace the
call to RemoveAPIKeyAPIAndName(regeneratedKey.ArtifactUUID, regeneratedKey.Name)
with the appropriate s.db method that restores existingKey.

1356-1371: ⚠️ Potential issue | 🟠 Major

Preserve immutable API key metadata when rebuilding key models.

Both struct rebuilds omit immutable metadata, so Issuer/ExternalRefId can be cleared during update/regeneration. Copy them from existingKey.

Proposed fix
 	updatedKey := &models.APIKey{
 		UUID:         existingKey.UUID,
 		Name:         existingKey.Name,
 		DisplayName:  displayName,
 		APIKey:       hashedAPIKeyValue,
 		MaskedAPIKey: maskedAPIKeyValue,
 		ArtifactUUID: existingKey.ArtifactUUID,
 		Operations:   operations,
 		Status:       models.APIKeyStatusActive,
 		CreatedAt:    existingKey.CreatedAt,
 		CreatedBy:    existingKey.CreatedBy,
 		UpdatedAt:    now,
 		ExpiresAt:    expiresAt,
 		Unit:         unit,
 		Duration:     duration,
 		Source:       existingKey.Source,
+		Issuer:       existingKey.Issuer,
+		ExternalRefId: existingKey.ExternalRefId,
 	}
 	regeneratedKey := &models.APIKey{
 		UUID:         existingKey.UUID,
 		Name:         existingKey.Name,
 		APIKey:       hashedAPIKeyValue,
 		MaskedAPIKey: maskedAPIKeyValue,
 		ArtifactUUID: existingKey.ArtifactUUID,
 		Operations:   existingKey.Operations,
 		Status:       models.APIKeyStatusActive,
 		CreatedAt:    existingKey.CreatedAt,
 		CreatedBy:    existingKey.CreatedBy,
 		UpdatedAt:    now,
 		ExpiresAt:    expiresAt,
 		Unit:         unit,
 		Duration:     duration,
 		Source:       existingKey.Source,
+		Issuer:       existingKey.Issuer,
+		ExternalRefId: existingKey.ExternalRefId,
 	}

Based on learnings, in gateway/gateway-controller/pkg/utils/api_key.go, Issuer and ExternalRefId must remain immutable and be copied from existingKey during updates.

Also applies to: 1489-1504

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_key.go` around lines 1356 - 1371,
When rebuilding the API key model, preserve immutable metadata by copying Issuer
and ExternalRefId from the existingKey into the new struct literal (add Issuer:
existingKey.Issuer and ExternalRefId: existingKey.ExternalRefId). Apply this
change in both places where the key model is reconstructed (the struct literal
around UUID/Name/APIKey/MaskedAPIKey/etc. and the second rebuild at the later
block referenced), ensuring the new object keeps existingKey.Issuer and
existingKey.ExternalRefId so they are not cleared on update/regeneration.
gateway/gateway-controller/pkg/utils/api_deployment.go (1)

107-125: ⚠️ Potential issue | 🔴 Critical

Make EventHub publish failure fail the deployment operation.

Line 118 logs PublishEvent failures but continues. In EventHub mode, that can persist the write while no replica applies it. publishEvent should return an error, and callers (for example Line 364) should propagate it so the request fails.

Proposed fix
-func (s *APIDeploymentService) publishEvent(eventType eventhub.EventType, action, entityID, correlationID string, logger *slog.Logger) {
+func (s *APIDeploymentService) publishEvent(eventType eventhub.EventType, action, entityID, correlationID string, logger *slog.Logger) error {
 	if s.eventHub == nil {
-		return
+		return nil
 	}
 	event := eventhub.Event{
 		EventType: eventType,
 		Action:    action,
 		EntityID:  entityID,
 		EventID:   correlationID,
 		EventData: eventhub.EmptyEventData,
 	}
 	if err := s.eventHub.PublishEvent(s.gatewayID, event); err != nil {
 		logger.Error("Failed to publish event",
 			slog.String("event_type", string(eventType)),
 			slog.String("action", action),
 			slog.String("entity_id", entityID),
 			slog.Any("error", err))
+		return err
 	}
+	return nil
 }
-		s.publishEvent(eventhub.EventTypeAPI, action, apiID, params.CorrelationID, params.Logger)
+		if err := s.publishEvent(eventhub.EventTypeAPI, action, apiID, params.CorrelationID, params.Logger); err != nil {
+			return nil, fmt.Errorf("failed to publish deployment event: %w", err)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/utils/api_deployment.go` around lines 107 -
125, The publishEvent helper (APIDeploymentService.publishEvent) currently
swallows eventHub.PublishEvent errors after logging; change publishEvent to
return error and propagate the error from eventHub.PublishEvent instead of only
logging, and update all callers (places that call s.publishEvent, e.g., the
deployment path that currently ignores its result) to check the returned error
and return/propagate it so the deployment request fails on publish errors;
ensure you still log the failure with logger.Error but also return the error up
the call chain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/eventhub/eventhub_test.go`:
- Around line 34-37: The tests use a private in-memory SQLite DSN (":memory:")
which yields separate databases per connection and causes flaky behavior when
Initialize() spawns pollLoop() and cleanupLoop() that open additional
connections; change setupTestDB to open a shared in-memory DB (e.g. use a URI
like "file::memory:?mode=memory&cache=shared" or
"file:testdb?mode=memory&cache=shared" with the sqlite driver’s URI flag) and/or
constrain the connection pool by calling db.SetMaxOpenConns(1) after sql.Open to
ensure all goroutines see the same schema and avoid multiple private in-memory
databases in pollLoop/cleanupLoop/Initialize.

In `@common/eventhub/sqlbackend.go`:
- Around line 233-250: Normalize gatewayID once at the start of RegisterGateway
(e.g., g := strings.TrimSpace(gatewayID)) and use that normalized variable for
the insert (insertGatewayStmt.Exec), the existence check
(getGatewayStateStmt.QueryRow) and the in-memory registration
(registry.register) and error messages; ensure comparisons against
ErrGatewayAlreadyExists use the normalized value so DB and registry use the
identical key.
- Around line 317-325: The branch that handles result.RowsAffected() == 0
returns a new error without setting the function's named err variable or rolling
back tx, so the deferred rollback/commit logic isn't triggered and the
transaction leaks; fix by either calling tx.Rollback() before returning or
assigning the error to the named return variable (e.g., err =
fmt.Errorf("gateway %q is not registered", gatewayID)) and then returning err so
the existing defer sees a non-nil err and rolls back; locate the check after
Exec on updateGatewayVersionStmt and modify that branch accordingly.

In `@gateway/gateway-controller/pkg/utils/api_key_test.go`:
- Around line 872-884: The helper newTestSQLiteStorage opens a real SQLite
storage and never closes it; register a t.Cleanup to close the returned storage
to avoid leaking file descriptors (use t.Cleanup(func(){ if err := db.Close();
err != nil { t.Fatalf("failed to close sqlite storage: %v", err) }})). Reference
newTestSQLiteStorage, t.Cleanup, db.Close, and the storage.Storage value to
locate and update the helper.

In `@gateway/gateway-controller/pkg/utils/api_key.go`:
- Around line 161-174: In getAPIConfigByHandle: avoid dereferencing s.db when
it's nil and respect the incoming kind parameter instead of hardcoding
models.KindRestApi; first try the DB only if s.db != nil (calling
s.db.GetConfigByKindAndHandle(kind, handle)) and if s.db is nil or returns a
not-found, fall back to the in-memory store (e.g.,
s.store.GetConfigByKindAndHandle(kind, handle)); preserve the existing error
handling (wrap non-not-found DB errors, translate storage.IsNotFoundError to
storage.ErrNotFound) and return the found *models.StoredConfig or
storage.ErrNotFound accordingly.

---

Duplicate comments:
In `@common/eventhub/eventhub_test.go`:
- Around line 40-57: The test DDL in eventhub_test.go defines gateway_states and
events but omits the production constraints — add a foreign key on
events.gateway_id referencing gateway_states(gateway_id) and create the
(gateway_id, processed_timestamp) index so tests use the real schema;
specifically update the CREATE TABLE for events to include "FOREIGN
KEY(gateway_id) REFERENCES gateway_states(gateway_id)" and after the tables run
"CREATE INDEX IF NOT EXISTS idx_events_gateway_processed ON events(gateway_id,
processed_timestamp)"; make the same changes in the second helper DDL block as
well so both occurrences match production.

In `@common/eventhub/sqlbackend.go`:
- Around line 562-597: The current loop treats events as delivered even when
gw.subscribers is empty, causing gw.knownVersion and gw.lastPolled to advance
and skip events; change the logic in the delivery loop to detect zero
subscribers (e.g., check len(gw.subscribers)==0 or use
subscriberChannelsAvailable(subscribers) before counting deliveries) and avoid
updating deliveredCount, latestDeliveredTimestamp, gw.knownVersion or
gw.lastPolled when there are no subscribers; preserve setting of deliveryBlocked
and only update gw.lastPolled/knownVersion inside the b.registry.mu.Lock block
when deliveredCount>0 (or latestDeliveredTimestamp non-zero and subscribers
existed) so cursor advances only when actual deliveries to gw.subscribers
occurred.

In `@gateway/gateway-controller/pkg/service/restapi/service.go`:
- Around line 494-533: The publishEvent helper (publishEvent) currently treats
missing eventHub, blank GatewayID (systemConfig.Controller.Server.GatewayID), or
PublishEvent errors as best-effort; change this so Update and Delete do not
silently succeed when event-driven sync is unavailable: have publishEvent return
an error instead of void and surface failures (s.eventHub == nil, gatewayID ==
"", or s.eventHub.PublishEvent(...) returning err) to callers; then update the
Update and Delete handlers to either return that error to the client or, if you
need backwards compatibility, invoke the legacy inline sync path (the previous
inline xDS/memory update logic) as a fallback when publishEvent fails. Ensure
references to eventhub.PublishEvent, eventhub.EmptyEventData, and the
publishEvent call sites in Update/Delete are updated accordingly.

In `@gateway/gateway-controller/pkg/utils/api_deployment.go`:
- Around line 107-125: The publishEvent helper
(APIDeploymentService.publishEvent) currently swallows eventHub.PublishEvent
errors after logging; change publishEvent to return error and propagate the
error from eventHub.PublishEvent instead of only logging, and update all callers
(places that call s.publishEvent, e.g., the deployment path that currently
ignores its result) to check the returned error and return/propagate it so the
deployment request fails on publish errors; ensure you still log the failure
with logger.Error but also return the error up the call chain.

In `@gateway/gateway-controller/pkg/utils/api_key.go`:
- Around line 457-472: The code currently calls s.db.RemoveAPIKeyAPIAndName(...)
before publishing the DELETE event (publishAPIKeyEvent), which risks losing
authoritative state if event propagation fails; change the flow so the DB
deletion is performed only after a successful publish (check the return/error
from publishAPIKeyEvent or the eventHub publish path on s.eventHub), and if
publish fails, log the failure and abort deletion (or enqueue a retry/backoff)
so the record remains available for reconciliation; specifically relocate or
conditionally execute the call to RemoveAPIKeyAPIAndName around the
publishAPIKeyEvent/s.eventHub logic and handle publish errors explicitly.
- Around line 177-195: The publishAPIKeyEvent function currently swallows
eventHub.PublishEvent failures (in APIKeyService.publishAPIKeyEvent) which can
mask replication problems; change publishAPIKeyEvent to return an error (instead
of only logging) and propagate the error from eventHub.PublishEvent back to
callers (e.g., the functions that call publishAPIKeyEvent such as
CreateAPIKey/UpdateAPIKey/DeleteAPIKey in APIKeyService) so callers can abort
and return failure to the client; keep the existing log but also return the err
from eventHub.PublishEvent (include gatewayID, action, apiID, keyID in the log)
and update all call sites to handle and return that error.
- Around line 863-870: The rollback currently deletes the regenerated record
(calling RemoveAPIKeyAPIAndName with regeneratedKey), which drops the API key
instead of restoring the prior state; change the rollback to re-insert or update
the DB with the prior existingKey (use your DB restore/upsert API such as
SaveAPIKey/UpsertAPIKey/InsertAPIKey on s.db) passing the existingKey object (or
its fields) so the previous key is restored rather than removed, and keep the
error logging (logger.Error / slog.Any) if that restore call fails; replace the
call to RemoveAPIKeyAPIAndName(regeneratedKey.ArtifactUUID, regeneratedKey.Name)
with the appropriate s.db method that restores existingKey.
- Around line 1356-1371: When rebuilding the API key model, preserve immutable
metadata by copying Issuer and ExternalRefId from the existingKey into the new
struct literal (add Issuer: existingKey.Issuer and ExternalRefId:
existingKey.ExternalRefId). Apply this change in both places where the key model
is reconstructed (the struct literal around UUID/Name/APIKey/MaskedAPIKey/etc.
and the second rebuild at the later block referenced), ensuring the new object
keeps existingKey.Issuer and existingKey.ExternalRefId so they are not cleared
on update/regeneration.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e41ed40-d892-45f1-91b6-fb54727e62c1

📥 Commits

Reviewing files that changed from the base of the PR and between bf2a4de and f64ffdd.

📒 Files selected for processing (11)
  • common/eventhub/eventhub_test.go
  • common/eventhub/sqlbackend.go
  • gateway/gateway-controller/pkg/api/handlers/handlers.go
  • gateway/gateway-controller/pkg/api/handlers/handlers_test.go
  • gateway/gateway-controller/pkg/controlplane/api_deleted_test.go
  • gateway/gateway-controller/pkg/controlplane/client_integration_test.go
  • gateway/gateway-controller/pkg/controlplane/controlplane_test.go
  • gateway/gateway-controller/pkg/service/restapi/service.go
  • gateway/gateway-controller/pkg/utils/api_deployment.go
  • gateway/gateway-controller/pkg/utils/api_key.go
  • gateway/gateway-controller/pkg/utils/api_key_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • gateway/gateway-controller/pkg/api/handlers/handlers_test.go

Comment thread common/eventhub/eventhub_test.go
Comment thread common/eventhub/sqlbackend.go
Comment thread common/eventhub/sqlbackend.go
Comment thread gateway/gateway-controller/pkg/utils/api_key_test.go
Comment thread gateway/gateway-controller/pkg/utils/api_key.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
gateway/gateway-controller/pkg/api/handlers/handlers.go (1)

173-179: Consider simplifying the status logic and clarifying DeployedAt semantics.

The fix correctly preserves StatusUndeployed, but:

  1. The first branch is redundant—no need to reassign the same value.
  2. DeployedAt and DeployedVersion are set even when the status is Undeployed, which may be semantically misleading.
♻️ Suggested simplification
 	if success {
-		if cfg.Status == models.StatusUndeployed {
-			cfg.Status = models.StatusUndeployed
-		} else {
+		if cfg.Status != models.StatusUndeployed {
 			cfg.Status = models.StatusDeployed
+			cfg.DeployedAt = &now
 		}
-		cfg.DeployedAt = &now
 		cfg.DeployedVersion = version

Alternatively, if DeployedAt is intentionally used to track "last successful xDS sync" regardless of deploy/undeploy, consider adding a brief comment to clarify this design decision.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gateway/gateway-controller/pkg/api/handlers/handlers.go` around lines 173 -
179, The status branch is redundant and sets DeployedAt/DeployedVersion even
when cfg.Status is undeployed; simplify by removing the no-op assignment (do not
set cfg.Status = models.StatusUndeployed when it already is) and only set
cfg.Status = models.StatusDeployed, cfg.DeployedAt and cfg.DeployedVersion when
the new state is a deployment; if DeployedAt is intended to record the last xDS
sync regardless of deploy/undeploy, leave the assignments but add a clarifying
comment above cfg.DeployedAt/cfg.DeployedVersion explaining that semantic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@gateway/gateway-controller/pkg/api/handlers/handlers.go`:
- Around line 173-179: The status branch is redundant and sets
DeployedAt/DeployedVersion even when cfg.Status is undeployed; simplify by
removing the no-op assignment (do not set cfg.Status = models.StatusUndeployed
when it already is) and only set cfg.Status = models.StatusDeployed,
cfg.DeployedAt and cfg.DeployedVersion when the new state is a deployment; if
DeployedAt is intended to record the last xDS sync regardless of
deploy/undeploy, leave the assignments but add a clarifying comment above
cfg.DeployedAt/cfg.DeployedVersion explaining that semantic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a6e41eb8-1916-4004-9a87-d652f4110829

📥 Commits

Reviewing files that changed from the base of the PR and between f64ffdd and 7db5b0e.

📒 Files selected for processing (1)
  • gateway/gateway-controller/pkg/api/handlers/handlers.go

Comment thread gateway/gateway-controller/Makefile Outdated
--build-context sdk=../../sdk \
--build-context common=../../common \
--platform linux/amd64,linux/arm64 \
--platform linux/arm64 \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an intended change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops my bad, it is not

@Krishanx92 Krishanx92 merged commit db2f756 into wso2:main Mar 17, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants