NM-271: Scalability Improvements#3921
Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>
…ller
- Add missing return after error response in updateUserAccountStatus to prevent double-response and spurious ext-client side-effects
- Use switch statements in listUsers to skip unrecognized account_status and mfa_status filter values
Tenki Code Review: Complete (29 files reviewed)
Overview
This PR is a substantial refactor targeting high-availability (K8s StatefulSet), performance (debounced peer updates, node check-in batching, peer-info caching), and observability (batched metrics export, Prometheus/Grafana optional install). The structural changes are well-designed, but three actionable issues were found.
Issues Found
1. Data Race — logic/settings.go InvalidateServerSettingsCache() (High)
InvalidateServerSettingsCache() resets the cache by replacing the package-level variable with a new atomic.Value{}. While atomic.Value's own Load/Store methods are race-safe, directly assigning the variable (serverSettingsCache = atomic.Value{}) while other goroutines call serverSettingsCache.Load() or serverSettingsCache.Store() is a data race per the Go memory model — it is a plain struct copy of a concurrently accessed variable. This is called from handleServerSync (MQTT goroutine) concurrently with HTTP handlers calling GetServerSettings(). The Go race detector will flag this.
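One way to fix this is to keep the package-level variable itself fixed and store a pointer through it, so invalidation becomes an atomic Store(nil). The sketch below uses Go 1.19's atomic.Pointer; the ServerSettings fields and the loader are hypothetical stand-ins, not the project's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ServerSettings is a stand-in for the real settings struct (hypothetical fields).
type ServerSettings struct {
	Version string
}

// serverSettingsCache holds a *ServerSettings; a nil pointer means "cache empty".
// The variable is never reassigned, so there is no racy struct copy.
var serverSettingsCache atomic.Pointer[ServerSettings]

// InvalidateServerSettingsCache clears the cache atomically instead of
// doing `serverSettingsCache = atomic.Value{}`.
func InvalidateServerSettingsCache() {
	serverSettingsCache.Store(nil)
}

// GetServerSettings returns the cached value, reloading on a cache miss.
func GetServerSettings() ServerSettings {
	if s := serverSettingsCache.Load(); s != nil {
		return *s
	}
	s := loadSettingsFromDB() // hypothetical loader
	serverSettingsCache.Store(&s)
	return s
}

func loadSettingsFromDB() ServerSettings { return ServerSettings{Version: "v1"} }

func main() {
	fmt.Println(GetServerSettings().Version) // prints "v1"
	InvalidateServerSettingsCache()
	fmt.Println(GetServerSettings().Version) // reloaded after invalidation
}
```

An alternative with the same effect is keeping atomic.Value but always storing a pointer type into it, so invalidation is `Store((*ServerSettings)(nil))` rather than replacing the variable.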
2. Race Condition — mq/publishers.go StartPeerUpdateWorker() (Medium)
In the debounce worker, the peerUpdateReplace atomic flag is Swap(false)-d after the drain loop. A PublishPeerUpdate(true) call that fires between the drain loop and the Swap will: set peerUpdateReplace=true, fail to send on the already-drained channel, and then have its flag consumed by Swap(false) — resulting in a broadcast that omits replacePeers=true. The fix is to Swap the flag before draining.
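A minimal sketch of the corrected ordering (channel shape and function names are illustrative, not the actual mq/publishers.go code): swap the flag first, then drain, so a concurrent PublishPeerUpdate(true) that arrives after the drain keeps its flag for the next cycle rather than having it silently consumed:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	peerUpdateCh      = make(chan struct{}, 1) // coalesced debounce signal (hypothetical)
	peerUpdateReplace atomic.Bool              // set by PublishPeerUpdate(true)
)

// PublishPeerUpdate mirrors the producer side: set the flag first, then signal.
func PublishPeerUpdate(replace bool) {
	if replace {
		peerUpdateReplace.Store(true)
	}
	select {
	case peerUpdateCh <- struct{}{}:
	default: // a signal is already pending; the worker will pick it up
	}
}

// runOneDebounceCycle shows the corrected ordering inside the worker.
// Returns the replacePeers value this cycle should broadcast with.
func runOneDebounceCycle() bool {
	replace := peerUpdateReplace.Swap(false) // 1. take the flag BEFORE draining
	for {                                    // 2. then drain coalesced signals
		select {
		case <-peerUpdateCh:
		default:
			return replace // 3. broadcast with the captured flag
		}
	}
}

func main() {
	PublishPeerUpdate(true)
	fmt.Println(runOneDebounceCycle()) // prints "true"
}
```

With this ordering, a producer that sets the flag after the Swap leaves it set for the next cycle, so a replacePeers=true request is delayed at worst, never lost.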
3. Weak Credentials — mq/publishers.go PushAllMetricsToExporter() (Medium)
req.SetBasicAuth(metricsUser, os.Getenv("METRICS_SECRET")) sends an empty password if METRICS_SECRET is not set. The exporter is an internal service, but proceeding with no credentials is a weak default. The function should warn and skip export if the secret is absent.
Positive Observations
- SQLite WAL + busy-timeout + MaxOpenConns(1): Removing the Go-level db Mutex is correct; WAL mode with a single connection and a 5 s busy timeout is the right SQLite concurrency model. The old double-locking was unnecessary overhead.
- Debounced peer-update worker: The design (500 ms debounce, 3 s max-wait, semaphore-bounded goroutines) is well thought out and will dramatically reduce broadcast storms.
- Node check-in batching (FlushNodeCheckins): Correctly batches DB writes; the swap-then-flush pattern avoids blocking incoming check-ins during the flush.
- IsMasterPod() hostname parsing: The -0 suffix check correctly avoids false positives like netmaker-10 matching a naive HasSuffix check.
- gzip.Writer pool: The pool with Reset(io.Discard) before Put is the correct pattern for reusing compression writers.
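The swap-then-flush pattern praised above can be sketched as follows (the map shape and function names are illustrative, not the actual FlushNodeCheckins code): the lock is held only long enough to swap in a fresh map, so the slow DB write never blocks incoming check-ins:

```go
package main

import (
	"fmt"
	"sync"
)

// pendingCheckins accumulates node check-ins between flushes (hypothetical shape).
var (
	mu              sync.Mutex
	pendingCheckins = map[string]int64{} // nodeID -> last check-in unix time
)

// RecordCheckin is called on every node check-in; it only touches the map briefly.
func RecordCheckin(nodeID string, ts int64) {
	mu.Lock()
	pendingCheckins[nodeID] = ts
	mu.Unlock()
}

// FlushNodeCheckins swaps the map out under the lock, then writes the batch
// without holding it, so check-ins arriving mid-flush land in the fresh map.
func FlushNodeCheckins(write func(map[string]int64)) {
	mu.Lock()
	batch := pendingCheckins
	pendingCheckins = map[string]int64{} // swap in a fresh map
	mu.Unlock()
	write(batch) // flush outside the lock (a batched DB write in the real code)
}

func main() {
	RecordCheckin("node-a", 100)
	RecordCheckin("node-b", 101)
	FlushNodeCheckins(func(b map[string]int64) { fmt.Println("flushed", len(b)) }) // prints "flushed 2"
}
```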
Suggested Fixes
- Use a pointer type in atomic.Value for serverSettingsCache to avoid replacing the variable non-atomically in InvalidateServerSettingsCache.
- Swap the peerUpdateReplace flag before draining the channel to prevent a concurrent replacePeers=true from being consumed by the wrong cycle.