NM-271: Scalability Improvements#3921

Merged
abhishek9686 merged 171 commits into develop from NM-271-v2
Mar 17, 2026
@abhishek9686
Member

Describe your changes

Provide Issue ticket number if applicable/not in title

Provide testing steps

Checklist before requesting a review

  • My changes affect only 10 files or less.
  • I have performed a self-review of my code and tested it.
  • If it is a new feature, I have added thorough tests, my code is <= 1450 lines.
  • If it is a bugfix, my code is <= 200 lines.
  • My functions are <= 80 lines.
  • I have had my code reviewed by a peer.
  • My unit tests pass locally.
  • Netmaker is awesome.

VishalDalwadi and others added 30 commits January 5, 2026 10:26
VishalDalwadi and others added 24 commits March 16, 2026 13:00
Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>
…ller

- Add missing return after error response in updateUserAccountStatus
  to prevent double-response and spurious ext-client side-effects
- Use switch statements in listUsers to skip unrecognized
  account_status and mfa_status filter values
@abhishek9686
Member Author

@tenki-reviewer

tenki-reviewer bot commented Mar 17, 2026

Tenki Code Review - Complete

Files Reviewed: 29
Findings: 3

By Severity:

  • 🔴 High: 1
  • 🟠 Medium: 2

This PR introduces HA (multi-pod) support, a debounced peer-update worker, SQLite WAL mode, and metrics batching. Two concurrency bugs were found: a data race in InvalidateServerSettingsCache() and a flag-loss race in the peer-update debounce worker, plus a weak-credentials issue in the metrics exporter.

Files Reviewed (29 files)
Dockerfile
compose/docker-compose.pro.yml
config/config.go
controllers/gateway.go
controllers/hosts.go
controllers/node.go
database/database.go
db/sqlite.go
go.mod
logic/extpeers.go
logic/hooks.go
logic/networks.go
logic/nodes.go
logic/peers.go
logic/settings.go
main.go
mq/mq.go
mq/publishers.go
mq/serversync.go
mq/util.go
pro/auth/sync.go
pro/initialize.go
pro/logic/auto_relay.go
pro/logic/failover.go
pro/logic/metrics.go
scripts/netmaker.default.env
scripts/nm-quick.sh
servercfg/serverconf.go
utils/utils.go

@tenki-reviewer tenki-reviewer bot left a comment

Overview

This PR is a substantial refactor targeting high-availability (K8s StatefulSet), performance (debounced peer updates, node check-in batching, peer-info caching), and observability (batched metrics export, Prometheus/Grafana optional install). The structural changes are well-designed, but three actionable issues were found.

Issues Found

1. Data Race — logic/settings.go InvalidateServerSettingsCache() (High)

InvalidateServerSettingsCache() resets the cache by replacing the package-level variable with a new atomic.Value{}. While atomic.Value's own Load/Store methods are race-safe, directly assigning the variable (serverSettingsCache = atomic.Value{}) while other goroutines call serverSettingsCache.Load() or serverSettingsCache.Store() is a data race per the Go memory model — it is a plain struct copy of a concurrently accessed variable. This is called from handleServerSync (MQTT goroutine) concurrently with HTTP handlers calling GetServerSettings(). The Go race detector will flag this.
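One way to avoid the reassignment, sketched here under assumptions: keep a single atomic.Value for the lifetime of the process, store a pointer in it, and invalidate by storing a typed nil pointer. The ServerSettings fields and the loadFromDB helper are hypothetical placeholders, not the actual code in logic/settings.go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ServerSettings is a hypothetical stand-in for the cached settings struct.
type ServerSettings struct {
	Verbosity int
}

// serverSettingsCache always holds a *ServerSettings (possibly nil).
// The variable itself is never reassigned; it is only touched via Load/Store.
var serverSettingsCache atomic.Value

// InvalidateServerSettingsCache clears the cache with an atomic Store of a
// typed nil pointer instead of overwriting the package-level variable.
func InvalidateServerSettingsCache() {
	serverSettingsCache.Store((*ServerSettings)(nil))
}

// loadFromDB is a hypothetical placeholder for the real database read.
func loadFromDB() ServerSettings {
	return ServerSettings{Verbosity: 1}
}

// GetServerSettings returns the cached value, reloading it when the cache
// is empty (never stored) or invalidated (nil pointer stored).
func GetServerSettings() ServerSettings {
	if p, ok := serverSettingsCache.Load().(*ServerSettings); ok && p != nil {
		return *p
	}
	s := loadFromDB()
	serverSettingsCache.Store(&s)
	return s
}

func main() {
	// Concurrent invalidation and reads now go through atomic operations only.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); InvalidateServerSettingsCache() }()
		go func() { defer wg.Done(); _ = GetServerSettings() }()
	}
	wg.Wait()
	fmt.Println(GetServerSettings().Verbosity)
}
```

Two concurrent readers may both reload after an invalidation; that costs an extra read but is not a data race. Running a variant like this under `go run -race` is a quick way to confirm the detector stays quiet.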

2. Race Condition — mq/publishers.go StartPeerUpdateWorker() (Medium)

In the debounce worker, the peerUpdateReplace atomic flag is Swap(false)-d after the drain loop. A PublishPeerUpdate(true) call that fires between the drain loop and the Swap will: set peerUpdateReplace=true, fail to send on the already-drained channel, and then have its flag consumed by Swap(false) — resulting in a broadcast that omits replacePeers=true. The fix is to Swap the flag before draining.
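The suggested ordering can be sketched as follows. This is a simplified illustration, not the worker in mq/publishers.go: the channel name peerUpdateSignal and the helper beginCycle are hypothetical, and the debounce timers are left out so that only the Swap-then-drain order is visible.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	// peerUpdateSignal is a hypothetical wake-up channel with room for one
	// pending signal.
	peerUpdateSignal = make(chan struct{}, 1)
	// peerUpdateReplace records whether any pending request asked for
	// replacePeers=true.
	peerUpdateReplace atomic.Bool
)

// PublishPeerUpdate sets the flag first, then tries a non-blocking wake-up.
func PublishPeerUpdate(replacePeers bool) {
	if replacePeers {
		peerUpdateReplace.Store(true)
	}
	select {
	case peerUpdateSignal <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// beginCycle claims the flag for this cycle BEFORE draining pending signals,
// and returns the replacePeers value to use for the broadcast.
func beginCycle() bool {
	replace := peerUpdateReplace.Swap(false)
	for {
		select {
		case <-peerUpdateSignal:
		default:
			return replace
		}
	}
}

func main() {
	PublishPeerUpdate(true)
	fmt.Println(beginCycle()) // true

	PublishPeerUpdate(false)
	fmt.Println(beginCycle()) // false
}
```

In this sketch, a flag set after the Swap is not cleared by the current cycle; it stays set until a later cycle claims it.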

3. Weak Credentials — mq/publishers.go PushAllMetricsToExporter() (Medium)

req.SetBasicAuth(metricsUser, os.Getenv("METRICS_SECRET")) sends an empty password if METRICS_SECRET is not set. The exporter is an internal service, but proceeding with no credentials is a weak default. The function should warn and skip export if the secret is absent.

Positive Observations

  • SQLite WAL + busy-timeout + MaxOpenConns(1): Removing the Go-level dbMutex is correct — WAL mode with a single connection and 5 s busy timeout is the right SQLite concurrency model. The old double-locking was unnecessary overhead.
  • Debounced peer-update worker: The design (500 ms debounce, 3 s max-wait, semaphore-bounded goroutines) is well-thought-out and will dramatically reduce broadcast storms.
  • Node check-in batching (FlushNodeCheckins): Correctly batches DB writes; the swap-then-flush pattern avoids blocking incoming check-ins during the flush.
  • IsMasterPod() hostname parsing: The -0 suffix check correctly avoids false positives like netmaker-10 matching a naive HasSuffix check.
  • gzip.Writer pool: The pool with Reset(io.Discard) before Put is the correct pattern for reusing compression writers.

- Use pointer type in atomic.Value for serverSettingsCache to avoid
  replacing the variable non-atomically in InvalidateServerSettingsCache
- Swap peerUpdateReplace flag before draining the channel to prevent
  a concurrent replacePeers=true from being consumed by the wrong cycle
abhishek9686 merged commit 292af31 into develop on Mar 17, 2026
5 checks passed
3 participants