NM-271: Scalability Improvements by abhishek9686 · Pull Request #3921 · gravitl/netmaker

abhishek9686 · 2026-03-17T14:07:36Z

Describe your changes

Provide Issue ticket number if applicable/not in title

Provide testing steps

Checklist before requesting a review

My changes affect only 10 files or less.
I have performed a self-review of my code and tested it.
If it is a new feature, I have added thorough tests, my code is <= 1450 lines.
If it is a bugfix, my code is <= 200 lines.
My functions are <= 80 lines.
I have had my code reviewed by a peer.
My unit tests pass locally.
Netmaker is awesome.

…d for schema functions;

…-163

Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>

…ller
- Add missing return after error response in updateUserAccountStatus to prevent double-response and spurious ext-client side-effects
- Use switch statements in listUsers to skip unrecognized account_status and mfa_status filter values

abhishek9686 · 2026-03-17T17:50:23Z

@tenki-reviewer

tenki-reviewer · 2026-03-17T17:50:53Z

Tenki Code Review - Complete

Files Reviewed: 29
Findings: 3

By Severity:

🔴 High: 1
🟠 Medium: 2

This PR introduces HA (multi-pod) support, a debounced peer-update worker, SQLite WAL mode, and metrics batching. Two concurrency bugs were found: a data race in InvalidateServerSettingsCache() and a flag-loss race in the peer-update debounce worker, plus a weak-credentials issue in the metrics exporter.

Files Reviewed (29 files)

Dockerfile
compose/docker-compose.pro.yml
config/config.go
controllers/gateway.go
controllers/hosts.go
controllers/node.go
database/database.go
db/sqlite.go
go.mod
logic/extpeers.go
logic/hooks.go
logic/networks.go
logic/nodes.go
logic/peers.go
logic/settings.go
main.go
mq/mq.go
mq/publishers.go
mq/serversync.go
mq/util.go
pro/auth/sync.go
pro/initialize.go
pro/logic/auto_relay.go
pro/logic/failover.go
pro/logic/metrics.go
scripts/netmaker.default.env
scripts/nm-quick.sh
servercfg/serverconf.go
utils/utils.go

tenki-reviewer

Overview

This PR is a substantial refactor targeting high-availability (K8s StatefulSet), performance (debounced peer updates, node check-in batching, peer-info caching), and observability (batched metrics export, Prometheus/Grafana optional install). The structural changes are well-designed, but three actionable issues were found.

Issues Found

1. Data Race — `logic/settings.go` `InvalidateServerSettingsCache()` (High)

InvalidateServerSettingsCache() resets the cache by replacing the package-level variable with a new atomic.Value{}. While atomic.Value's own Load/Store methods are race-safe, directly assigning the variable (serverSettingsCache = atomic.Value{}) while other goroutines call serverSettingsCache.Load() or serverSettingsCache.Store() is a data race per the Go memory model — it is a plain struct copy of a concurrently accessed variable. This is called from handleServerSync (MQTT goroutine) concurrently with HTTP handlers calling GetServerSettings(). The Go race detector will flag this.
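One way to avoid reassigning the variable is to keep a single `atomic.Value` for the lifetime of the process, store a pointer in it, and invalidate by storing a typed nil pointer. The sketch below illustrates that idea only. `ServerSettings`, `getSettings`, and `invalidateSettingsCache` are hypothetical stand-ins, not the functions in `logic/settings.go`, and the loader callback is an assumption made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ServerSettings is a hypothetical stand-in for the real settings struct.
type ServerSettings struct{ Verbosity int }

// settingsCache is never reassigned. It always holds a *ServerSettings,
// where a nil pointer means "nothing cached".
var settingsCache atomic.Value

// cachedSettings returns the cached pointer, or nil if the cache is empty
// or has been invalidated.
func cachedSettings() *ServerSettings {
	p, _ := settingsCache.Load().(*ServerSettings)
	return p
}

// getSettings returns the cached settings, calling load on a cache miss.
func getSettings(load func() ServerSettings) ServerSettings {
	if p := cachedSettings(); p != nil {
		return *p
	}
	s := load()
	settingsCache.Store(&s)
	return s
}

// invalidateSettingsCache clears the cache through Store, so it is safe to
// call while other goroutines Load or Store. A typed nil pointer is used
// because atomic.Value panics on an untyped nil.
func invalidateSettingsCache() {
	settingsCache.Store((*ServerSettings)(nil))
}

func main() {
	load := func() ServerSettings { return ServerSettings{Verbosity: 3} }
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%2 == 0 {
				invalidateSettingsCache() // concurrent invalidation
			} else {
				_ = getSettings(load) // concurrent read
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(getSettings(load).Verbosity)
}
```

Running a program like this with `go run -race` is a quick way to check that invalidation and reads no longer conflict.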

2. Race Condition — `mq/publishers.go` `StartPeerUpdateWorker()` (Medium)

In the debounce worker, the peerUpdateReplace atomic flag is read and cleared with Swap(false) after the drain loop. A PublishPeerUpdate(true) call that fires between the drain loop and the Swap will set peerUpdateReplace=true, fail to send on the already-drained channel, and then have its flag consumed by Swap(false), resulting in a broadcast that omits replacePeers=true. The fix is to Swap the flag before draining.
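The ordering recommended here can be sketched with a simplified model of the worker's shared state. This is not the code in `mq/publishers.go`: the `debouncer` type, `Publish`, `notify`, and `cycle` are hypothetical names, and the debounce timers are omitted. The re-arm step at the end of `cycle` is an extra safeguard assumed for this sketch and is not something the review mentions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// debouncer is a hypothetical, simplified model of the worker's shared state.
type debouncer struct {
	signal  chan struct{} // capacity 1: "at least one update is pending"
	replace atomic.Bool   // sticky: some caller asked for replacePeers=true
}

func newDebouncer() *debouncer {
	return &debouncer{signal: make(chan struct{}, 1)}
}

// Publish records the replace request first, then signals without blocking.
func (d *debouncer) Publish(replace bool) {
	if replace {
		d.replace.Store(true)
	}
	d.notify()
}

func (d *debouncer) notify() {
	select {
	case d.signal <- struct{}{}:
	default: // a signal is already pending
	}
}

// cycle does the bookkeeping for one flush and returns the replace value
// that the broadcast for this cycle should use.
func (d *debouncer) cycle() bool {
	replace := d.replace.Swap(false) // snapshot the flag before draining
	for drained := false; !drained; {
		select {
		case <-d.signal:
		default:
			drained = true
		}
	}
	// Assumed safeguard: if a caller set the flag after the snapshot,
	// re-arm the signal so a later cycle is guaranteed to pick it up.
	if d.replace.Load() {
		d.notify()
	}
	return replace
}

func main() {
	d := newDebouncer()
	d.Publish(false)
	d.Publish(true)
	fmt.Println(d.cycle(), d.cycle()) // prints "true false"
}
```

In this model a flag set after the snapshot is never cleared by the current cycle, so it stays visible to the next one.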

3. Weak Credentials — `mq/publishers.go` `PushAllMetricsToExporter()` (Medium)

req.SetBasicAuth(metricsUser, os.Getenv("METRICS_SECRET")) sends an empty password if METRICS_SECRET is not set. The exporter is an internal service, but proceeding with no credentials is a weak default. The function should warn and skip export if the secret is absent.

Positive Observations

SQLite WAL + busy-timeout + MaxOpenConns(1): Removing the Go-level dbMutex is correct — WAL mode with a single connection and 5 s busy timeout is the right SQLite concurrency model. The old double-locking was unnecessary overhead.
Debounced peer-update worker: The design (500 ms debounce, 3 s max-wait, semaphore-bounded goroutines) is well-thought-out and will dramatically reduce broadcast storms.
Node check-in batching (FlushNodeCheckins): Correctly batches DB writes; the swap-then-flush pattern avoids blocking incoming check-ins during the flush.
IsMasterPod() hostname parsing: The -0 suffix check correctly avoids false positives like netmaker-10 matching a naive HasSuffix check.
gzip.Writer pool: The pool with Reset(io.Discard) before Put is the correct pattern for reusing compression writers.
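The hostname check noted above can be illustrated with a small sketch. It assumes StatefulSet-style pod names of the form `<name>-<ordinal>` and treats ordinal 0 as the master. `isMasterPod` here is a hypothetical helper written for this example, not the project's `IsMasterPod()`.

```go
package main

import (
	"fmt"
	"strings"
)

// isMasterPod reports whether hostname ends in the ordinal "0" after its
// last dash, for example "netmaker-0". Comparing the whole ordinal avoids
// matching names such as "netmaker-10" that merely end in the digit 0.
func isMasterPod(hostname string) bool {
	i := strings.LastIndex(hostname, "-")
	if i < 0 {
		return false // no ordinal suffix at all
	}
	return hostname[i+1:] == "0"
}

func main() {
	for _, h := range []string{"netmaker-0", "netmaker-1", "netmaker-10", "netmaker"} {
		fmt.Printf("%s: %v\n", h, isMasterPod(h))
	}
}
```

In a real deployment the hostname would typically come from `os.Hostname()`; literals are used here so the example runs anywhere.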

logic/settings.go

mq/publishers.go

- Use pointer type in atomic.Value for serverSettingsCache to avoid replacing the variable non-atomically in InvalidateServerSettingsCache
- Swap peerUpdateReplace flag before draining the channel to prevent a concurrent replacePeers=true from being consumed by the wrong cycle

VishalDalwadi and others added 30 commits January 5, 2026 10:26

feat(go): add user schema;

172f993

feat(go): migrate to user schema;

1b725e4

feat(go): add audit fields;

efc75b6

feat(go): remove unused fields from the network model;

7abc3ec

feat(go): add network schema;

7009a71

feat(go): migrate to network schema;

6a9a2cf

refactor(go): add comment to clarify migration logic;

e82edb0

Merge branch 'master' into NM-163

2c510f1

Merge branch 'NM-163' of https://github.com/gravitl/netmaker into NM-163

aa527fb

fix(go): test failures;

0b50ebb

fix(go): test failures;

1b96b04

feat(go): change membership table to store memberships at all scopes;

7bbd58d

feat(go): add schema for access grants;

4a15156

feat(go): remove nameservers from new networks table; ensure db passe…

de4fbfa

…d for schema functions;

feat(go): set max conns for sqlite to 1;

1a27338

Merge branch 'develop' of https://github.com/gravitl/netmaker into NM…

405aff0

…-163

Merge branch 'develop' of https://github.com/gravitl/netmaker into NM…

33c234c

…-163

Merge branch 'develop' into NM-163

aeced0a

fix(go): issues updating user account status;

e6a723c

NM-236: streamline operations in HA mode

84ca8d4

NM-236: only master pod should subscribe to updates from clients

70a27c5

Merge branch 'develop' into NM-236

3d1886e

refactor(go): remove converters and access grants;

3f87f1b

refactor(go): add json tags in schema models;

ce6d3fb

refactor(go): rename file to migrate_v1_6_0.go;

aacd3c7

refactor(go): add user groups and user roles tables; use schema tables;

a00c1e9

refactor(go): inline get and list from schema package;

2d1682b

refactor(go): inline get network and list users from schema package;

ad749fe

fix(go): staticcheck issues;

43736a5

fix(go): remove test not in use; fix test case;

425b0b8

VishalDalwadi and others added 24 commits March 16, 2026 13:00

fix(go): review errors;

7a78535

fix(go): review errors;

01b3941

Update controllers/user.go

ec9ff3e

Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>

Update controllers/user.go

ad96680

Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>

NM-163: fix errors:

67febfa

Update db/types/options.go

a0c8bf5

Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>

fix(go): persist return user in event;

46b272e

Merge branch 'NM-163' of https://github.com/gravitl/netmaker into NM-163

9de4b97

Update db/types/options.go

c6cdb58

Co-authored-by: tenki-reviewer[bot] <262613592+tenki-reviewer[bot]@users.noreply.github.com>

NM-271: signal pull on ip changes

c595f4f

NM-163: duplicate lines of code

b159b61

NM-271: signal pull req on node ip change

d38a84d

fix(go): check for both min and max page size;

6a9e3bf

Merge branch 'NM-163' of https://github.com/gravitl/netmaker into NM-163

7ab813c

NM-271: refresh node object before update

a92faa6

fix(go): enclose transfer superadmin in transaction;

dddbdd6

fix(go): review errors;

9afba4e

fix(go): remove free tier checks;

baf13c5

fix(go): review fixes;

394e5a4

NM-271: streamline ip pool ops

ca33faa

NM-271: resolve merge conflicts

80763a0

NM-271: resolve merge conflicts

f1d9f95

NM-271: fix tests, set max idle conns

09d290e

tenki-reviewer bot reviewed Mar 17, 2026


logic/settings.go (outdated, resolved)

mq/publishers.go (resolved)

mq/publishers.go (outdated, resolved)

Mejdii approved these changes Mar 17, 2026


abhishek9686 merged commit 292af31 into develop Mar 17, 2026
5 checks passed

abhishek9686 merged 171 commits into develop from NM-271-v2


3 participants
