
Performance Benchmarks

Performance analysis of NFTBan from the v1.18.0 transport benchmarks through the v1.32.0 cache-first architecture, with real kernel-level measurements across 5 lab servers and kernels 5.14, 6.1, 6.8, and 6.12.


Table of Contents

  1. Executive Summary
  2. Test Environments
  3. v1.32.0 Cache-First Counting
  4. Interactive Ban Latency
  5. Concurrent Stress Test
  6. IPC Transport Performance
  7. Netlink Batch Insert Performance
  8. OS Distribution Comparison
  9. Scalability Guidance
  10. Architecture Validation
  11. Recommendations

Executive Summary

Current Findings (v1.32.0)

| Metric | Value | Notes |
|---|---|---|
| Set count read (cache) | 30-53ms | File read, O(1), kernel-independent |
| Set count read (kernel) | 13,270-19,300ms | nft list set on 437K entries, varies by kernel |
| Cache vs kernel speedup | 300-640x | Eliminates routine kernel reads |
| IPC scale query | 239-1,867ms | Daemon counter via socket, varies by kernel |
| Ban latency (empty set) | 77-82ms | Hash set or small interval set |
| Ban latency (437K interval) | 139-198ms | Single op, no contention |
| Ban latency (437K, concurrent) | 73-118s | Known limitation: kernel O(n) interval tree |
| Daemon memory (437K entries) | 388-546 MB | RSS |
| Cache reads during ban | 100% success | Non-blocking, zero failures |
| Daemon survival (stress) | 6/6 runs | No crashes, no SSH drops across all kernels |

Transport Findings (v1.18.0, still valid)

| Metric | Value | Notes |
|---|---|---|
| IPC Ban Latency | ~90us | Unix socket round-trip |
| IPC Throughput | ~11,100 bans/sec | Single-threaded |
| Netlink Throughput | 9,000-16,000 elem/sec | Varies by kernel |
| Optimal Batch Size | 5,000 elements | Best throughput |

Scalability Guidance (Current)

  • v1.32.0 fixes observability: routine counting uses daemon cache (30-53ms), not kernel (13-19s)
  • v1.32.0 does NOT fix write-path: single-IP ban into 437K interval set = 73-118s under concurrent load
  • 500K is a practical bulk-feed target for batch operations, not for interactive single-IP inserts
  • Interactive/manual bans must not share the same huge interval set as feeds (v1.33.0 delivered)
  • 1M+ is feasible for bulk-only staged loads on stronger systems, but is not a blanket guarantee

Test Environments

Lab Server Specifications (March 2026)

| Server | OS | Kernel | CPU | MHz | RAM | L3 Cache |
|---|---|---|---|---|---|---|
| lab | Debian 12 | 6.1.0-38-amd64 | Intel Xeon Skylake (2 vCPU) | 2295 | 3.7 GiB | 16 MiB |
| lab (prev.) | Debian 13 | 6.12.41+deb13-cloud | Intel Xeon Skylake (2 vCPU) | 2295 | 3.7 GiB | 16 MiB |
| lab1 | Rocky 10.1 | 6.12.0-124.21.1.el10_1 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
| lab2 | Ubuntu 24.04.3 | 6.8.0-71-generic | AMD EPYC-Rome (2 vCPU) | 2495 | 3.7 GiB | 16 MiB |
| lab3 | AlmaLinux 9.7 | 5.14.0-611.16.1.el9_7 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
| lab4 | AlmaLinux 9.7 | 5.14.0-611.13.1.el9_7 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |

All servers: 1 NUMA node, no HT/SMT exposed.

Note: lab uses Intel Xeon Skylake (200 MHz slower than AMD EPYC-Rome on lab1-4). This affects absolute timings but not relative comparisons on the same server. For fair cross-distro comparison, tests should run on the same hardware with different OS installations.

Software Versions

  • Go: 1.24.0 (daemon build, CGO_ENABLED=0 static binary)
  • NFTBan: v1.32.0 (cache-first architecture)
  • Test data: lab has 437,441 real entries (FIREHOL_PROXIES + TOR_EXITS feeds)

v1.32.0 Cache-First Counting

The core v1.32.0 change replaces routine nft list set kernel calls with daemon-owned in-memory counters and a JSON cache file.

Cache vs Kernel Read Comparison (lab, 437K entries)

| Method | Debian 12 (6.1) | Debian 13 (6.12) | Relative |
|---|---|---|---|
| Cache file read (/run/nftban/set_counts.json) | 30-33ms | 36-53ms | 1x (baseline) |
| IPC daemon counter (nftban scale) | 1,407-1,867ms | 239-296ms | ~6-50x |
| Kernel (nft list set) | 19,300ms | 13,270-13,786ms | 300-640x slower |

Key finding: Kernel nft list set on Debian 12 (kernel 6.1) is 30% slower than Debian 13 (kernel 6.12) for the same 437K set. Cache reads are similar across kernels. This confirms the cache-first approach is critical — it eliminates the kernel version dependency entirely.

What Changed

| Consumer | Before (v1.31) | After (v1.32) |
|---|---|---|
| Unified exporter | nft list set (13-19s) | Cache file read (30-53ms) |
| Prometheus exporter | nft -j list set (13-19s) | Daemon counter (instant) |
| nftban scale CLI | N/A | Daemon IPC (260ms) |
| nftban stats | nft list set | Daemon counter |
| Watchdog | GetSetElements() | GetSetCount() |

Cache File

Path: /run/nftban/set_counts.json

Written by the daemon via atomic rename, at most once every 10 seconds. All consumers (exporters, CLI, shell scripts) read this file instead of querying the kernel.
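
For shell consumers, reading the cache is a single small JSON parse. The exact schema is documented in Large Set Management; the sketch below assumes only what is visible elsewhere on this page (a sets map with a per-set count, the same path the Python one-liner under Running Benchmarks uses), so treat the jq paths as illustrative rather than a stable contract.

# Count for one set (same field path as the Python one-liner under Running Benchmarks)
jq -r '.sets.blacklist_ipv4.count' /run/nftban/set_counts.json

# All tracked sets and their counts (assumes each entry under .sets carries a .count field)
jq -r '.sets | to_entries[] | "\(.key)\t\(.value.count)"' /run/nftban/set_counts.json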

See Large Set Management for full architecture details.


Interactive Ban Latency

Single nftban ban <ip> measurements at different set sizes.

By Set Size (Single Operation, No Contention)

| Set Size | Distro | Kernel | Ban Latency | Scale Level |
|---|---|---|---|---|
| 0 (empty) | Rocky 10.1 | 6.12.0 | 78-82ms | NORMAL |
| 18 | Rocky 10.1 | 6.12.0 | 78-82ms | NORMAL |
| 1,064 | AlmaLinux 9.7 | 5.14.0 | 78-82ms | NORMAL |
| 437,441 | Debian 12 | 6.1.0 | 134-150ms | EXTREME |
| 437,441 | Debian 13 | 6.12.41 | 139-198ms | EXTREME |

Under Concurrent Load (437K entries, 4 ban workers)

| Scenario | Ban Latency | Notes |
|---|---|---|
| Single ban, no contention | 139-198ms | Kernel interval tree O(n) |
| 4 concurrent ban workers | 73-113s | IPC timeout, kernel serialization |
| Ban + maintenance concurrent | 73-113s | Maintenance alone takes 15 min |
| Unban, concurrent | 196-198s | Kernel deletion even slower |

Root cause: nftables interval sets use a kernel interval tree with O(n) insert/delete. At 437K entries, a single add or delete traverses the entire tree. Under concurrent load, operations serialize at the kernel level.

Fix (v1.33.0 delivered): Separate hash sets for manual/interactive bans (O(1) lookup/insert), interval sets reserved for bulk feed data.
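
The difference between the two set types is visible at the nft CLI level. The sketch below is illustrative only; the real set names and flags are defined by v1.33.0 and not shown here. It demonstrates that a plain hash set simply omits the interval flag that forces the O(n) range handling.

# Throwaway table for illustration; the real NFTBan sets live in table ip nftban
nft add table ip nftban_demo

# Interval set: holds CIDR ranges; the kernel stores each range as a start/end element pair
nft add set ip nftban_demo feeds_v4 '{ type ipv4_addr; flags interval; }'

# Hash set: single addresses, O(1) insert/lookup; the timeout flag enables per-element expiry
nft add set ip nftban_demo manual_v4 '{ type ipv4_addr; flags timeout; }'

# A single-IP ban into the hash set stays fast regardless of how large the feed set grows
nft add element ip nftban_demo manual_v4 '{ 203.0.113.7 timeout 24h }'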


Concurrent Stress Test

3-minute stress test with 13 concurrent workers: 4 ban, 2 unban, 2 scale, 4 exporter, 1 maintenance.

Results by Server

| Server | OS | Kernel | Set Size | Ban Avg | Ban Max | Scale Avg | Daemon RSS | Survived |
|---|---|---|---|---|---|---|---|---|
| lab | Debian 12 | 6.1.0 | 437K | 99-118s | 118s | 6.2-7.4s | 388-447 MB | YES |
| lab | Debian 13 | 6.12.41 | 437K | 73-113s | 113s | 2.4-3.4s | 414-546 MB | YES |
| lab1 | Rocky 10.1 | 6.12.0 | 18 | 222ms | 661ms | 331ms | 18-20 MB | YES |
| lab2 | Ubuntu 24.04 | 6.8.0 | ~0 | 236ms | 591ms | 371ms | 16 MB | YES |
| lab3 | AlmaLinux 9 | 5.14.0 | 1K | 232ms | 666ms | 424ms | 18-20 MB | YES |
| lab4 | AlmaLinux 9 | 5.14.0 | ~0 | 201ms | 735ms | 354ms | 17-19 MB | YES |

Key Observations

  • All 6 test runs (5 servers, lab tested on both Debian 12 and 13) survived full duration with no crashes
  • No SSH connection drops on any server (the original P0-3 symptom is eliminated)
  • Cache file: 0 missing events across all servers, always readable during kernel operations
  • Maintenance script on 437K set: 899 seconds (15 minutes) — this is the exact pattern that caused 100% CPU + SSH drops pre-v1.32.0. Now routed through cache.
  • On small sets (<10K), ban/unban averages 191-236ms across all kernels — no degradation
  • Debian 12 (kernel 6.1) is ~30% slower than Debian 13 (kernel 6.12) on the same hardware for large set operations

rc=7 Failures Explained

The stress test reports ~520-564 rc=7 failures per run from exporter workers. These are not code bugs. The exporter workers run curl against http://127.0.0.1:9108/metrics, the Prometheus exporter endpoint. On lab servers where the Prometheus exporter service is not running, curl returns exit code 7 (connection refused). This validates that the test harness correctly detects when the exporter is unavailable, and confirms the exporter is not a required dependency.
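
The pattern is easy to reproduce by hand (the port number comes from the stress harness; whether an exporter is installed on a given server is site-specific):

# With the Prometheus exporter stopped or absent, curl exits with code 7 (failed to connect)
curl -sf http://127.0.0.1:9108/metrics > /dev/null
echo "curl exit code: $?"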

Verdict: What v1.32.0 Fixes and What It Does Not

PASS — Observability and counting:

  • Routine counting reads come from daemon cache (30-53ms), not kernel (13-19s)
  • Eliminates the 3-way concurrent nft list set contention that caused 100% CPU and SSH drops
  • All servers survive stress tests; cache is always available

KNOWN LIMITATION — Write-path latency on huge interval sets:

  • Single-IP ban/unban into a 437K interval set still takes 134-198ms (single) or 73-118s (concurrent)
  • This is a kernel O(n) interval tree constraint — no userspace fix exists
  • The v1.32.0 global flock prevents concurrent writes from compounding, but cannot speed up individual operations
  • v1.33.0 delivered fix: separate hash sets for manual/interactive bans (O(1) insert), interval sets reserved for batch feed data only

Maintenance Script Impact

| Server | Set Size | Maint Avg | Maint Max |
|---|---|---|---|
| lab | 437K | ~899s | ~899s |
| lab1 | 18 | 22.7s | 25.7s |
| lab2 | ~0 | 25.7s | 30.7s |
| lab3 | 1K | 29.0s | 34.5s |
| lab4 | ~0 | 30.3s | 33.2s |

The maintenance script runs shell-level nft list set calls. On the 437K set, this takes 15 minutes per invocation. v1.32.0 routes these reads through the cache file instead.
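
For count-only steps, the shell side can prefer the cache and fall back to the kernel only when the cache file is missing. This is a sketch of that pattern under the same field-path assumption as above, not an excerpt from nftban_maintenance.sh:

# Prefer the daemon cache; use the expensive kernel read only if the cache is unavailable
cache=/run/nftban/set_counts.json
if [ -r "$cache" ]; then
    count=$(jq -r '.sets.blacklist_ipv4.count' "$cache")
else
    # Slow path: 13-19 seconds at 437K entries on the kernels tested above
    count=$(nft -j list set ip nftban blacklist_ipv4 |
            jq '[.nftables[].set.elem? // empty | length] | add // 0')
fi
echo "blacklist_ipv4: ${count} entries"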


IPC Transport Performance

Measurements via Unix socket (/run/nftban/nftband.sock). These v1.18.0 benchmarks remain valid for the transport layer.

| Operation | Latency | Throughput |
|---|---|---|
| Ping | 57us | 17,600 ops/sec |
| Ban IP | 90us | 11,100 ops/sec |
| Unban IP | ~85us | ~11,700 ops/sec |

In-Memory Set Operations

| Operation | Latency | Throughput |
|---|---|---|
| Set Add | 178ns | 5.6M ops/sec |
| Set Lookup | 169ns | 5.9M ops/sec |
| Set Union (10k) | 152us | 6,600 ops/sec |
| Diff (100k) | 65ms | 15 ops/sec |

IPC round-trip (~90us) is the bottleneck for interactive operations, not in-memory set operations.
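
The ~90us figure is the socket round-trip inside the Go benchmark harness. An end-to-end nftban ban from a shell also includes process startup and the kernel write, so a rough command-line check (documentation IP, removed again afterwards) looks like this:

# End-to-end latency: CLI startup + IPC round-trip + kernel set insert
time nftban ban 203.0.113.7

# Remove the test entry again
nftban unban 203.0.113.7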


Netlink Batch Insert Performance

Real kernel measurements using nftables hash sets with timeout flag (v1.18.0 benchmarks).

AlmaLinux 9.7 (lab1, kernel 5.14.0)

| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 106ms | 9,452 elem/sec | 13.3 MB |
| 5,000 | 482ms | 10,368 elem/sec | 66.7 MB |
| 10,000 | 911ms | 10,983 elem/sec | 133.3 MB |
| 20,000 | 2,013ms | 9,936 elem/sec | 266.6 MB |

Ubuntu 24.04 LTS (lab2, kernel 6.8.0)

| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 69ms | 14,578 elem/sec | 13.4 MB |
| 5,000 | 312ms | 16,007 elem/sec | 66.8 MB |
| 10,000 | 690ms | 14,491 elem/sec | 133.6 MB |
| 20,000 | 1,313ms | 15,230 elem/sec | 267.2 MB |

AlmaLinux 9.7 (lab3, kernel 5.14.0)

| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 107ms | 9,346 elem/sec | 13.4 MB |
| 5,000 | 542ms | 9,229 elem/sec | 66.8 MB |
| 10,000 | 997ms | 10,027 elem/sec | 133.6 MB |
| 20,000 | 2,178ms | 9,182 elem/sec | 267.2 MB |
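
These figures come from the Go benchmark suite driving netlink directly (see Running Benchmarks below). A rough CLI-level analogue, useful for sanity-checking a host without the Go toolchain, is to feed one large batch to nft -f. The table and set names below are throwaway, and CLI timings include nft parsing overhead, so expect them to differ from the netlink numbers above:

# Build a 5,000-element batch file and time a single atomic load into a throwaway hash set
batch=$(mktemp)
{
  echo 'add table ip nftban_bench'
  echo 'add set ip nftban_bench bench_v4 { type ipv4_addr; flags timeout; }'
  for i in $(seq 0 4999); do
    echo "add element ip nftban_bench bench_v4 { 10.0.$(( i / 256 )).$(( i % 256 )) timeout 10m }"
  done
} > "$batch"
time nft -f "$batch"

# Clean up the throwaway table and batch file
nft delete table ip nftban_bench
rm -f "$batch"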

OS Distribution Comparison

Throughput Summary

| OS | Kernel | Batch Throughput | Ban Latency | Notes |
|---|---|---|---|---|
| Debian 13 | 6.12.41 | Not yet measured | 139-198ms (437K set) | Newest kernel |
| Rocky 10.1 | 6.12.0 | Not yet measured | 78-82ms (small set) | Newest RHEL |
| Ubuntu 24.04 | 6.8.0 | ~15,000 elem/sec | 236ms avg (small set) | Newer kernel |
| AlmaLinux 9.7 | 5.14.0 | ~9,500 elem/sec | 200-232ms avg (small set) | RHEL 9 kernel |

Key Observations

  1. Ubuntu 24.04 is ~50% faster than AlmaLinux 9.7 for netlink batch operations
  2. Kernel version matters: 6.8.0 vs 5.14.0 shows significant improvement in netlink batching
  3. Interactive ban latency is consistent across kernels for small sets (77-236ms)
  4. Large interval sets degrade on all kernels — this is a kernel data structure issue, not distro-specific
  5. Memory usage scales linearly: ~13MB per 1,000 kernel elements across all distros

Scalability Guidance

Capacity Targets by System Size

| System | Max Bulk Entries | Max Interactive Set | Daemon RSS |
|---|---|---|---|
| 2 vCPU / 4 GB | 500K (aggregated) | 10K | ~500 MB |
| 4 vCPU / 8 GB | 1M (aggregated) | 50K | ~1 GB |
| 8+ vCPU / 16+ GB | 2M+ (aggregated) | 100K | ~2 GB |

Why 500K is Not a Universal Number

The v1.18.0 benchmarks showed 500K IPs loading in ~50 seconds via streaming batch insert. This is correct for bulk feed loading into hash sets or staged interval sets.

However, v1.32.0 investigation revealed a separate constraint:

| Operation | 10K set | 100K set | 437K set |
|---|---|---|---|
| Single ban (add) | <100ms | ~100ms | 139-198ms |
| Single ban (concurrent) | <100ms | ~1s | 73-113s |
| nft list set | <100ms | ~3s | 13.5s |
| Maintenance cycle | ~1s | ~30s | 899s |

A 500K unified interval set is not acceptable for interactive single-IP insertions under concurrent load. The fix is architectural:

  • Bulk feeds/geoban load into interval sets via batch operations (acceptable)
  • Manual/interactive bans use separate hash sets with O(1) insert (v1.33.0 delivered)
  • Observability reads from daemon cache, never from kernel (v1.32.0 implemented)

Feed Load Time Estimates (Streaming Batch)

| Feed Size | AlmaLinux (~10k/s) | Ubuntu (~15k/s) |
|---|---|---|
| 10,000 IPs | ~1 second | ~0.7 seconds |
| 50,000 IPs | ~5 seconds | ~3.3 seconds |
| 100,000 IPs | ~10 seconds | ~6.7 seconds |
| 500,000 IPs | ~50 seconds | ~33 seconds |

Memory Behavior

| Mode | Memory Usage | Description |
|---|---|---|
| Streaming batch | O(batch_size) | Fixed ~10MB regardless of file size |
| Daemon counters | O(sets) | ~1KB per tracked set |
| Cache file | O(sets) | ~1.7KB for 8 sets |
| Kernel resident | O(n) | ~13MB per 1,000 elements |

Architecture Validation

What v1.32.0 Benchmarks Confirm

| Architecture Decision | Validated By |
|---|---|
| Cache-first counting | 300-640x faster than kernel reads (30-53ms vs 13-19s), tested on 2 kernels |
| Non-blocking cache | 0 read failures during concurrent kernel writes, across 6 test runs |
| Daemon survival | All 6 test runs survived 3-min stress test (no crashes, no SSH drops) |
| IntervalEnd filtering | 874K halved to 437K (correct count) |
| Scale-adaptive intervals | EXTREME sets use 600s exporter, not 60s |
| Global flock | No concurrent nft list set during writes |
| WatchdogSec=120s | Daemon startup on 437K set takes 20-30s for reconciliation |
| Cross-kernel consistency | Tested on kernels 5.14, 6.1, 6.8, 6.12; cache-first works on all |

What v1.18.0 Benchmarks Confirm (Still Valid)

| Architecture Decision | Validated By |
|---|---|
| Async IPC | IPC overhead (90us) does not block caller |
| Single writer daemon | No lock contention, consistent throughput |
| Streaming replace_set | O(batch) memory, constant regardless of file size |
| Priority scheduling | Fast lane not blocked by bulk operations |
| 5k batch size default | Optimal throughput across tested systems |

Known Limitations

1. Kernel Interval Tree O(n) — No Userspace Fix

The nftables kernel uses an interval tree (rbtree) for interval sets (CIDR ranges). Each insert or delete traverses the full tree:

| Set Size | Single Insert | Concurrent (4 workers) |
|---|---|---|
| 10K | <100ms | <100ms |
| 100K | ~100ms | ~1s |
| 437K | 139-198ms | 73-118s |

This is a kernel data structure constraint. No userspace optimization can change it.

What is affected:

  • nftban ban <ip> into a large interval set (manual single-IP bans)
  • nftban unban <ip> from a large interval set
  • Any concurrent write operations to the same large interval set

What is NOT affected:

  • Bulk feed loading (uses batch replace_set, not single-IP insert)
  • Reading set counts (v1.32.0 cache-first)
  • Bans into small sets or hash sets (O(1))
  • Packet matching/filtering (kernel fast path, not affected by set size for hash sets)

Mitigation (current v1.32.0):

  • Global flock prevents concurrent writes from compounding
  • Observability reads never touch the kernel
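
The global flock lives inside NFTBan itself. External scripts that must write to the same sets can cooperate through a shared lock file in the usual flock(1) way; the lock path below is illustrative, not necessarily the one the daemon uses (see Large Set Management):

# Serialize an external kernel write against other writers via a shared lock file
# (lock path is an assumption for illustration; check the daemon's actual lock location)
flock /run/nftban/write.lock \
    nft add element ip nftban blacklist_ipv4 '{ 198.51.100.42 }'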

Fix (v1.33.0 delivered):

  • Separate hash sets for manual/interactive bans (O(1) insert/delete)
  • Interval sets reserved exclusively for bulk feed data (batch operations only)
  • Manual ban: ~82ms (hash set) instead of 73-118s (interval set under load)

2. IntervalEnd Markers Double Kernel Element Count

The kernel stores each CIDR range as two elements (start + end). A set with 437K logical entries has 874K kernel elements. All counting must filter IntervalEnd markers to report accurate numbers. v1.32.0+ handles this in the stats package (internal/stats/set_counters.go).

3. Maintenance Script Remains Shell-Based

The nftban_maintenance.sh script still calls nft list set directly for some operations. On 437K sets, a single maintenance cycle takes ~15 minutes. v1.32.0 routes counting reads through the cache, but any operation that needs the full element list (expiry checking) remains kernel-bound.

4. WatchdogSec Must Accommodate Startup Reconciliation

On systems with 437K+ entries, daemon startup reconciliation (verifying counters against kernel state) takes 20-30 seconds. WatchdogSec must be set to at least 120s to prevent systemd from killing the daemon before it reports ready. This is configured in nftband.service.
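
If you adjust this with a systemd drop-in rather than editing the shipped unit, the only required directive is WatchdogSec; the drop-in path follows the standard systemd convention and is not NFTBan-specific:

# Raise the watchdog window so startup reconciliation (20-30s at 437K entries) fits comfortably
mkdir -p /etc/systemd/system/nftband.service.d
printf '[Service]\nWatchdogSec=120\n' > /etc/systemd/system/nftband.service.d/watchdog.conf
systemctl daemon-reload
systemctl restart nftband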


Improvements Since Benchmarks

| Version | Change | Impact |
|---|---|---|
| v1.33.0 | Separate hash sets for manual bans (blacklist_manual_*) | Ban latency 73-118s → ~82ms on systems with very large sets |
| v1.33.0 | Set separation: feeds → interval, manual → hash | Eliminated O(n) penalty for interactive operations |
| v1.34.0 | Periodic reconciliation, schema validation | Drift detection between daemon and kernel state |
| v1.32.0-v1.39.0 | Maintenance script cache integration | 15-min cycle → seconds for count-based operations |

Recommendations

Production Configuration

# /etc/nftban/conf.d/daemon/opqueue.conf
OPQUEUE_MAX_BATCH_SIZE=5000
OPQUEUE_FLUSH_INTERVAL_MS=100
OPQUEUE_MAX_QUEUE_DEPTH=50000

Feed Refresh Strategy

| Feed Size | Recommended Interval |
|---|---|
| <50k IPs | Every 15 minutes |
| 50k-200k IPs | Every 30 minutes |
| 200k-500k IPs | Every 60 minutes |
| >500k IPs | Every 2-4 hours |

OS Selection

| Use Case | Recommended OS |
|---|---|
| Maximum throughput | Ubuntu 24.04 LTS |
| Enterprise stability | AlmaLinux 9 / RHEL 9 |
| Newest kernel features | Rocky 10.1 / Debian 13 |

Monitoring Metrics

Monitor these Prometheus metrics in production:

| Metric | Meaning |
|---|---|
| nftban_set_elements{family,set} | Element count per set |
| nftban_set_scale_level{family,set} | Scale level (0-5) |
| nftban_global_scale_level | System-wide scale |
| nftban_set_last_reconciled_seconds | Time since kernel verify |
| nftban_opqueue_pending_count | Queue depth |
| nftban_opqueue_total_applied | Operations applied |
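
A quick check that the exporter is serving these series (port 9108, as used by the stress-test exporter workers above):

# List the per-set element counts currently exposed by the Prometheus exporter
curl -s http://127.0.0.1:9108/metrics | grep '^nftban_set_elements'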

Interpreting Results

The three performance domains are independent:

  1. Bulk load throughput — batch netlink insertion speed (9-16K elem/sec). Determines feed sync time.
  2. Interactive ban latency — single-IP add/delete. Depends on set type (hash=O(1), interval=O(n)) and set size.
  3. Observability overhead — cost of counting/listing sets. v1.32.0 reduces this from O(n) kernel reads to O(1) cache reads.

A system that loads 500K IPs in 50 seconds (bulk) can still have 113-second interactive ban latency on that same set (interactive). These are not contradictory — they measure different operations.


Running Benchmarks

Transport Benchmarks (Go test suite)

# Requires root for netlink operations
sudo go test -bench=BenchmarkNetlink -benchtime=3s -v ./pkg/opqueue/...

# Quick benchmark (1k/5k only)
sudo go test -bench='BenchmarkNetlinkAddElements_(1|5)k' ./pkg/opqueue/...

v1.32.0 Stress Test

# Run on lab servers only — tests concurrent operations
DURATION=180 bash /tmp/nftban_stress.sh

Cache vs Kernel Comparison

# Cache read (v1.32.0 path)
time python3 -c "import json; print(json.load(open('/run/nftban/set_counts.json'))['sets']['blacklist_ipv4']['count'])"

# Kernel read (old path)
time nft list set ip nftban blacklist_ipv4 | wc -l

Benchmark History

Benchmarks were conducted at key architecture milestones:

  • v1.32.0 (March 2026): Cache-first counting, IntervalEnd fix, global flock, cross-kernel stress tests
  • v1.33.0 (March 2026): Set separation delivered — hash sets for interactive bans
  • v1.18.0 (February 2026): Initial IPC transport and netlink batch benchmarks



v1.32.0 benchmarks conducted 21 March 2026 across 5 lab servers (6 test runs: Debian 12, Debian 13, Rocky 10.1, Ubuntu 24.04, AlmaLinux 9 x2) with real measured data. v1.18.0 transport benchmarks conducted February 2026.
