
Performance Benchmarks

Performance analysis of NFTBan from the v1.18.0 transport benchmarks through the v1.32.0 cache-first architecture, with real kernel-level measurements across 5 lab servers and kernels 5.14, 6.1, 6.8, and 6.12.


Table of Contents

  1. Executive Summary
  2. Test Environments
  3. v1.32.0 Cache-First Counting
  4. Interactive Ban Latency
  5. Concurrent Stress Test
  6. IPC Transport Performance
  7. Netlink Batch Insert Performance
  8. OS Distribution Comparison
  9. Scalability Guidance
  10. Architecture Validation
  11. Recommendations

Executive Summary

Current Findings (v1.32.0)

| Metric | Value | Notes |
|---|---|---|
| Set count read (cache) | 30-53ms | File read, O(1), kernel-independent |
| Set count read (kernel) | 13,270-19,300ms | nft list set on 437K entries, varies by kernel |
| Cache vs kernel speedup | 300-640x | Eliminates routine kernel reads |
| IPC scale query | 239-1,867ms | Daemon counter via socket, varies by kernel |
| Ban latency (empty set) | 77-82ms | Hash set or small interval set |
| Ban latency (437K interval) | 139-198ms | Single op, no contention |
| Ban latency (437K, concurrent) | 73-118s | Known limitation: kernel O(n) interval tree |
| Daemon memory (437K entries) | 388-546 MB | RSS |
| Cache reads during ban | 100% success | Non-blocking, zero failures |
| Daemon survival (stress) | 6/6 runs | No crashes, no SSH drops across all kernels |

Transport Findings (v1.18.0, still valid)

| Metric | Value | Notes |
|---|---|---|
| IPC Ban Latency | ~90us | Unix socket round-trip |
| IPC Throughput | ~11,100 bans/sec | Single-threaded |
| Netlink Throughput | 9,000-16,000 elem/sec | Varies by kernel |
| Optimal Batch Size | 5,000 elements | Best throughput |

Scalability Guidance (Current)

  • v1.32.0 fixes observability: routine counting uses daemon cache (30-53ms), not kernel (13-19s)
  • v1.32.0 does NOT fix write-path: single-IP ban into 437K interval set = 73-118s under concurrent load
  • 500K is a practical bulk-feed target for batch operations, not for interactive single-IP inserts
  • Interactive/manual bans must not share the same huge interval set as feeds (v1.33.0 delivered)
  • 1M+ is feasible for bulk-only staged loads on stronger systems, but is not a blanket guarantee

Test Environments

Lab Server Specifications (March 2026)

| Server | OS | Kernel | CPU | MHz | RAM | L3 Cache |
|---|---|---|---|---|---|---|
| lab | Debian 12 | 6.1.0-38-amd64 | Intel Xeon Skylake (2 vCPU) | 2295 | 3.7 GiB | 16 MiB |
| lab (prev.) | Debian 13 | 6.12.41+deb13-cloud | Intel Xeon Skylake (2 vCPU) | 2295 | 3.7 GiB | 16 MiB |
| lab1 | Rocky 10.1 | 6.12.0-124.21.1.el10_1 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
| lab2 | Ubuntu 24.04.3 | 6.8.0-71-generic | AMD EPYC-Rome (2 vCPU) | 2495 | 3.7 GiB | 16 MiB |
| lab3 | AlmaLinux 9.7 | 5.14.0-611.16.1.el9_7 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
| lab4 | AlmaLinux 9.7 | 5.14.0-611.13.1.el9_7 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |

All servers: 1 NUMA node, no HT/SMT exposed.

Note: lab uses Intel Xeon Skylake (200 MHz slower than AMD EPYC-Rome on lab1-4). This affects absolute timings but not relative comparisons on the same server. For fair cross-distro comparison, tests should run on the same hardware with different OS installations.

Software Versions

  • Go: 1.24.0 (daemon build, CGO_ENABLED=0 static binary)
  • NFTBan: v1.32.0 (cache-first architecture)
  • Test data: lab has 437,441 real entries (FIREHOL_PROXIES + TOR_EXITS feeds)

v1.32.0 Cache-First Counting

The core v1.32.0 change replaces routine nft list set kernel calls with daemon-owned in-memory counters and a JSON cache file.

Cache vs Kernel Read Comparison (lab, 437K entries)

| Method | Debian 12 (6.1) | Debian 13 (6.12) | Relative |
|---|---|---|---|
| Cache file read (/run/nftban/set_counts.json) | 30-33ms | 36-53ms | 1x (baseline) |
| IPC daemon counter (nftban scale) | 1,407-1,867ms | 239-296ms | ~6-50x |
| Kernel (nft list set) | 19,300ms | 13,270-13,786ms | 300-640x slower |

Key finding: Kernel nft list set on Debian 12 (kernel 6.1) is 30% slower than Debian 13 (kernel 6.12) for the same 437K set. Cache reads are similar across kernels. This confirms the cache-first approach is critical — it eliminates the kernel version dependency entirely.

What Changed

| Consumer | Before (v1.31) | After (v1.32) |
|---|---|---|
| Unified exporter | nft list set (13-19s) | Cache file read (30-53ms) |
| Prometheus exporter | nft -j list set (13-19s) | Daemon counter (instant) |
| nftban scale CLI | N/A | Daemon IPC (260ms) |
| nftban stats | nft list set | Daemon counter |
| Watchdog | GetSetElements() | GetSetCount() |

Cache File

Path: /run/nftban/set_counts.json

Written by the daemon via atomic rename, at most once every 10 seconds. All consumers (exporters, CLI, shell scripts) read this file instead of querying the kernel.
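
For shell consumers, reading the cache is a single small JSON parse. The exact schema is documented in Large Set Management; the sketch below assumes only what is visible elsewhere on this page (a sets map with a per-set count, the same path the Python one-liner under Running Benchmarks uses), so treat the jq paths as illustrative rather than a stable contract.

# Count for one set (same field path as the Python one-liner under Running Benchmarks)
jq -r '.sets.blacklist_ipv4.count' /run/nftban/set_counts.json

# All tracked sets and their counts (assumes each entry under .sets carries a .count field)
jq -r '.sets | to_entries[] | "\(.key)\t\(.value.count)"' /run/nftban/set_counts.json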

See Large Set Management for full architecture details.


Interactive Ban Latency

Single nftban ban <ip> measurements at different set sizes.

By Set Size (Single Operation, No Contention)

| Set Size | Distro | Kernel | Ban Latency | Scale Level |
|---|---|---|---|---|
| 0 (empty) | Rocky 10.1 | 6.12.0 | 78-82ms | NORMAL |
| 18 | Rocky 10.1 | 6.12.0 | 78-82ms | NORMAL |
| 1,064 | AlmaLinux 9.7 | 5.14.0 | 78-82ms | NORMAL |
| 437,441 | Debian 12 | 6.1.0 | 134-150ms | EXTREME |
| 437,441 | Debian 13 | 6.12.41 | 139-198ms | EXTREME |

Under Concurrent Load (437K entries, 4 ban workers)

| Scenario | Ban Latency | Notes |
|---|---|---|
| Single ban, no contention | 139-198ms | Kernel interval tree O(n) |
| 4 concurrent ban workers | 73-113s | IPC timeout, kernel serialization |
| Ban + maintenance concurrent | 73-113s | Maintenance alone takes 15 min |
| Unban, concurrent | 196-198s | Kernel deletion even slower |

Root cause: nftables interval sets use a kernel interval tree with O(n) insert/delete. At 437K entries, a single add or delete traverses the entire tree. Under concurrent load, operations serialize at the kernel level.

Fix (v1.33.0 delivered): Separate hash sets for manual/interactive bans (O(1) lookup/insert), interval sets reserved for bulk feed data.
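
The difference between the two set types is visible at the nft CLI level. The sketch below is illustrative only; the real set names and flags are defined by v1.33.0 and not shown here. It demonstrates that a plain hash set simply omits the interval flag that forces the O(n) range handling.

# Throwaway table for illustration; the real NFTBan sets live in table ip nftban
nft add table ip nftban_demo

# Interval set: holds CIDR ranges; the kernel stores each range as a start/end element pair
nft add set ip nftban_demo feeds_v4 '{ type ipv4_addr; flags interval; }'

# Hash set: single addresses, O(1) insert/lookup; the timeout flag enables per-element expiry
nft add set ip nftban_demo manual_v4 '{ type ipv4_addr; flags timeout; }'

# A single-IP ban into the hash set stays fast regardless of how large the feed set grows
nft add element ip nftban_demo manual_v4 '{ 203.0.113.7 timeout 24h }'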


Concurrent Stress Test

3-minute stress test with 13 concurrent workers: 4 ban, 2 unban, 2 scale, 4 exporter, 1 maintenance.

Results by Server

| Server | OS | Kernel | Set Size | Ban Avg | Ban Max | Scale Avg | Daemon RSS | Survived |
|---|---|---|---|---|---|---|---|---|
| lab | Debian 12 | 6.1.0 | 437K | 99-118s | 118s | 6.2-7.4s | 388-447 MB | YES |
| lab | Debian 13 | 6.12.41 | 437K | 73-113s | 113s | 2.4-3.4s | 414-546 MB | YES |
| lab1 | Rocky 10.1 | 6.12.0 | 18 | 222ms | 661ms | 331ms | 18-20 MB | YES |
| lab2 | Ubuntu 24.04 | 6.8.0 | ~0 | 236ms | 591ms | 371ms | 16 MB | YES |
| lab3 | AlmaLinux 9 | 5.14.0 | 1K | 232ms | 666ms | 424ms | 18-20 MB | YES |
| lab4 | AlmaLinux 9 | 5.14.0 | ~0 | 201ms | 735ms | 354ms | 17-19 MB | YES |

Key Observations

  • All 6 test runs (5 servers, lab tested on both Debian 12 and 13) survived full duration with no crashes
  • No SSH connection drops on any server (the original P0-3 symptom is eliminated)
  • Cache file: 0 missing events across all servers, always readable during kernel operations
  • Maintenance script on 437K set: 899 seconds (15 minutes) — this is the exact pattern that caused 100% CPU + SSH drops pre-v1.32.0. Now routed through cache.
  • On small sets (<10K), ban/unban averages 191-236ms across all kernels — no degradation
  • Debian 12 (kernel 6.1) is ~30% slower than Debian 13 (kernel 6.12) on the same hardware for large set operations

rc=7 Failures Explained

The stress test reports ~520-564 rc=7 failures per run from exporter workers. These are not code bugs. The exporter workers run curl against http://127.0.0.1:9108/metrics, the Prometheus exporter endpoint. On lab servers where the Prometheus exporter service is not running, curl returns exit code 7 (connection refused). This validates that the test harness correctly detects when the exporter is unavailable, and confirms the exporter is not a required dependency.
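
The pattern is easy to reproduce by hand (the port number comes from the stress harness; whether an exporter is installed on a given server is site-specific):

# With the Prometheus exporter stopped or absent, curl exits with code 7 (failed to connect)
curl -sf http://127.0.0.1:9108/metrics > /dev/null
echo "curl exit code: $?"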

Verdict: What v1.32.0 Fixes and What It Does Not

PASS — Observability and counting:

  • Routine counting reads come from daemon cache (30-53ms), not kernel (13-19s)
  • Eliminates the 3-way concurrent nft list set contention that caused 100% CPU and SSH drops
  • All servers survive stress tests; cache is always available

KNOWN LIMITATION — Write-path latency on huge interval sets:

  • Single-IP ban/unban into a 437K interval set still takes 134-198ms (single) or 73-118s (concurrent)
  • This is a kernel O(n) interval tree constraint — no userspace fix exists
  • The v1.32.0 global flock prevents concurrent writes from compounding, but cannot speed up individual operations
  • v1.33.0 delivered fix: separate hash sets for manual/interactive bans (O(1) insert), interval sets reserved for batch feed data only

Maintenance Script Impact

| Server | Set Size | Maint Avg | Maint Max |
|---|---|---|---|
| lab | 437K | ~899s | ~899s |
| lab1 | 18 | 22.7s | 25.7s |
| lab2 | ~0 | 25.7s | 30.7s |
| lab3 | 1K | 29.0s | 34.5s |
| lab4 | ~0 | 30.3s | 33.2s |

The maintenance script runs shell-level nft list set calls. On the 437K set, this takes 15 minutes per invocation. v1.32.0 routes these reads through the cache file instead.
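
For count-only steps, the shell side can prefer the cache and fall back to the kernel only when the cache file is missing. This is a sketch of that pattern under the same field-path assumption as above, not an excerpt from nftban_maintenance.sh:

# Prefer the daemon cache; use the expensive kernel read only if the cache is unavailable
cache=/run/nftban/set_counts.json
if [ -r "$cache" ]; then
    count=$(jq -r '.sets.blacklist_ipv4.count' "$cache")
else
    # Slow path: 13-19 seconds at 437K entries on the kernels tested above
    count=$(nft -j list set ip nftban blacklist_ipv4 |
            jq '[.nftables[].set.elem? // empty | length] | add // 0')
fi
echo "blacklist_ipv4: ${count} entries"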


IPC Transport Performance

Measurements via Unix socket (/run/nftban/nftband.sock). These v1.18.0 benchmarks remain valid for the transport layer.

| Operation | Latency | Throughput |
|---|---|---|
| Ping | 57us | 17,600 ops/sec |
| Ban IP | 90us | 11,100 ops/sec |
| Unban IP | ~85us | ~11,700 ops/sec |

In-Memory Set Operations

| Operation | Latency | Throughput |
|---|---|---|
| Set Add | 178ns | 5.6M ops/sec |
| Set Lookup | 169ns | 5.9M ops/sec |
| Set Union (10k) | 152us | 6,600 ops/sec |
| Diff (100k) | 65ms | 15 ops/sec |

IPC round-trip (~90us) is the bottleneck for interactive operations, not in-memory set operations.
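
The ~90us figure is the socket round-trip inside the Go benchmark harness. An end-to-end nftban ban from a shell also includes process startup and the kernel write, so a rough command-line check (documentation IP, removed again afterwards) looks like this:

# End-to-end latency: CLI startup + IPC round-trip + kernel set insert
time nftban ban 203.0.113.7

# Remove the test entry again
nftban unban 203.0.113.7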


Netlink Batch Insert Performance

Real kernel measurements using nftables hash sets with timeout flag (v1.18.0 benchmarks).

AlmaLinux 9.7 (lab1, kernel 5.14.0)

| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 106ms | 9,452 elem/sec | 13.3 MB |
| 5,000 | 482ms | 10,368 elem/sec | 66.7 MB |
| 10,000 | 911ms | 10,983 elem/sec | 133.3 MB |
| 20,000 | 2,013ms | 9,936 elem/sec | 266.6 MB |

Ubuntu 24.04 LTS (lab2, kernel 6.8.0)

| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 69ms | 14,578 elem/sec | 13.4 MB |
| 5,000 | 312ms | 16,007 elem/sec | 66.8 MB |
| 10,000 | 690ms | 14,491 elem/sec | 133.6 MB |
| 20,000 | 1,313ms | 15,230 elem/sec | 267.2 MB |

AlmaLinux 9.7 (lab3, kernel 5.14.0)

| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 107ms | 9,346 elem/sec | 13.4 MB |
| 5,000 | 542ms | 9,229 elem/sec | 66.8 MB |
| 10,000 | 997ms | 10,027 elem/sec | 133.6 MB |
| 20,000 | 2,178ms | 9,182 elem/sec | 267.2 MB |
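
These figures come from the Go benchmark suite driving netlink directly (see Running Benchmarks below). A rough CLI-level analogue, useful for sanity-checking a host without the Go toolchain, is to feed one large batch to nft -f. The table and set names below are throwaway, and CLI timings include nft parsing overhead, so expect them to differ from the netlink numbers above:

# Build a 5,000-element batch file and time a single atomic load into a throwaway hash set
batch=$(mktemp)
{
  echo 'add table ip nftban_bench'
  echo 'add set ip nftban_bench bench_v4 { type ipv4_addr; flags timeout; }'
  for i in $(seq 0 4999); do
    echo "add element ip nftban_bench bench_v4 { 10.0.$(( i / 256 )).$(( i % 256 )) timeout 10m }"
  done
} > "$batch"
time nft -f "$batch"

# Clean up the throwaway table and batch file
nft delete table ip nftban_bench
rm -f "$batch"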

OS Distribution Comparison

Throughput Summary

| OS | Kernel | Batch Throughput | Ban Latency | Notes |
|---|---|---|---|---|
| Debian 13 | 6.12.41 | Not yet measured | 139-198ms (437K set) | Newest kernel |
| Rocky 10.1 | 6.12.0 | Not yet measured | 78-82ms (small set) | Newest RHEL |
| Ubuntu 24.04 | 6.8.0 | ~15,000 elem/sec | 236ms avg (small set) | Newer kernel |
| AlmaLinux 9.7 | 5.14.0 | ~9,500 elem/sec | 200-232ms avg (small set) | RHEL 9 kernel |

Key Observations

  1. Ubuntu 24.04 is ~50% faster than AlmaLinux 9.7 for netlink batch operations
  2. Kernel version matters: 6.8.0 vs 5.14.0 shows significant improvement in netlink batching
  3. Interactive ban latency is consistent across kernels for small sets (77-236ms)
  4. Large interval sets degrade on all kernels — this is a kernel data structure issue, not distro-specific
  5. Memory usage scales linearly: ~13MB per 1,000 kernel elements across all distros

Scalability Guidance

Capacity Targets by System Size

| System | Max Bulk Entries | Max Interactive Set | Daemon RSS |
|---|---|---|---|
| 2 vCPU / 4 GB | 500K (aggregated) | 10K | ~500 MB |
| 4 vCPU / 8 GB | 1M (aggregated) | 50K | ~1 GB |
| 8+ vCPU / 16+ GB | 2M+ (aggregated) | 100K | ~2 GB |

Why 500K is Not a Universal Number

The v1.18.0 benchmarks showed 500K IPs loading in ~50 seconds via streaming batch insert. This is correct for bulk feed loading into hash sets or staged interval sets.

However, v1.32.0 investigation revealed a separate constraint:

| Operation | 10K set | 100K set | 437K set |
|---|---|---|---|
| Single ban (add) | <100ms | ~100ms | 139-198ms |
| Single ban (concurrent) | <100ms | ~1s | 73-113s |
| nft list set | <100ms | ~3s | 13.5s |
| Maintenance cycle | ~1s | ~30s | 899s |

A 500K unified interval set is not acceptable for interactive single-IP insertions under concurrent load. The fix is architectural:

  • Bulk feeds/geoban load into interval sets via batch operations (acceptable)
  • Manual/interactive bans use separate hash sets with O(1) insert (v1.33.0 delivered)
  • Observability reads from daemon cache, never from kernel (v1.32.0 implemented)

Feed Load Time Estimates (Streaming Batch)

| Feed Size | AlmaLinux (~10k/s) | Ubuntu (~15k/s) |
|---|---|---|
| 10,000 IPs | ~1 second | ~0.7 seconds |
| 50,000 IPs | ~5 seconds | ~3.3 seconds |
| 100,000 IPs | ~10 seconds | ~6.7 seconds |
| 500,000 IPs | ~50 seconds | ~33 seconds |

Memory Behavior

| Mode | Memory Usage | Description |
|---|---|---|
| Streaming batch | O(batch_size) | Fixed ~10MB regardless of file size |
| Daemon counters | O(sets) | ~1KB per tracked set |
| Cache file | O(sets) | ~1.7KB for 8 sets |
| Kernel resident | O(n) | ~13MB per 1,000 elements |

Architecture Validation

What v1.32.0 Benchmarks Confirm

| Architecture Decision | Validated By |
|---|---|
| Cache-first counting | 300-640x faster than kernel reads (30-53ms vs 13-19s), tested on 2 kernels |
| Non-blocking cache | 0 read failures during concurrent kernel writes, across 6 test runs |
| Daemon survival | All 6 test runs survived 3-min stress test (no crashes, no SSH drops) |
| IntervalEnd filtering | 874K halved to 437K (correct count) |
| Scale-adaptive intervals | EXTREME sets use 600s exporter, not 60s |
| Global flock | No concurrent nft list set during writes |
| WatchdogSec=120s | Daemon startup on 437K set takes 20-30s for reconciliation |
| Cross-kernel consistency | Tested on kernels 5.14, 6.1, 6.8, 6.12; cache-first works on all |

What v1.18.0 Benchmarks Confirm (Still Valid)

| Architecture Decision | Validated By |
|---|---|
| Async IPC | IPC overhead (90us) does not block caller |
| Single writer daemon | No lock contention, consistent throughput |
| Streaming replace_set | O(batch) memory, constant regardless of file size |
| Priority scheduling | Fast lane not blocked by bulk operations |
| 5k batch size default | Optimal throughput across tested systems |

Known Limitations

1. Kernel Interval Tree O(n) — No Userspace Fix

The nftables kernel uses an interval tree (rbtree) for interval sets (CIDR ranges). Each insert or delete traverses the full tree:

| Set Size | Single Insert | Concurrent (4 workers) |
|---|---|---|
| 10K | <100ms | <100ms |
| 100K | ~100ms | ~1s |
| 437K | 139-198ms | 73-118s |

This is a kernel data structure constraint. No userspace optimization can change it.

What is affected:

  • nftban ban <ip> into a large interval set (manual single-IP bans)
  • nftban unban <ip> from a large interval set
  • Any concurrent write operations to the same large interval set

What is NOT affected:

  • Bulk feed loading (uses batch replace_set, not single-IP insert)
  • Reading set counts (v1.32.0 cache-first)
  • Bans into small sets or hash sets (O(1))
  • Packet matching/filtering (kernel fast path, not affected by set size for hash sets)

Mitigation (current v1.32.0):

  • Global flock prevents concurrent writes from compounding
  • Observability reads never touch the kernel
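
The global flock lives inside NFTBan itself. External scripts that must write to the same sets can cooperate through a shared lock file in the usual flock(1) way; the lock path below is illustrative, not necessarily the one the daemon uses (see Large Set Management):

# Serialize an external kernel write against other writers via a shared lock file
# (lock path is an assumption for illustration; check the daemon's actual lock location)
flock /run/nftban/write.lock \
    nft add element ip nftban blacklist_ipv4 '{ 198.51.100.42 }'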

Fix (v1.33.0 delivered):

  • Separate hash sets for manual/interactive bans (O(1) insert/delete)
  • Interval sets reserved exclusively for bulk feed data (batch operations only)
  • Manual ban: ~82ms (hash set) instead of 73-118s (interval set under load)

2. IntervalEnd Markers Double Kernel Element Count

The kernel stores each CIDR range as two elements (start + end). A set with 437K logical entries has 874K kernel elements. All counting must filter IntervalEnd markers to report accurate numbers. v1.32.0+ handles this in the stats package (internal/stats/set_counters.go).

3. Maintenance Script Remains Shell-Based

The nftban_maintenance.sh script still calls nft list set directly for some operations. On 437K sets, a single maintenance cycle takes ~15 minutes. v1.32.0 routes counting reads through the cache, but any operation that needs the full element list (expiry checking) remains kernel-bound.

4. WatchdogSec Must Accommodate Startup Reconciliation

On systems with 437K+ entries, daemon startup reconciliation (verifying counters against kernel state) takes 20-30 seconds. WatchdogSec must be set to at least 120s to prevent systemd from killing the daemon before it reports ready. This is configured in nftband.service.
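
If you adjust this with a systemd drop-in rather than editing the shipped unit, the only required directive is WatchdogSec; the drop-in path follows the standard systemd convention and is not NFTBan-specific:

# Raise the watchdog window so startup reconciliation (20-30s at 437K entries) fits comfortably
mkdir -p /etc/systemd/system/nftband.service.d
printf '[Service]\nWatchdogSec=120\n' > /etc/systemd/system/nftband.service.d/watchdog.conf
systemctl daemon-reload
systemctl restart nftband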


Improvements Since Benchmarks

| Version | Change | Impact |
|---|---|---|
| v1.33.0 | Separate hash sets for manual bans (blacklist_manual_*) | Ban latency 73-118s → ~82ms on systems with very large sets |
| v1.33.0 | Set separation: feeds → interval, manual → hash | Eliminated O(n) penalty for interactive operations |
| v1.34.0 | Periodic reconciliation, schema validation | Drift detection between daemon and kernel state |
| v1.32.0-v1.39.0 | Maintenance script cache integration | 15-min cycle → seconds for count-based operations |

Recommendations

Production Configuration

# /etc/nftban/conf.d/daemon/opqueue.conf
OPQUEUE_MAX_BATCH_SIZE=5000
OPQUEUE_FLUSH_INTERVAL_MS=100
OPQUEUE_MAX_QUEUE_DEPTH=50000

Feed Refresh Strategy

| Feed Size | Recommended Interval |
|---|---|
| <50k IPs | Every 15 minutes |
| 50k-200k IPs | Every 30 minutes |
| 200k-500k IPs | Every 60 minutes |
| >500k IPs | Every 2-4 hours |

OS Selection

| Use Case | Recommended OS |
|---|---|
| Maximum throughput | Ubuntu 24.04 LTS |
| Enterprise stability | AlmaLinux 9 / RHEL 9 |
| Newest kernel features | Rocky 10.1 / Debian 13 |

Monitoring Metrics

Monitor these Prometheus metrics in production:

| Metric | Meaning |
|---|---|
| nftban_set_elements{family,set} | Element count per set |
| nftban_set_scale_level{family,set} | Scale level (0-5) |
| nftban_global_scale_level | System-wide scale |
| nftban_set_last_reconciled_seconds | Time since kernel verify |
| nftban_opqueue_pending_count | Queue depth |
| nftban_opqueue_total_applied | Operations applied |
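
A quick check that the exporter is serving these series (port 9108, as used by the stress-test exporter workers above):

# List the per-set element counts currently exposed by the Prometheus exporter
curl -s http://127.0.0.1:9108/metrics | grep '^nftban_set_elements'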

Interpreting Results

The three performance domains are independent:

  1. Bulk load throughput — batch netlink insertion speed (9-16K elem/sec). Determines feed sync time.
  2. Interactive ban latency — single-IP add/delete. Depends on set type (hash=O(1), interval=O(n)) and set size.
  3. Observability overhead — cost of counting/listing sets. v1.32.0 reduces this from O(n) kernel reads to O(1) cache reads.

A system that loads 500K IPs in 50 seconds (bulk) can still have 113-second interactive ban latency on that same set (interactive). These are not contradictory — they measure different operations.


Running Benchmarks

Transport Benchmarks (Go test suite)

# Requires root for netlink operations
sudo go test -bench=BenchmarkNetlink -benchtime=3s -v ./pkg/opqueue/...

# Quick benchmark (1k/5k only)
sudo go test -bench='BenchmarkNetlinkAddElements_(1|5)k' ./pkg/opqueue/...

v1.32.0 Stress Test

# Run on lab servers only — tests concurrent operations
DURATION=180 bash /tmp/nftban_stress.sh

Cache vs Kernel Comparison

# Cache read (v1.32.0 path)
time python3 -c "import json; print(json.load(open('/run/nftban/set_counts.json'))['sets']['blacklist_ipv4']['count'])"

# Kernel read (old path)
time nft list set ip nftban blacklist_ipv4 | wc -l

Benchmark History

Benchmarks were conducted at key architecture milestones:

  • v1.32.0 (March 2026): Cache-first counting, IntervalEnd fix, global flock, cross-kernel stress tests
  • v1.33.0 (March 2026): Set separation delivered — hash sets for interactive bans
  • v1.18.0 (February 2026): Initial IPC transport and netlink batch benchmarks



v1.32.0 benchmarks conducted 21 March 2026 across 5 lab servers (6 test runs: Debian 12, Debian 13, Rocky 10.1, Ubuntu 24.04, AlmaLinux 9 x2) with real measured data. v1.18.0 transport benchmarks conducted February 2026.
