Performance Benchmarks
Performance analysis of NFTBan from v1.18.0 transport benchmarks through v1.32.0 cache-first architecture, with real kernel-level measurements across 5 lab servers and 4 kernel versions (5.14, 6.1, 6.8, 6.12).
- Executive Summary
- Test Environments
- v1.32.0 Cache-First Counting
- Interactive Ban Latency
- Concurrent Stress Test
- IPC Transport Performance
- Netlink Batch Insert Performance
- OS Distribution Comparison
- Scalability Guidance
- Architecture Validation
- Recommendations
| Metric | Value | Notes |
|---|---|---|
| Set count read (cache) | 30-53ms | File read, O(1), kernel-independent |
| Set count read (kernel) | 13,270-19,300ms | `nft list set` on 437K entries, varies by kernel |
| Cache vs kernel speedup | 300-640x | Eliminates routine kernel reads |
| IPC scale query | 239-1,867ms | Daemon counter via socket, varies by kernel |
| Ban latency (empty set) | 77-82ms | Hash set or small interval set |
| Ban latency (437K interval) | 139-198ms | Single op, no contention |
| Ban latency (437K, concurrent) | 73-118s | Known limitation — kernel O(n) interval tree |
| Daemon memory (437K entries) | 388-546 MB | RSS |
| Cache reads during ban | 100% success | Non-blocking, zero failures |
| Daemon survival (stress) | 6/6 runs | No crashes, no SSH drops across all kernels |
| Metric | Value | Notes |
|---|---|---|
| IPC Ban Latency | ~90us | Unix socket round-trip |
| IPC Throughput | ~11,100 bans/sec | Single-threaded |
| Netlink Throughput | 9,000-16,000 elem/sec | Varies by kernel |
| Optimal Batch Size | 5,000 elements | Best throughput |
- v1.32.0 fixes observability: routine counting uses daemon cache (30-53ms), not kernel (13-19s)
- v1.32.0 does NOT fix write-path: single-IP ban into 437K interval set = 73-118s under concurrent load
- 500K is a practical bulk-feed target for batch operations, not for interactive single-IP inserts
- Interactive/manual bans must not share the same huge interval set as feeds (v1.33.0 delivered)
- 1M+ is feasible for bulk-only staged loads on stronger systems, but is not a blanket guarantee
| Server | OS | Kernel | CPU | MHz | RAM | L3 Cache |
|---|---|---|---|---|---|---|
| lab | Debian 12 | 6.1.0-38-amd64 | Intel Xeon Skylake (2 vCPU) | 2295 | 3.7 GiB | 16 MiB |
| lab (prev.) | Debian 13 | 6.12.41+deb13-cloud | Intel Xeon Skylake (2 vCPU) | 2295 | 3.7 GiB | 16 MiB |
| lab1 | Rocky 10.1 | 6.12.0-124.21.1.el10_1 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
| lab2 | Ubuntu 24.04.3 | 6.8.0-71-generic | AMD EPYC-Rome (2 vCPU) | 2495 | 3.7 GiB | 16 MiB |
| lab3 | AlmaLinux 9.7 | 5.14.0-611.16.1.el9_7 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
| lab4 | AlmaLinux 9.7 | 5.14.0-611.13.1.el9_7 | AMD EPYC-Rome (2 vCPU) | 2495 | 3.5 GiB | 16 MiB |
All servers: 1 NUMA node, no HT/SMT exposed.
Note: lab uses Intel Xeon Skylake (200 MHz slower than AMD EPYC-Rome on lab1-4). This affects absolute timings but not relative comparisons on the same server. For fair cross-distro comparison, tests should run on the same hardware with different OS installations.
- Go: 1.24.0 (daemon build, CGO_ENABLED=0 static binary)
- NFTBan: v1.32.0 (cache-first architecture)
- Test data: lab has 437,441 real entries (FIREHOL_PROXIES + TOR_EXITS feeds)
The core v1.32.0 change replaces routine nft list set kernel calls with daemon-owned in-memory counters and a JSON cache file.
| Method | Debian 12 (6.1) | Debian 13 (6.12) | Relative |
|---|---|---|---|
| Cache file read (`/run/nftban/set_counts.json`) | 30-33ms | 36-53ms | 1x (baseline) |
| IPC daemon counter (`nftban scale`) | 1,407-1,867ms | 239-296ms | ~6-50x |
| Kernel (`nft list set`) | 19,300ms | 13,270-13,786ms | 300-640x slower |
Key finding: Kernel nft list set on Debian 12 (kernel 6.1) is 30% slower than Debian 13 (kernel 6.12) for the same 437K set. Cache reads are similar across kernels. This confirms the cache-first approach is critical — it eliminates the kernel version dependency entirely.
| Consumer | Before (v1.31) | After (v1.32) |
|---|---|---|
| Unified exporter | `nft list set` (13-19s) | Cache file read (30-53ms) |
| Prometheus exporter | `nft -j list set` (13-19s) | Daemon counter (instant) |
| `nftban scale` CLI | N/A | Daemon IPC (260ms) |
| `nftban stats` | `nft list set` | Daemon counter |
| Watchdog | `GetSetElements()` | `GetSetCount()` |
Path: `/run/nftban/set_counts.json`
Written by daemon via atomic rename, maximum once per 10 seconds. All consumers (exporters, CLI, shell scripts) read this file instead of querying the kernel.
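The publish pattern can be sketched in shell (the daemon itself implements this in Go; paths and payload here are illustrative, written to a demo directory rather than `/run/nftban`):

```shell
# Sketch of the atomic-publish pattern: write the JSON to a temp file
# in the same directory, then rename it over the target. rename(2) is
# atomic on the same filesystem, so readers see either the old file or
# the new one, never a torn write.
dir=/tmp/nftban-demo
cache="$dir/set_counts.json"
mkdir -p "$dir"
tmp="$(mktemp "$dir/.set_counts.XXXXXX")"
printf '{"sets":{"blacklist_ipv4":{"count":437441}}}\n' > "$tmp"
mv -f "$tmp" "$cache"
cat "$cache"
```

Because the rename is atomic, no reader-side locking is needed; this is why cache reads never fail during concurrent kernel writes.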
See Large Set Management for full architecture details.
Single nftban ban <ip> measurements at different set sizes.
| Set Size | Distro | Kernel | Ban Latency | Scale Level |
|---|---|---|---|---|
| 0 (empty) | Rocky 10.1 | 6.12.0 | 78-82ms | NORMAL |
| 18 | Rocky 10.1 | 6.12.0 | 78-82ms | NORMAL |
| 1,064 | AlmaLinux 9.7 | 5.14.0 | 78-82ms | NORMAL |
| 437,441 | Debian 12 | 6.1.0 | 134-150ms | EXTREME |
| 437,441 | Debian 13 | 6.12.41 | 139-198ms | EXTREME |
| Scenario | Ban Latency | Notes |
|---|---|---|
| Single ban, no contention | 139-198ms | Kernel interval tree O(n) |
| 4 concurrent ban workers | 73-113s | IPC timeout, kernel serialization |
| Ban + maintenance concurrent | 73-113s | Maintenance alone takes 15min |
| Unban, concurrent | 196-198s | Kernel deletion even slower |
Root cause: nftables interval sets use a kernel interval tree with O(n) insert/delete. At 437K entries, a single add or delete traverses the entire tree. Under concurrent load, operations serialize at the kernel level.
Fix (v1.33.0 delivered): Separate hash sets for manual/interactive bans (O(1) lookup/insert), interval sets reserved for bulk feed data.
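The distinction between the two set shapes can be illustrated with plain `nft` syntax (table and set names here are examples, not NFTBan's actual naming; these commands require root):

```shell
nft add table ip demo
# Interval set: accepts CIDR ranges; single-element insert into a large
# set degrades O(n) because the kernel traverses the interval tree
nft add set ip demo feeds_v4 '{ type ipv4_addr; flags interval; }'
nft add element ip demo feeds_v4 '{ 203.0.113.0/24 }'
# Hash set: single addresses only; insert/delete stays O(1) at any size
nft add set ip demo manual_v4 '{ type ipv4_addr; flags timeout; }'
nft add element ip demo manual_v4 '{ 192.0.2.10 timeout 1h }'
```

The v1.33.0 split routes interactive bans to the hash-set shape while bulk feeds keep the interval shape they need for CIDR ranges.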
3-minute stress test with 13 concurrent workers: 4 ban, 2 unban, 2 scale, 4 exporter, 1 maintenance.
| Server | OS | Kernel | Set Size | Ban Avg | Ban Max | Scale Avg | Daemon RSS | Survived |
|---|---|---|---|---|---|---|---|---|
| lab | Debian 12 | 6.1.0 | 437K | 99-118s | 118s | 6.2-7.4s | 388-447 MB | YES |
| lab | Debian 13 | 6.12.41 | 437K | 73-113s | 113s | 2.4-3.4s | 414-546 MB | YES |
| lab1 | Rocky 10.1 | 6.12.0 | 18 | 222ms | 661ms | 331ms | 18-20 MB | YES |
| lab2 | Ubuntu 24.04 | 6.8.0 | ~0 | 236ms | 591ms | 371ms | 16 MB | YES |
| lab3 | AlmaLinux 9 | 5.14.0 | 1K | 232ms | 666ms | 424ms | 18-20 MB | YES |
| lab4 | AlmaLinux 9 | 5.14.0 | ~0 | 201ms | 735ms | 354ms | 17-19 MB | YES |
- All 6 test runs (5 servers, lab tested on both Debian 12 and 13) survived full duration with no crashes
- No SSH connection drops on any server (the original P0-3 symptom is eliminated)
- Cache file: 0 missing events across all servers, always readable during kernel operations
- Maintenance script on 437K set: 899 seconds (15 minutes) — this is the exact pattern that caused 100% CPU + SSH drops pre-v1.32.0. Now routed through cache.
- On small sets (<10K), ban/unban averages 191-236ms across all kernels — no degradation
- Debian 12 (kernel 6.1) is ~30% slower than Debian 13 (kernel 6.12) on the same hardware for large set operations
The stress test reports ~520-564 rc=7 failures per run from exporter workers. These are not code bugs. The exporter workers test curl http://127.0.0.1:9108/metrics — the Prometheus exporter endpoint. On lab servers where the Prometheus exporter service is not running, curl returns exit code 7 (connection refused). This validates that the test harness correctly detects when the exporter is unavailable, and confirms the exporter is not a required dependency.
PASS — Observability and counting:
- Routine counting reads come from daemon cache (30-53ms), not kernel (13-19s)
- Eliminates the 3-way concurrent `nft list set` contention that caused 100% CPU and SSH drops
- All servers survive stress tests; cache is always available
KNOWN LIMITATION — Write-path latency on huge interval sets:
- Single-IP ban/unban into a 437K interval set still takes 99-198ms (single) or 73-118s (concurrent)
- This is a kernel O(n) interval tree constraint — no userspace fix exists
- The v1.32.0 global flock prevents concurrent writes from compounding, but cannot speed up individual operations
- v1.33.0 delivered fix: separate hash sets for manual/interactive bans (O(1) insert), interval sets reserved for batch feed data only
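The global flock mentioned above is standard shell-level advisory locking; a minimal sketch of the pattern (the lock path is illustrative, not NFTBan's actual path):

```shell
# Serialize writers on one advisory lock: a second invocation blocks on
# flock until the first releases file descriptor 9 (when the subshell
# exits). This prevents concurrent kernel writes from compounding, but
# cannot make any individual write faster.
lock=/tmp/nftban-demo.lock
(
  flock -x 9                                   # exclusive lock on fd 9
  echo "lock held: kernel write would run here"
) 9>"$lock"
```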
| Server | Set Size | Maint Avg | Maint Max |
|---|---|---|---|
| lab | 437K | ~899s | ~899s |
| lab1 | 18 | 22.7s | 25.7s |
| lab2 | ~0 | 25.7s | 30.7s |
| lab3 | 1K | 29.0s | 34.5s |
| lab4 | ~0 | 30.3s | 33.2s |
The maintenance script runs shell-level nft list set calls. On the 437K set, this takes 15 minutes per invocation. v1.32.0 routes these reads through the cache file instead.
Measurements via Unix socket (/run/nftban/nftband.sock). These v1.18.0 benchmarks remain valid for the transport layer.
| Operation | Latency | Throughput |
|---|---|---|
| Ping | 57us | 17,600 ops/sec |
| Ban IP | 90us | 11,100 ops/sec |
| Unban IP | ~85us | ~11,700 ops/sec |
| Operation | Latency | Throughput |
|---|---|---|
| Set Add | 178ns | 5.6M ops/sec |
| Set Lookup | 169ns | 5.9M ops/sec |
| Set Union (10k) | 152us | 6,600 ops/sec |
| Diff (100k) | 65ms | 15 ops/sec |
IPC round-trip (~90us) is the bottleneck for interactive operations, not in-memory set operations.
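The throughput figures follow directly from the round-trip latency, since the transport is synchronous and single-threaded:

```shell
# ops/sec implied by a 90us round-trip: 1 / 0.000090 ~= 11,111,
# matching the measured ~11,100 bans/sec
awk 'BEGIN { printf "%d\n", 1 / 0.000090 }'
```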
Real kernel measurements using nftables hash sets with timeout flag (v1.18.0 benchmarks).
| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 106ms | 9,452 elem/sec | 13.3 MB |
| 5,000 | 482ms | 10,368 elem/sec | 66.7 MB |
| 10,000 | 911ms | 10,983 elem/sec | 133.3 MB |
| 20,000 | 2,013ms | 9,936 elem/sec | 266.6 MB |
| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 69ms | 14,578 elem/sec | 13.4 MB |
| 5,000 | 312ms | 16,007 elem/sec | 66.8 MB |
| 10,000 | 690ms | 14,491 elem/sec | 133.6 MB |
| 20,000 | 1,313ms | 15,230 elem/sec | 267.2 MB |
| Batch Size | Latency | Throughput | Memory |
|---|---|---|---|
| 1,000 | 107ms | 9,346 elem/sec | 13.4 MB |
| 5,000 | 542ms | 9,229 elem/sec | 66.8 MB |
| 10,000 | 997ms | 10,027 elem/sec | 133.6 MB |
| 20,000 | 2,178ms | 9,182 elem/sec | 267.2 MB |
| OS | Kernel | Batch Throughput | Ban Latency (small set) | Notes |
|---|---|---|---|---|
| Debian 13 | 6.12.41 | Not yet measured | 139-198ms (437K set) | Newest kernel |
| Rocky 10.1 | 6.12.0 | Not yet measured | 78-82ms | Newest RHEL |
| Ubuntu 24.04 | 6.8.0 | ~15,000 elem/sec | 236ms avg | Newer kernel |
| AlmaLinux 9.7 | 5.14.0 | ~9,500 elem/sec | 200-232ms avg | RHEL 9 kernel |
- Ubuntu 24.04 is ~50% faster than AlmaLinux 9.7 for netlink batch operations
- Kernel version matters: 6.8.0 vs 5.14.0 shows significant improvement in netlink batching
- Interactive ban latency is consistent across kernels for small sets (77-236ms)
- Large interval sets degrade on all kernels — this is a kernel data structure issue, not distro-specific
- Memory usage scales linearly: ~13MB per 1,000 kernel elements across all distros
| System | Max Bulk Entries | Max Interactive Set | Daemon RSS |
|---|---|---|---|
| 2 vCPU / 4 GB | 500K (aggregated) | 10K | ~500 MB |
| 4 vCPU / 8 GB | 1M (aggregated) | 50K | ~1 GB |
| 8+ vCPU / 16+ GB | 2M+ (aggregated) | 100K | ~2 GB |
The v1.18.0 benchmarks showed 500K IPs loading in ~50 seconds via streaming batch insert. This is correct for bulk feed loading into hash sets or staged interval sets.
However, v1.32.0 investigation revealed a separate constraint:
| Operation | 10K set | 100K set | 437K set |
|---|---|---|---|
| Single ban (add) | <100ms | ~100ms | 139-198ms |
| Single ban (concurrent) | <100ms | ~1s | 73-113s |
| `nft list set` | <100ms | ~3s | 13.5s |
| Maintenance cycle | ~1s | ~30s | 899s |
A 500K unified interval set is not acceptable for interactive single-IP insertions under concurrent load. The fix is architectural:
- Bulk feeds/geoban load into interval sets via batch operations (acceptable)
- Manual/interactive bans use separate hash sets with O(1) insert (v1.33.0 delivered)
- Observability reads from daemon cache, never from kernel (v1.32.0 implemented)
| Feed Size | AlmaLinux (~10k/s) | Ubuntu (~15k/s) |
|---|---|---|
| 10,000 IPs | ~1 second | ~0.7 seconds |
| 50,000 IPs | ~5 seconds | ~3.3 seconds |
| 100,000 IPs | ~10 seconds | ~6.7 seconds |
| 500,000 IPs | ~50 seconds | ~33 seconds |
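The sync times above are simple divisions of feed size by measured batch throughput; for the 500K row:

```shell
# entries / (elements per second) = seconds, using the measured rates
entries=500000
alma_rate=10000    # ~10k elem/sec (AlmaLinux 9.7, kernel 5.14)
ubuntu_rate=15000  # ~15k elem/sec (Ubuntu 24.04, kernel 6.8)
echo "AlmaLinux: $(( entries / alma_rate ))s, Ubuntu: $(( entries / ubuntu_rate ))s"
```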
| Mode | Memory Usage | Description |
|---|---|---|
| Streaming batch | O(batch_size) | Fixed ~10MB regardless of file size |
| Daemon counters | O(sets) | ~1KB per tracked set |
| Cache file | O(sets) | ~1.7KB for 8 sets |
| Kernel resident | O(n) | ~13MB per 1,000 elements |
| Architecture Decision | Validated By |
|---|---|
| Cache-first counting | 300-640x faster than kernel reads (30-53ms vs 13-19s), tested on 2 kernels |
| Non-blocking cache | 0 read failures during concurrent kernel writes, across 6 test runs |
| Daemon survival | All 6 test runs survived 3-min stress test (no crashes, no SSH drops) |
| IntervalEnd filtering | 874K halved to 437K (correct count) |
| Scale-adaptive intervals | EXTREME sets use 600s exporter, not 60s |
| Global flock | No concurrent nft list set during writes |
| WatchdogSec=120s | Daemon startup on 437K set takes 20-30s for reconciliation |
| Cross-kernel consistency | Tested on kernel 5.14, 6.1, 6.8, 6.12 — cache-first works on all |
| Architecture Decision | Validated By |
|---|---|
| Async IPC | IPC overhead (90us) does not block caller |
| Single writer daemon | No lock contention, consistent throughput |
| Streaming replace_set | O(batch) memory, constant regardless of file size |
| Priority scheduling | Fast lane not blocked by bulk operations |
| 5k batch size default | Optimal throughput across tested systems |
The nftables kernel uses an interval tree (rbtree) for interval sets (CIDR ranges). Each insert or delete traverses the full tree:
| Set Size | Single Insert | Concurrent (4 workers) |
|---|---|---|
| 10K | <100ms | <100ms |
| 100K | ~100ms | ~1s |
| 437K | 139-198ms | 73-118s |
This is a kernel data structure constraint. No userspace optimization can change it.
What is affected:
- `nftban ban <ip>` into a large interval set (manual single-IP bans)
- `nftban unban <ip>` from a large interval set
- Any concurrent write operations to the same large interval set
What is NOT affected:
- Bulk feed loading (uses batch `replace_set`, not single-IP insert)
- Reading set counts (v1.32.0 cache-first)
- Bans into small sets or hash sets (O(1))
- Packet matching/filtering (kernel fast path, not affected by set size for hash sets)
Mitigation (current v1.32.0):
- Global flock prevents concurrent writes from compounding
- Observability reads never touch the kernel
Fix (v1.33.0 delivered):
- Separate hash sets for manual/interactive bans (O(1) insert/delete)
- Interval sets reserved exclusively for bulk feed data (batch operations only)
- Manual ban: ~82ms (hash set) instead of 73-118s (interval set under load)
The kernel stores each CIDR range as two elements (start + end). A set with 437K logical entries has 874K kernel elements. All counting must filter IntervalEnd markers to report accurate numbers. v1.32.0+ handles this in the stats package (internal/stats/set_counters.go).
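The correction is a halving of the raw element count once IntervalEnd markers are excluded:

```shell
# Raw kernel elements include one IntervalEnd marker per range, so the
# logical entry count is raw / 2 (874,882 raw -> 437,441 logical)
raw=874882
echo $(( raw / 2 ))
```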
The nftban_maintenance.sh script still calls nft list set directly for some operations. On 437K sets, a single maintenance cycle takes ~15 minutes. v1.32.0 routes counting reads through the cache, but any operation that needs the full element list (expiry checking) remains kernel-bound.
On systems with 437K+ entries, daemon startup reconciliation (verifying counters against kernel state) takes 20-30 seconds. WatchdogSec must be set to at least 120s to prevent systemd from killing the daemon before it reports ready. This is configured in nftband.service.
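An illustrative excerpt of the relevant unit directives (values taken from the text above; `Type=notify` is an assumption based on the watchdog usage, and the full unit ships with the package):

```ini
# nftband.service (excerpt)
[Service]
Type=notify
# Startup reconciliation on a 437K set takes 20-30s; 120s gives the
# daemon ample headroom before systemd declares it hung.
WatchdogSec=120
```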
| Version | Change | Impact |
|---|---|---|
| v1.33.0 | Separate hash sets for manual bans (`blacklist_manual_*`) | Ban latency 73-118s → ~82ms on huge systems |
| v1.33.0 | Set separation: feeds → interval, manual → hash | Eliminated O(n) penalty for interactive operations |
| v1.34.0 | Periodic reconciliation, schema validation | Drift detection between daemon and kernel state |
| v1.32.0-v1.39.0 | Maintenance script cache integration | 15-min cycle → seconds for count-based operations |
```shell
# /etc/nftban/conf.d/daemon/opqueue.conf
OPQUEUE_MAX_BATCH_SIZE=5000
OPQUEUE_FLUSH_INTERVAL_MS=100
OPQUEUE_MAX_QUEUE_DEPTH=50000
```

| Feed Size | Recommended Interval |
|---|---|
| <50k IPs | Every 15 minutes |
| 50k-200k IPs | Every 30 minutes |
| 200k-500k IPs | Every 60 minutes |
| >500k IPs | Every 2-4 hours |
| Use Case | Recommended OS |
|---|---|
| Maximum throughput | Ubuntu 24.04 LTS |
| Enterprise stability | AlmaLinux 9 / RHEL 9 |
| Newest kernel features | Rocky 10.1 / Debian 13 |
Monitor these Prometheus metrics in production:
- `nftban_set_elements{family,set}` - Element count per set
- `nftban_set_scale_level{family,set}` - Scale level (0-5)
- `nftban_global_scale_level` - System-wide scale
- `nftban_set_last_reconciled_seconds` - Time since kernel verify
- `nftban_opqueue_pending_count` - Queue depth
- `nftban_opqueue_total_applied` - Operations applied
The three performance domains are independent:
- Bulk load throughput — batch netlink insertion speed (9-16K elem/sec). Determines feed sync time.
- Interactive ban latency — single-IP add/delete. Depends on set type (hash=O(1), interval=O(n)) and set size.
- Observability overhead — cost of counting/listing sets. v1.32.0 reduces this from O(n) kernel reads to O(1) cache reads.
A system that loads 500K IPs in 50 seconds (bulk) can still have 113-second interactive ban latency on that same set (interactive). These are not contradictory — they measure different operations.
```shell
# Requires root for netlink operations
sudo go test -bench=BenchmarkNetlink -benchtime=3s -v ./pkg/opqueue/...

# Quick benchmark (1k/5k only)
sudo go test -bench='BenchmarkNetlinkAddElements_(1|5)k' ./pkg/opqueue/...
```

```shell
# Run on lab servers only — tests concurrent operations
DURATION=180 bash /tmp/nftban_stress.sh
```

```shell
# Cache read (v1.32.0 path)
time python3 -c "import json; print(json.load(open('/run/nftban/set_counts.json'))['sets']['blacklist_ipv4']['count'])"

# Kernel read (old path)
time nft list set ip nftban blacklist_ipv4 | wc -l
```

Benchmarks were conducted at key architecture milestones:
- v1.32.0 (March 2026): Cache-first counting, IntervalEnd fix, global flock, cross-kernel stress tests
- v1.33.0 (March 2026): Set separation delivered — hash sets for interactive bans
- v1.18.0 (February 2026): Initial IPC transport and netlink batch benchmarks
- Large Set Management — Scale levels, cache architecture, adaptive timers
- Optimization Tools and Tweaks — CIDR aggregation and feed optimization
- Metrics Architecture — Metrics pipeline and exporters
- Systemd Units Overview — Service and timer configuration
v1.32.0 benchmarks conducted 21 March 2026 across 5 lab servers (6 test runs: Debian 12, Debian 13, Rocky 10.1, Ubuntu 24.04, AlmaLinux 9 x2) with real measured data. v1.18.0 transport benchmarks conducted February 2026.