Commit 8fc77b1

jhf and claude committed
doc: Add system requirements guide for on-premise deployments
Hardware sizing guidance for statistics offices provisioning VMware/Hyper-V servers, with T-shirt sizes (S/M/L/XL) based on legal unit count. Includes per-unit storage formulas derived from the Norwegian BRREG import (1.96M units, 18 GB), population-based estimation for countries without registers, and import duration benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a3bca5b commit 8fc77b1

1 file changed

+188
-0
lines changed

doc/system-requirements.md

@@ -0,0 +1,188 @@
1+
# System Requirements
2+
3+
This document helps IT administrators provision hardware for on-premise StatBus deployments (VMware, Hyper-V, or bare metal). All values are **minimums** — more resources are always beneficial and recommended when budget allows.
4+
5+
StatBus development is funded by NORAD (Norwegian Agency for Development Cooperation). Current and prospective deployments include statistics offices across Africa (Ghana, Ethiopia, Kenya, Uganda, Morocco), Asia (Mongolia, Laos), and European accession countries (Albania, Northern Cyprus, Ukraine). We hope Norway's own statistics office will also adopt StatBus in the future.
6+
7+
## Quick Reference
8+
9+
Size your deployment based on the total number of legal units in your register:
10+
11+
| Size | Legal Units | Example Countries | Disk (min) | RAM (min) | CPU (min) |
|------|-------------|-------------------|------------|-----------|-----------|
| S | < 50K | Laos, Mongolia, Northern Cyprus | 20 GB | 4 GB | 2 cores |
| M | 50K – 500K | Albania, Ghana, Kenya, Uganda, Ethiopia, Morocco | 50 GB | 8 GB | 4 cores |
| L | 500K – 2M | Nigeria, South Africa, Egypt; Norway (aspirational) | 120 GB | 16 GB | 4 cores |
| XL | 2M – 5M | Large economies if they adopt StatBus | 250 GB | 32 GB | 4 cores |
17+
18+
**All sizes are floors.** More disk leaves room for longer backup retention and import history. More RAM keeps more indexes cached, which means faster search responses. More CPU cores speed up initial imports (PostgreSQL uses parallel query). Extra resources never hurt, although returns diminish beyond 4 cores (see CPU Recommendation).
19+
20+
**CPU note:** 4 cores is recommended for all sizes; the 2 cores listed for size S is a functional minimum. PostgreSQL's parallel query and the 4 concurrent analytics backends both benefit from 4 cores. More cores (8+) shorten initial imports of very large registries but are not required: the system remains fully functional on 4 cores at any scale, just with longer import times. VMs are typically provisioned in powers of 2 (2, 4, 8 cores), so 4 is the practical sweet spot.
21+
22+
**Disk includes:** OS (~5 GB), Docker images (~3 GB), PostgreSQL WAL and temp space (~10–20% of data size), import job retention (upload and data tables persist up to 18 months), local database backups, and 40% free headroom (20% for VACUUM and reindexing, 20% for growth).
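
The quick-reference table can be read as a simple lookup. The sketch below is illustrative only: the function name `pick_size` is hypothetical, and it assumes that a count exactly on a boundary (for example 500K) falls into the larger tier, which the table does not specify.

```python
# Tiers copied from the quick-reference table:
# (exclusive upper bound on legal units, size, disk GB, RAM GB, CPU cores).
TIERS = [
    (50_000, "S", 20, 4, 2),
    (500_000, "M", 50, 8, 4),
    (2_000_000, "L", 120, 16, 4),
]
XL_MAX = 5_000_000  # the table gives no guidance above this count


def pick_size(legal_units: int) -> dict:
    """Return the minimum spec for a register with the given unit count."""
    if legal_units < 0 or legal_units > XL_MAX:
        raise ValueError("unit count outside the range covered by the table")
    for upper, size, disk_gb, ram_gb, cores in TIERS:
        if legal_units < upper:
            return {"size": size, "disk_gb": disk_gb, "ram_gb": ram_gb, "cores": cores}
    # Assumption: counts from 2M up to and including 5M map to XL.
    return {"size": "XL", "disk_gb": 250, "ram_gb": 32, "cores": 4}


print(pick_size(1_130_000))
```

Remember to pass the total unit count across all sources and time periods, not only current formal registrations.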
23+
24+
## Sizing Formula
25+
26+
Derived from the full Norwegian Business Register (BRREG) import: 1.96M units producing 18 GB of data.
27+
28+
### Per Legal Unit Storage
29+
30+
Each legal unit — including all related entities, derived tables, and indexes — requires:
31+
32+
| Component | Per LU | Notes |
|-----------|--------|-------|
| Base tables (LU + ES + enterprise + activity + location + ext_ident + contact + stats) | ~3.1 KB | Includes indexes |
| Derived tables (statistical_unit, timelines, timepoints, timesegments, facets, history) | ~8.5 KB | Search page indexes account for most of this |
| Import overhead (temporary, during initial load) | ~2.0 KB | Upload + data tables; reclaimable after import |
| **Total per LU** | **~13.6 KB** | **~13 GB per million legal units** |
38+
39+
Raw data formula:
40+
41+
Data (GB) = (legal_units / 1,000,000) x 13
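
As a quick check of the arithmetic, the per-unit components and the raw data formula can be written out as below. This is a minimal sketch; the names `PER_LU_KB` and `data_gb` are made up for illustration and are not part of StatBus.

```python
# Per-legal-unit storage components, in KB, from the table above.
PER_LU_KB = {
    "base_tables": 3.1,
    "derived_tables": 8.5,
    "import_overhead": 2.0,  # temporary; reclaimable after import
}


def total_per_lu_kb() -> float:
    """Sum of all per-unit components (about 13.6 KB)."""
    return sum(PER_LU_KB.values())


def data_gb(legal_units: int, gb_per_million: float = 13.0) -> float:
    """Raw data size: Data (GB) = (legal_units / 1,000,000) x 13."""
    return legal_units / 1_000_000 * gb_per_million


print(round(total_per_lu_kb(), 1), data_gb(500_000))
```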
42+
43+
### Recommended Disk Provisioning
44+
45+
The recommended disk size accounts for:
46+
47+
- **OS and system packages:** ~5 GB
- **PostgreSQL WAL and temp space:** ~15% of data size
- **Import job retention:** upload and data tables persist for 18 months (~2 KB/LU per active import)
- **Local backup snapshots:** at least 1x data size (a scheduled `pg_dump` is recommended)
- **VACUUM overhead:** ~20% free space for table maintenance
- **Growth headroom:** ~20% for new imports and temporal history accumulation
53+
54+
Formula:
55+
56+
Recommended Disk (GB) = Data x 3 + 10
57+
58+
This works out to roughly **40 GB per million legal units** plus a fixed 10 GB, as a minimum including all overheads:
59+
60+
- 1M LU -> 50 GB minimum (more is better for backup retention)
- 500K LU -> 30 GB minimum
- 50K LU -> 20 GB minimum (OS + Docker images set this floor regardless of data size)
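
The disk formula and the 20 GB floor combine as in the following sketch. The function name `recommended_disk_gb` is hypothetical, and the result is left unrounded, so it returns 49 and 29.5 where the list above rounds up to 50 and 30.

```python
def recommended_disk_gb(
    legal_units: int,
    gb_per_million: float = 13.0,
    floor_gb: float = 20.0,
) -> float:
    """Recommended Disk (GB) = Data x 3 + 10, never below the 20 GB floor."""
    data = legal_units / 1_000_000 * gb_per_million
    return max(floor_gb, data * 3 + 10)


for units in (50_000, 500_000, 1_000_000):
    print(units, recommended_disk_gb(units))
```

For 2M and 5M legal units the same formula gives 88 GB and 205 GB, so the L and XL minimums in the quick-reference table (120 GB and 250 GB) sit comfortably above it.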
63+
64+
### Establishment Multiplier
65+
66+
Norway has 0.73 establishments per legal unit. Countries with more multi-establishment enterprises will need proportionally more storage. The formulas above already include establishments at the Norwegian ratio.
67+
68+
### Informal Sector and Census Data
69+
70+
StatBus's temporal model supports loading entire census datasets for fixed time periods (e.g., informal sector census 2015–2020). Establishments from these censuses exist only within their census period and don't require change tracking across censuses — but they still contribute to graphs, aggregations, and the search index.
71+
72+
In developing economies with large informal sectors, the number of establishments from censuses can far exceed formal legal units. For example, a country with 200K formal legal units might load 2M informal establishments from a census.
73+
74+
**Size the system based on total units across all time periods**, not just current formal registrations.
75+
76+
## Estimating Your Country's Unit Count
77+
78+
If your country doesn't yet have a register, estimate the number of legal units from population:
79+
80+
| Economy type | LU per 1,000 population | Examples |
|--------------|------------------------|----------|
| Developed (Nordic, EU) | 70–200 | Norway, EU member states |
| Middle-income | 30–80 | Morocco, Albania, Ukraine |
| Developing (formal sector only) | 10–40 | Ghana, Ethiopia, Uganda, Mongolia, Laos |
85+
86+
**Important:** These ratios cover formal legal units only. Countries with large informal sectors may also load census data covering informal establishments — potentially 5–10x the formal count. When sizing, estimate the **total units across all sources and time periods**.
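
The population-based estimate can be sketched as follows. The dictionary keys and function names are hypothetical, and the informal multiplier is an input you must supply from your own census figures; the 5–10x range above is only an indication.

```python
# LU per 1,000 population (low, high), from the table above.
RATIOS = {
    "developed": (70, 200),
    "middle_income": (30, 80),
    "developing": (10, 40),
}


def estimate_legal_units(population: int, economy_type: str) -> tuple[int, int]:
    """Return a (low, high) range of formal legal units."""
    low, high = RATIOS[economy_type]
    return population * low // 1000, population * high // 1000


def total_units(formal_units: int, informal_multiplier: int = 0) -> int:
    """Formal units plus census establishments at a multiple of the formal count."""
    return formal_units + formal_units * informal_multiplier


print(estimate_legal_units(34_000_000, "developing"))
print(total_units(200_000, informal_multiplier=10))
```

The second call reproduces the earlier example of 200K formal legal units plus 2M informal establishments, giving 2.2M units to size for.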
87+
88+
### Reference Points from Public Data
89+
90+
| Country | Legal Units | Population | LU per 1,000 | Notes |
|---------|-------------|------------|---------------|-------|
| Norway | 1.13M | 5.5M | 206 | Provides our empirical sizing baseline |
| EU average | ~33M | 450M | ~73 | Across all member states |
| South Africa | ~2M | 60M | ~33 | Estimated formal businesses |
| Kenya | ~1.5M | 55M | ~27 | Estimated registered businesses |
| Ghana | ~500K | 34M | ~15 | Estimated registered businesses |
| Ethiopia | ~200K | 126M | ~2 | Formal only; informal census data could add millions |
98+
99+
## Memory Requirements
100+
101+
RAM is consumed by:
102+
103+
- **OS + Docker overhead:** ~1 GB
- **PostgreSQL shared_buffers:** 25% of total RAM (main query cache)
- **PostgreSQL work_mem per backend:** default 4 MB x concurrent backends (up to ~5 during analytics)
- **OS page cache:** remaining RAM serves as a secondary disk cache; this is where "warm" indexes live
- **Other containers:** worker (~50 MB), PostgREST (~100 MB), Next.js (~200 MB), Caddy (~30 MB) = ~400 MB total
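
To see how these consumers add up on a given machine, the breakdown can be sketched as a rough budget. This is an illustration only: the function name `ram_budget_mb` is hypothetical, and the figures are the approximate values listed above, not measurements.

```python
def ram_budget_mb(total_ram_gb: float, backends: int = 5, work_mem_mb: int = 4) -> dict:
    """Split total RAM (in MB) across the consumers listed above."""
    total = total_ram_gb * 1024
    os_docker = 1024                 # ~1 GB for OS + Docker
    shared_buffers = total * 0.25    # 25% of total RAM
    # Simplification: PostgreSQL applies work_mem per sort/hash operation,
    # so a backend running a complex query can use several multiples of it.
    work_mem = backends * work_mem_mb
    containers = 50 + 100 + 200 + 30  # worker, PostgREST, Next.js, Caddy
    page_cache = total - (os_docker + shared_buffers + work_mem + containers)
    return {
        "os_docker": os_docker,
        "shared_buffers": shared_buffers,
        "work_mem": work_mem,
        "containers": containers,
        "page_cache": page_cache,
    }


print(ram_budget_mb(16))
```

On a 16 GB machine this leaves roughly 4 GB in shared_buffers and over 10 GB for the OS page cache.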
108+
109+
### Why RAM Matters for Search
110+
111+
The search page's responsiveness depends on index caching. At Norway scale, `statistical_unit` has 3.9 GB of indexes. If these fit in shared_buffers + OS page cache, search is fast (~50 ms). If they spill to disk, search degrades to ~500 ms+ (still functional, but noticeably slower).
112+
113+
### RAM Sizing Formula
114+
115+
RAM (GB) = max(4, data_size_GB x 0.5)
116+
117+
- **4 GB minimum:** enough for small registries where all indexes fit in RAM
- **8 GB:** covers medium registries; most hot indexes cached
- **16 GB:** covers Norway-scale; all search indexes fit in memory
- **32 GB:** large registries with headroom for concurrent analytics + queries
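
The RAM formula can be sketched as below. Rounding the result up to the next power of two is an assumption made here to match typical VM memory sizes (4, 8, 16, 32 GB); the formula itself only gives the raw value. Both function names are hypothetical.

```python
import math


def ram_gb(data_size_gb: float) -> float:
    """RAM (GB) = max(4, data_size_GB x 0.5)."""
    return max(4.0, data_size_gb * 0.5)


def provision_ram_gb(data_size_gb: float) -> int:
    """Round the formula result up to the next power of two (assumption)."""
    needed = math.ceil(ram_gb(data_size_gb))
    return 1 << (needed - 1).bit_length()


# Norway-scale data set of 18 GB
print(ram_gb(18), provision_ram_gb(18))
```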
121+
122+
## CPU and Container Architecture
123+
124+
StatBus runs 5 Docker containers, but nearly all computation happens inside PostgreSQL:
125+
126+
| Container | Image Size | CPU Profile | RAM Profile |
|-----------|-----------|-------------|-------------|
| **db** (PostgreSQL 18) | 1.65 GB | Heavy: all query processing, analytics, imports | Dominant: shared_buffers + OS cache |
| **worker** (Crystal CLI) | 28 MB | Minimal: orchestrator only, dispatches SQL calls | ~50 MB |
| **rest** (PostgREST) | 613 MB | Light: HTTP-to-SQL translation | ~100 MB |
| **app** (Next.js) | 370 MB | Light: SSR minimal, mostly client-side rendering | ~200 MB |
| **proxy** (Caddy) | 127 MB | Negligible: reverse proxy + TLS | ~30 MB |
133+
134+
Docker images total: ~2.8 GB on disk.
135+
136+
The worker doesn't use separate threads — it dispatches tasks to PostgreSQL, which processes them as database backends. The analytics queue runs up to 4 concurrent PostgreSQL backends. During import, PostgreSQL is the bottleneck (I/O-bound for bulk inserts, CPU-bound for index maintenance).
137+
138+
### CPU Recommendation
139+
140+
- **2 cores:** functional minimum for small registries; imports will be slow but the system works
- **4 cores:** recommended for all sizes. PostgreSQL uses parallel query, and the 4 concurrent analytics backends benefit from it. This is the practical sweet spot for VM provisioning (VMs are typically allocated in powers of 2)
- **8+ cores:** beneficial for large registries (2M+ LU) where import speed matters, or when concurrent imports and user queries should not compete. Not required: 4 cores handle any registry size, just with longer imports
143+
144+
Returns diminish beyond 4 cores, so 4 is the recommended baseline for production deployments; 2 cores remains the functional minimum for small (S) registries.
145+
146+
## Import Duration Estimates
147+
148+
Based on Norway (1.96M units processed in 5h 23min wall clock on 4 cores):
149+
150+
- Import processing: ~6.8 ms per row (includes parsing, validation, upsert)
- Derivation pipeline: ~14.3 ms per statistical_unit row
152+
153+
Formula:
154+
155+
Hours = legal_units x 0.01 / 3600
156+
157+
Roughly **3 hours per million legal units** for a full initial import. The 0.01 factor is seconds per unit: 5h 23min (19,380 s) for 1.96M units is about 9.9 ms each.
158+
159+
| Legal Units | Estimated Import Time |
|-------------|----------------------|
| 50K | ~10 minutes |
| 200K | ~35 minutes |
| 500K | ~1.5 hours |
| 1M | ~3 hours |
| 2M | ~6 hours |
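
The duration formula behind this table, and the Norway benchmark it is calibrated on, can be checked with a short sketch. The function names are hypothetical, and the results are unrounded estimates, so they come out slightly below the rounded table values.

```python
def import_hours(units: int, seconds_per_unit: float = 0.01) -> float:
    """Hours = units x 0.01 / 3600."""
    return units * seconds_per_unit / 3600


def observed_seconds_per_unit(hours: int, minutes: int, units: int) -> float:
    """Wall-clock seconds per unit from a measured import run."""
    return (hours * 3600 + minutes * 60) / units


# Norway benchmark: 1.96M units in 5h 23min on 4 cores
print(round(observed_seconds_per_unit(5, 23, 1_960_000), 4))

for units in (50_000, 200_000, 500_000, 1_000_000, 2_000_000):
    print(units, round(import_hours(units), 2))
```

If you have timing data from a pilot import on your own hardware, passing its measured rate as `seconds_per_unit` should give a better estimate than the Norwegian default.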
166+
167+
Update imports (monthly/yearly) are much faster: only changed rows are processed, and the derivation pipeline is incremental.
168+
169+
## Disk I/O Considerations
170+
171+
- **SSD:** strongly recommended (indexes rely on random reads)
- **HDD:** functional but search response times will be 5–10x slower
- **NVMe:** diminishing returns vs SATA SSD for this workload
174+
175+
SSD is the single most impactful hardware choice after having sufficient RAM. If budget is constrained, prioritize SSD over extra RAM or CPU.
176+
177+
## Network Requirements
178+
179+
- **Outbound HTTPS (port 443):** required during setup for Docker image pulls and package updates
- **Inbound HTTPS (port 443):** required for user access (Caddy handles TLS termination)
- **Inbound PostgreSQL (port 5432):** optional, only if external tools need direct database access
- **Bandwidth:** minimal for normal operation; the web interface transfers small JSON payloads. Initial Docker image pull requires ~3 GB download.
183+
184+
## Operating System
185+
186+
StatBus is tested on Ubuntu 24.04 LTS. Any Linux distribution with Docker Engine 24+ and Docker Compose v2 should work. See `doc/harden-ubuntu-lts-24.md` for security hardening guidance.
187+
188+
Windows Server with Docker (WSL2 or Hyper-V backend) is not tested but should work. Linux is recommended for production.
