|
| 1 | +# System Requirements |
| 2 | + |
| 3 | +This document helps IT administrators provision hardware for on-premises StatBus deployments (VMware, Hyper-V, or bare metal). All values are **minimums**; provision more where budget allows. |
| 4 | + |
| 5 | +StatBus development is funded by NORAD (Norwegian Agency for Development Cooperation). Current and prospective deployments include statistics offices across Africa (Ghana, Ethiopia, Kenya, Uganda, Morocco), Asia (Mongolia, Laos), and European accession countries (Albania, Northern Cyprus, Ukraine). We hope Norway's own statistics office will also adopt StatBus in the future. |
| 6 | + |
| 7 | +## Quick Reference |
| 8 | + |
| 9 | +Size your deployment based on the total number of legal units in your register: |
| 10 | + |
| 11 | +| Size | Legal Units | Example Countries | Disk (min) | RAM (min) | CPU (min) | |
| 12 | +|------|-------------|-------------------|------------|-----------|-----------| |
| 13 | +| S | < 50K | Laos, Mongolia, Northern Cyprus | 20 GB | 4 GB | 2 cores | |
| 14 | +| M | 50K – 500K | Albania, Ghana, Uganda, Ethiopia, Morocco | 50 GB | 8 GB | 4 cores | |
| 15 | +| L | 500K – 2M | Kenya, Nigeria, South Africa, Egypt; Norway (aspirational) | 120 GB | 16 GB | 4 cores | |
| 16 | +| XL | 2M – 5M | Large economies if they adopt StatBus | 250 GB | 32 GB | 4 cores | |
| 17 | + |
| 18 | +**All sizes are floors.** More disk provides room for longer backup retention and import history. More RAM means faster search responses. More CPU cores speed up initial imports (PostgreSQL uses parallel query), although returns diminish beyond 4 cores. Extra disk and RAM are rarely wasted. |
| 19 | + |
| 20 | +**CPU note:** 4 cores is recommended for all sizes. PostgreSQL's parallel query and the 4 concurrent analytics backends benefit from 4 cores. More cores (8+) help during initial imports of very large registries but are not required — the system remains fully functional on 4 cores at any scale, just with longer import times. VMs are typically provisioned in powers of 2 (2, 4, 8 cores), so 4 is the practical sweet spot. |
| 21 | + |
| 22 | +**Disk includes:** OS (~5 GB), Docker images (~3 GB), PostgreSQL WAL and temp space (~10–20%), import job retention (upload and data tables persist up to 18 months), local database backups, and 40% free headroom for VACUUM, reindex, and growth. |
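
As a rough illustration, the Quick Reference table can be expressed as a lookup. This is a minimal sketch, not part of StatBus: the function name `pick_tier` is hypothetical, and sending boundary values (exactly 500K or 2M) to the larger tier is an assumption, since the table ranges overlap at those points.

```python
# Tiers copied from the Quick Reference table:
# (size, lower bound on legal units, disk GB, RAM GB, CPU cores)
TIERS = [
    ("S", 0, 20, 4, 2),
    ("M", 50_000, 50, 8, 4),
    ("L", 500_000, 120, 16, 4),
    ("XL", 2_000_000, 250, 32, 4),
]
MAX_LEGAL_UNITS = 5_000_000  # upper end of the XL row


def pick_tier(legal_units: int) -> dict:
    """Return the minimum hardware for a register of the given size.

    Assumption: a count that sits exactly on a boundary (e.g. 500K)
    is assigned to the larger tier.
    """
    if legal_units < 0 or legal_units > MAX_LEGAL_UNITS:
        raise ValueError("legal_units is outside the range covered by the table")
    chosen = TIERS[0]
    for tier in TIERS:
        if legal_units >= tier[1]:
            chosen = tier  # keep the highest tier whose lower bound is met
    size, _, disk_gb, ram_gb, cpu_cores = chosen
    return {"size": size, "disk_gb": disk_gb, "ram_gb": ram_gb, "cpu_cores": cpu_cores}


if __name__ == "__main__":
    print(pick_tier(30_000))
    print(pick_tier(1_130_000))
```

Remember to pass the total unit count across all sources and time periods, not only current formal registrations.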
| 23 | + |
| 24 | +## Sizing Formula |
| 25 | + |
| 26 | +Derived from the full Norwegian Business Register (BRREG) import: 1.96M legal units producing 18 GB of data. |
| 27 | + |
| 28 | +### Per Legal Unit Storage |
| 29 | + |
| 30 | +Each legal unit — including all related entities, derived tables, and indexes — requires: |
| 31 | + |
| 32 | +| Component | Per LU | Notes | |
| 33 | +|-----------|--------|-------| |
| 34 | +| Base tables (LU + ES + enterprise + activity + location + ext_ident + contact + stats) | ~3.1 KB | Includes indexes | |
| 35 | +| Derived tables (statistical_unit, timelines, timepoints, timesegments, facets, history) | ~8.5 KB | Search page indexes account for most of this | |
| 36 | +| Import overhead (temporary, during initial load) | ~2.0 KB | Upload + data tables; reclaimable after import | |
| 37 | +| **Total per LU** | **~13.6 KB** | **~13 GB per million legal units** | |
| 38 | + |
| 39 | +Raw data formula: |
| 40 | + |
| 41 | + Data (GB) = (legal_units / 1,000,000) x 13 |
| 42 | + |
| 43 | +### Recommended Disk Provisioning |
| 44 | + |
| 45 | +The recommended disk size accounts for: |
| 46 | + |
| 47 | +- **OS and system packages:** ~5 GB |
| 48 | +- **PostgreSQL WAL and temp space:** ~15% of data size |
| 49 | +- **Import job retention:** upload and data tables persist for 18 months (~2 KB/LU per active import) |
| 50 | +- **Local backup snapshots:** at least 1x data size (recommend `pg_dump` on schedule) |
| 51 | +- **VACUUM overhead:** needs ~20% free for table maintenance |
| 52 | +- **Growth headroom:** ~20% for new imports and temporal history accumulation |
| 53 | + |
| 54 | +Formula: |
| 55 | + |
| 56 | + Recommended Disk (GB) = Data x 3 + 10 |
| 57 | + |
| 58 | +This means roughly **40 GB per million legal units** on top of a fixed 10 GB base, including all overheads: |
| 59 | + |
| 60 | +- 1M LU -> 50 GB minimum (more is better for backup retention) |
| 61 | +- 500K LU -> 30 GB minimum |
| 62 | +- 50K LU -> 20 GB minimum (OS + Docker images set this floor regardless of data size) |
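
The two disk formulas can be combined into a small calculator. This is a sketch under stated assumptions: the function names are hypothetical, and the 20 GB floor is taken from the smallest row of the Quick Reference table.

```python
def data_size_gb(legal_units: int, gb_per_million: float = 13.0) -> float:
    """Raw data formula: Data (GB) = (legal_units / 1,000,000) x 13."""
    return legal_units / 1_000_000 * gb_per_million


def recommended_disk_gb(legal_units: int, floor_gb: float = 20.0) -> float:
    """Recommended Disk (GB) = Data x 3 + 10, never below the 20 GB floor."""
    return max(floor_gb, data_size_gb(legal_units) * 3 + 10)


if __name__ == "__main__":
    for lu in (50_000, 500_000, 1_000_000):
        print(f"{lu:>9,} LU -> {recommended_disk_gb(lu):.1f} GB")
```

For 1M legal units the formula gives 49 GB, which the list above rounds to 50 GB; for 50K the formula alone gives about 12 GB, so the 20 GB floor applies.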
| 63 | + |
| 64 | +### Establishment Multiplier |
| 65 | + |
| 66 | +Norway has 0.73 establishments per legal unit. Countries with more multi-establishment enterprises will need proportionally more storage. The formulas above already include establishments at the Norwegian ratio. |
| 67 | + |
| 68 | +### Informal Sector and Census Data |
| 69 | + |
| 70 | +StatBus's temporal model supports loading entire census datasets for fixed time periods (e.g., informal sector census 2015–2020). Establishments from these censuses exist only within their census period and don't require change tracking across censuses — but they still contribute to graphs, aggregations, and the search index. |
| 71 | + |
| 72 | +In developing economies with large informal sectors, the number of establishments from censuses can far exceed formal legal units. For example, a country with 200K formal legal units might load 2M informal establishments from a census. |
| 73 | + |
| 74 | +**Size the system based on total units across all time periods**, not just current formal registrations. |
| 75 | + |
| 76 | +## Estimating Your Country's Unit Count |
| 77 | + |
| 78 | +If your country doesn't yet have a register, estimate the number of legal units from population: |
| 79 | + |
| 80 | +| Economy type | LU per 1,000 population | Examples | |
| 81 | +|--------------|------------------------|----------| |
| 82 | +| Developed (Nordic, EU) | 70–200 | Norway, EU member states | |
| 83 | +| Middle-income | 30–80 | Morocco, Albania, Ukraine | |
| 84 | +| Developing (formal sector only) | 10–40 | Ghana, Ethiopia, Uganda, Mongolia, Laos | |
| 85 | + |
| 86 | +**Important:** These ratios cover formal legal units only. Countries with large informal sectors may also load census data covering informal establishments — potentially 5–10x the formal count. When sizing, estimate the **total units across all sources and time periods**. |
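
The population-based estimate can be sketched as follows. The ratios are copied from the table above; the function name and the economy-type keys are hypothetical labels chosen for this example. The result covers formal legal units only.

```python
# Legal units per 1,000 population (low, high), from the table above.
LU_PER_1000 = {
    "developed": (70, 200),
    "middle_income": (30, 80),
    "developing": (10, 40),
}


def estimate_legal_units(population: int, economy_type: str) -> tuple[int, int]:
    """Return a (low, high) range of formal legal units for a population."""
    low, high = LU_PER_1000[economy_type]
    return (population * low // 1000, population * high // 1000)


if __name__ == "__main__":
    low, high = estimate_legal_units(34_000_000, "developing")
    print(f"Estimated formal legal units: {low:,} to {high:,}")
```

Size for the high end of the range, and add any census-based informal establishments you plan to load before picking a deployment size.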
| 87 | + |
| 88 | +### Reference Points from Public Data |
| 89 | + |
| 90 | +| Country | Legal Units | Population | LU per 1,000 | Notes | |
| 91 | +|---------|-------------|------------|---------------|-------| |
| 92 | +| Norway | 1.13M | 5.5M | 206 | Provides our empirical sizing baseline | |
| 93 | +| EU average | ~33M | 450M | ~73 | Across all member states | |
| 94 | +| South Africa | ~2M | 60M | ~33 | Estimated formal businesses | |
| 95 | +| Kenya | ~1.5M | 55M | ~27 | Estimated registered businesses | |
| 96 | +| Ghana | ~500K | 34M | ~15 | Estimated registered businesses | |
| 97 | +| Ethiopia | ~200K | 126M | ~2 | Formal only; informal census data could add millions | |
| 98 | + |
| 99 | +## Memory Requirements |
| 100 | + |
| 101 | +RAM is consumed by: |
| 102 | + |
| 103 | +- **OS + Docker overhead:** ~1 GB |
| 104 | +- **PostgreSQL shared_buffers:** 25% of total RAM (main query cache) |
| 105 | +- **PostgreSQL work_mem per backend:** default 4 MB x concurrent backends (up to ~5 during analytics) |
| 106 | +- **OS page cache:** remaining RAM serves as secondary disk cache — this is where "warm" indexes live |
| 107 | +- **Other containers:** worker (~50 MB), PostgREST (~100 MB), Next.js (~200 MB), Caddy (~30 MB) = ~400 MB total |
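
To see how these components divide a given amount of RAM, the breakdown can be sketched as below. The function name is hypothetical and the figures are the approximate values from the list above. Note that `work_mem` is a per-operation limit in PostgreSQL, so a single backend running several sorts or hashes can use more than one multiple of it; treat that line as a rough lower bound.

```python
def ram_budget_gb(total_ram_gb: float, backends: int = 5, work_mem_mb: float = 4.0) -> dict:
    """Split total RAM into the consumers listed above (all values in GB)."""
    budget = {
        "os_and_docker": 1.0,                       # ~1 GB overhead
        "shared_buffers": 0.25 * total_ram_gb,      # 25% of total RAM
        "work_mem": backends * work_mem_mb / 1024,  # 4 MB x concurrent backends (lower bound)
        "other_containers": 0.4,                    # worker + PostgREST + Next.js + Caddy
    }
    # Whatever is left over is available to the OS page cache.
    budget["os_page_cache"] = total_ram_gb - sum(budget.values())
    return budget


if __name__ == "__main__":
    for name, gb in ram_budget_gb(16).items():
        print(f"{name:>17}: {gb:6.2f} GB")
```

On a 16 GB host this leaves roughly 10.6 GB for the OS page cache in addition to 4 GB of shared_buffers.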
| 108 | + |
| 109 | +### Why RAM Matters for Search |
| 110 | + |
| 111 | +The search page's responsiveness depends on index caching. At Norway scale, `statistical_unit` has 3.9 GB of indexes. If these fit in shared_buffers + OS page cache, search is fast (~50 ms). If they spill to disk, search degrades to ~500 ms+ (still functional, but noticeably slower). |
| 112 | + |
| 113 | +### RAM Sizing Formula |
| 114 | + |
| 115 | + RAM (GB) = max(4, data_size_GB x 0.5) |
| 116 | + |
| 117 | +- **4 GB minimum:** enough for small registries where all indexes fit in RAM |
| 118 | +- **8 GB:** covers medium registries; most hot indexes cached |
| 119 | +- **16 GB:** covers Norway-scale; all search indexes fit in memory |
| 120 | +- **32 GB:** large registries with headroom for concurrent analytics + queries |
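
The RAM formula and the tiers listed above can be combined in a short sketch. The function names are hypothetical, and rounding up to the next listed tier (4, 8, 16, 32 GB) is an assumption about how the formula result maps to a VM size.

```python
import math


def recommended_ram_gb(data_size_gb: float) -> float:
    """RAM (GB) = max(4, data_size_GB x 0.5)."""
    return max(4.0, data_size_gb * 0.5)


def round_up_to_tier(ram_gb: float, tiers=(4, 8, 16, 32)) -> int:
    """Round a RAM figure up to the next tier listed above.

    Assumption: beyond the largest tier, round up to a whole number of GB.
    """
    for tier in tiers:
        if ram_gb <= tier:
            return tier
    return math.ceil(ram_gb)


if __name__ == "__main__":
    norway_data_gb = 18  # data size reported for the BRREG import
    raw = recommended_ram_gb(norway_data_gb)
    print(f"formula: {raw} GB, provision: {round_up_to_tier(raw)} GB")
```

For the 18 GB Norwegian data set the formula gives 9 GB, which rounds up to the 16 GB tier described above.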
| 121 | + |
| 122 | +## CPU and Container Architecture |
| 123 | + |
| 124 | +StatBus runs 5 Docker containers, but nearly all computation happens inside PostgreSQL: |
| 125 | + |
| 126 | +| Container | Image Size | CPU Profile | RAM Profile | |
| 127 | +|-----------|-----------|-------------|-------------| |
| 128 | +| **db** (PostgreSQL 18) | 1.65 GB | Heavy — all query processing, analytics, imports | Dominant — shared_buffers + OS cache | |
| 129 | +| **worker** (Crystal CLI) | 28 MB | Minimal — orchestrator only, dispatches SQL calls | ~50 MB | |
| 130 | +| **rest** (PostgREST) | 613 MB | Light — HTTP-to-SQL translation | ~100 MB | |
| 131 | +| **app** (Next.js) | 370 MB | Light — SSR minimal, mostly client-side rendering | ~200 MB | |
| 132 | +| **proxy** (Caddy) | 127 MB | Negligible — reverse proxy + TLS | ~30 MB | |
| 133 | + |
| 134 | +Docker images total: ~2.8 GB on disk. |
| 135 | + |
| 136 | +The worker doesn't use separate threads — it dispatches tasks to PostgreSQL, which processes them as database backends. The analytics queue runs up to 4 concurrent PostgreSQL backends. During import, PostgreSQL is the bottleneck (I/O-bound for bulk inserts, CPU-bound for index maintenance). |
| 137 | + |
| 138 | +### CPU Recommendation |
| 139 | + |
| 140 | +- **2 cores:** functional minimum for small registries; import will be slow but the system works |
| 141 | +- **4 cores:** recommended for all sizes — PostgreSQL uses parallel query and the 4 concurrent analytics backends benefit from it. This is the practical sweet spot for VM provisioning (VMs are typically allocated in powers of 2) |
| 142 | +- **8+ cores:** beneficial for large registries (2M+ LU) where import speed matters, or when concurrent imports and user queries should not compete. Not required — 4 cores handles any registry size, just with longer imports |
| 143 | + |
| 144 | +Returns diminish beyond 4 cores, so 8+ cores mainly shorten large imports. 4 cores is the recommended baseline for any production deployment; 2 cores remains the functional minimum for size S. |
| 145 | + |
| 146 | +## Import Duration Estimates |
| 147 | + |
| 148 | +Based on Norway (1.96M units processed in 5h 23min wall clock on 4 cores): |
| 149 | + |
| 150 | +- Import processing: ~6.8 ms per row (includes parsing, validation, upsert) |
| 151 | +- Derivation pipeline: ~14.3 ms per statistical_unit row |
| 152 | + |
| 153 | +Formula: |
| 154 | + |
| 155 | + Hours = legal_units x 0.01 / 3600 |
| 156 | + |
| 157 | +Roughly **3 hours per million legal units** for a full initial import. |
| 158 | + |
| 159 | +| Legal Units | Estimated Import Time | |
| 160 | +|-------------|----------------------| |
| 161 | +| 50K | ~10 minutes | |
| 162 | +| 200K | ~35 minutes | |
| 163 | +| 500K | ~1.5 hours | |
| 164 | +| 1M | ~3 hours | |
| 165 | +| 2M | ~6 hours | |
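
The import-time formula can be checked against the Norway measurement with a short sketch. The function names are hypothetical; the 0.01 seconds per unit comes from the formula above.

```python
def estimated_import_hours(units: int, seconds_per_unit: float = 0.01) -> float:
    """Hours = units x 0.01 / 3600."""
    return units * seconds_per_unit / 3600


def format_duration(hours: float) -> str:
    """Render fractional hours as hours and whole minutes."""
    total_minutes = round(hours * 60)
    h, m = divmod(total_minutes, 60)
    return f"{h}h {m:02d}min"


if __name__ == "__main__":
    for units in (50_000, 200_000, 500_000, 1_000_000, 1_960_000):
        print(f"{units:>9,} units -> {format_duration(estimated_import_hours(units))}")
```

For 1.96M units the formula yields about 5h 27min, close to the measured 5h 23min. The table above rounds these figures up slightly.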
| 166 | + |
| 167 | +Update imports (monthly/yearly) are much faster — only changed rows are processed, and the derived pipeline is incremental. |
| 168 | + |
| 169 | +## Disk I/O Considerations |
| 170 | + |
| 171 | +- **SSD:** strongly recommended (indexes rely on random reads) |
| 172 | +- **HDD:** functional but search response times will be 5–10x slower |
| 173 | +- **NVMe:** diminishing returns vs SATA SSD for this workload |
| 174 | + |
| 175 | +SSD is the single most impactful hardware choice after having sufficient RAM. If budget is constrained, prioritize SSD over extra RAM or CPU. |
| 176 | + |
| 177 | +## Network Requirements |
| 178 | + |
| 179 | +- **Outbound HTTPS (port 443):** required during setup for Docker image pulls and package updates |
| 180 | +- **Inbound HTTPS (port 443):** required for user access (Caddy handles TLS termination) |
| 181 | +- **Inbound PostgreSQL (port 5432):** optional, only if external tools need direct database access |
| 182 | +- **Bandwidth:** minimal for normal operation; the web interface transfers small JSON payloads. Initial Docker image pull requires ~3 GB download. |
| 183 | + |
| 184 | +## Operating System |
| 185 | + |
| 186 | +StatBus is tested on Ubuntu 24.04 LTS. Any Linux distribution with Docker Engine 24+ and Docker Compose v2 will work. See `doc/harden-ubuntu-lts-24.md` for security hardening guidance. |
| 187 | + |
| 188 | +Windows Server with Docker (WSL2 or Hyper-V backend) is not tested but should work. Linux is recommended for production. |