# ctrlb-decompose

**Compress raw log lines into structural patterns with statistics, anomalies, and correlations.**

Turn millions of noisy log lines into a handful of actionable patterns, with typed variables, quantile stats, anomaly flags, and severity scoring. Runs as a CLI, in the browser via WASM, or as a Rust library.

```
$ cat server.log | ctrlb-decompose

┌────────────────────────────────────────────────────────────────────┐
│ ctrlb-decompose: 1,247,831 lines -> 43 patterns (99.9% reduction) │
└────────────────────────────────────────────────────────────────────┘

#1 [ERROR] ██████████████████████ 18,402 (1.5%)
    <TS> ERROR [<*>] Connection to <ip> timed out after <duration>

    ip        IPv4      unique=12     top: 10.0.1.15 (34%), 10.0.1.22 (28%)
    duration  Duration  p50=120ms  p99=4.8s

#2 [INFO] ████████████████████ 904,221 (72.5%)
    <TS> INFO [<*>] Request from <ip> completed in <duration> status=<status>

    ip        IPv4      unique=1,847  top: 10.0.1.15 (12%), 10.0.1.22 (8%)
    duration  Duration  p50=23ms  p99=312ms
    status    Enum      unique=3      values: 200 (91%), 404 (6%), 500 (3%)
```

> Website coming soon.

---

## How It Works

ctrlb-decompose uses a **two-stage normalization and clustering pipeline** (CLP encoding, then Drain3 clustering) that processes logs in a single streaming pass, keeping fixed-size sketches and bounded buffers per pattern instead of raw values.

```
        ┌────────────────────────────┐
        │  ctrlb-decompose pipeline  │
        └────────────────────────────┘

 Raw Log Lines
       │
       ▼
┌──────────────┐  Strip & parse timestamps (ISO 8601, Apache,
│ Timestamp    │  syslog, Unix epoch, etc.) into normalized
│ Extraction   │  <TS> markers with DateTime values.
└──────┬───────┘
       │
       ▼
┌──────────────┐  Replace integers, floats, IPs, and strings
│ CLP          │  with compact placeholder bytes. Structurally
│ Encoding     │  identical lines now produce the same "logtype."
└──────┬───────┘
       │
       ▼
┌──────────────┐  Tree-based similarity clustering (Drain3) groups
│ Drain3       │  logtypes into patterns. Differing tokens become
│ Clustering   │  <*> wildcards. Incremental: no second pass needed.
└──────┬───────┘
       │
       ▼
┌──────────────┐  Merge CLP-decoded values with Drain3 wildcard
│ Variable     │  positions. Classify each variable into semantic
│ Extraction   │  types: IPv4, UUID, Duration, Enum, Integer, etc.
│ & Typing     │
└──────┬───────┘
       │
       ▼
┌──────────────┐  DDSketch quantiles (p50/p99), HyperLogLog
│ Statistics   │  cardinality estimation, top-k values, temporal
│ Accumulation │  bucketing, and reservoir-sampled example lines.
└──────┬───────┘
       │
       ▼
┌──────────────┐  Frequency spikes, error cascades, low-cardinality
│ Anomaly      │  flags, bimodal distributions, and clustered
│ Detection    │  numeric detection.
└──────┬───────┘
       │
       ▼
┌──────────────┐  Keyword-based severity (ERROR > WARN > INFO > DEBUG),
│ Scoring      │  temporal co-occurrence, shared variable correlation,
│ & Correlation│  and error cascade detection across patterns.
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Output       │──── Human (ANSI terminal) / LLM (compact markdown) / JSON
└──────────────┘
```

### Stage 1: CLP Encoding

[CLP (Compressed Log Processor)](https://www.cs.toronto.edu/~zzhao/clp/) encoding normalizes variable tokens into typed placeholders, so structurally identical lines produce identical logtypes regardless of the actual values:

```
Input:   "Request from 10.0.1.15 completed in 45ms status=200"
Logtype: "Request from <dict> completed in <float>ms status=<int>"
```

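The idea can be sketched in a few lines of Python. This is only an illustration of the normalization step, not the tool's encoder: the function name `to_logtype`, the regexes, and the textual placeholders (`<dict>`, `<float>`, `<int>`, borrowed from the example above) are assumptions, and the real encoder emits compact placeholder bytes rather than strings.

```python
import re

# Hypothetical patterns for this sketch only; the real encoder's rules may differ.
_IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
_FLOAT = re.compile(r"(?<![\w.])-?\d+\.\d+")
_INT = re.compile(r"(?<![\w.])-?\d+(?![\d.])")


def to_logtype(line):
    """Replace variable-looking substrings with typed placeholders."""
    line = _IPV4.sub("<dict>", line)    # addresses first, so octets are not read as numbers
    line = _FLOAT.sub("<float>", line)  # then decimals, keeping any unit suffix
    line = _INT.sub("<int>", line)      # then the remaining integers
    return line


if __name__ == "__main__":
    print(to_logtype("Connection to 10.0.1.15 timed out after 4.8s"))
    print(to_logtype("Connection to 10.0.1.22 timed out after 0.12s"))
```

Both calls print the same string, which is the property the clustering stage relies on: lines that differ only in their values collapse to one logtype.
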
### Stage 2: Drain3 Clustering

The Drain algorithm routes each logtype through a fixed-depth prefix tree (keyed on token count, then on leading tokens) and groups logtypes whose token similarity meets a configurable threshold (default 0.4). Where tokens diverge, the template gains a `<*>` wildcard. Clustering is incremental: each line is processed once, with no second pass.

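A minimal Python sketch of the similarity test and the wildcard merge follows. It is deliberately simplified: it scans all templates linearly instead of walking a prefix tree, and it has no LRU eviction. The function names are hypothetical; only the 0.4 default threshold and the `<*>` wildcard come from the description above.

```python
def similarity(template, tokens):
    """Fraction of positions where the template and the line share the same token."""
    if len(template) != len(tokens):
        return 0.0  # only logtypes with the same token count are compared
    if not template:
        return 1.0
    same = sum(1 for t, x in zip(template, tokens) if t == x)
    return same / len(template)


def merge(template, tokens):
    """Keep matching tokens; positions that differ become the <*> wildcard."""
    return [t if t == x else "<*>" for t, x in zip(template, tokens)]


def cluster(logtypes, threshold=0.4):
    """Group logtypes incrementally: each one is seen once, in order."""
    templates = []
    for logtype in logtypes:
        tokens = logtype.split()
        best, best_sim = None, 0.0
        for i, template in enumerate(templates):
            sim = similarity(template, tokens)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            templates[best] = merge(templates[best], tokens)
        else:
            templates.append(tokens)
    return [" ".join(t) for t in templates]


if __name__ == "__main__":
    for pattern in cluster([
        "User alice logged in from <dict>",
        "User bob logged in from <dict>",
        "Disk usage at <int> percent",
    ]):
        print(pattern)
```

The first two lines share five of six tokens (similarity about 0.83), so they merge into one template with a wildcard in the user position; the third has a different token count and starts its own pattern.
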
### Variable Classification

Extracted variables are classified into semantic types for richer analysis:

| Type | Example | Detection |
|------|---------|-----------|
| `IPv4` / `IPv6` | `10.0.1.15` | IP address pattern match |
| `UUID` | `550e8400-e29b-...` | 8-4-4-4-12 hex format |
| `Duration` | `45ms`, `3.2s` | Numeric + time unit suffix |
| `HexID` | `0x1a2b3c` | 4+ hex digits |
| `Integer` | `200` | Parses as i64 |
| `Float` | `3.14` | Contains `.`, parses as f64 |
| `Enum` | `ERROR` | Low cardinality (<= 20 unique values, top 3 cover >= 80%) |
| `Timestamp` | `2024-01-15T14:22:01Z` | RFC 3339 pattern |
| `String` | anything else | Fallback |

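The table can be read as a chain of checks on each value, plus one check (`Enum`) that looks at the distribution of values in a slot. The Python sketch below illustrates that reading. The precedence order, the regexes, the accepted duration units, and the function names are all assumptions made for the example; the table does not specify them, nor how the `Enum` check combines with the per-value types.

```python
import ipaddress
import re
from collections import Counter

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_RFC3339 = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})$"
)
_DURATION = re.compile(r"^\d+(?:\.\d+)?(?:ns|us|ms|s|m|h)$")  # assumed unit list
_HEX = re.compile(r"^(?:0x)?[0-9a-fA-F]{4,}$")


def classify_value(value):
    """Classify one extracted value; the check order is an assumption."""
    try:
        return "IPv4" if ipaddress.ip_address(value).version == 4 else "IPv6"
    except ValueError:
        pass
    if _UUID.match(value):
        return "UUID"
    if _RFC3339.match(value):
        return "Timestamp"
    if _DURATION.match(value):
        return "Duration"
    try:
        if -(2**63) <= int(value) < 2**63:  # "parses as i64"
            return "Integer"
    except ValueError:
        pass
    if "." in value:
        try:
            float(value)
            return "Float"
        except ValueError:
            pass
    if _HEX.match(value):
        return "HexID"
    return "String"


def is_enum(values, max_unique=20, top_n=3, min_share=0.80):
    """Low-cardinality rule: few unique values, and the top ones dominate."""
    counts = Counter(values)
    if not counts or len(counts) > max_unique:
        return False
    top = sum(count for _, count in counts.most_common(top_n))
    return top / sum(counts.values()) >= min_share


if __name__ == "__main__":
    for v in ["10.0.1.15", "45ms", "0x1a2b3c", "200", "3.14", "hello"]:
        print(v, classify_value(v))
    print(is_enum(["200"] * 91 + ["404"] * 6 + ["500"] * 3))
```

Integers are checked before `HexID` here so that a value such as `2000` is not read as four hex digits; that ordering is a choice made for this sketch.
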
### Memory Efficiency

- **Drain3 clusters**: O(k) in the number of clusters, with LRU eviction (default 10k max)
- **Quantiles**: DDSketch, fixed ~200 bytes per numeric slot, no raw value storage
- **Cardinality**: HyperLogLog++, ~200 bytes per high-cardinality variable
- **Examples**: Reservoir sampling, bounded buffer per pattern

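To illustrate the last bullet, here is a short Python sketch of reservoir sampling (the classic "Algorithm R"), which keeps a uniform random sample of example lines in a fixed-size buffer without knowing the stream length in advance. The class name, its fields, and the capacity of 3 are hypothetical; the tool's actual buffer size and sampling variant are not stated here.

```python
import random


class Reservoir:
    """Fixed-capacity uniform sample over a stream of unknown length."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.seen = 0
        self.items = []
        self._rng = random.Random(seed)

    def offer(self, line):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(line)  # fill the buffer first
            return
        # Keep the new line with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = line


if __name__ == "__main__":
    sample = Reservoir(capacity=3, seed=0)
    for i in range(10_000):
        sample.offer(f"line {i}")
    print(sample.items)
```

Memory stays proportional to `capacity` no matter how many lines match a pattern, which is what makes per-pattern examples affordable in a single streaming pass.
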
---

## Installation

### macOS (Homebrew)

```bash
brew tap ctrlb-hq/tap
brew install ctrlb-decompose
```

### Debian / Ubuntu

```bash
curl -LO https://github.com/ctrlb-hq/ctrlb-decompose/releases/download/v0.1.0/ctrlb-decompose_0.1.0-1_amd64.deb
sudo dpkg -i ctrlb-decompose_0.1.0-1_amd64.deb
```

### Build from source

```bash
git clone https://github.com/ctrlb-hq/ctrlb-decompose.git
cd ctrlb-decompose
cargo build --release
# Binary at target/release/ctrlb-decompose
```

---

## Usage

```bash
# Pipe from stdin
cat /var/log/syslog | ctrlb-decompose

# Read from file
ctrlb-decompose server.log

# LLM-optimized output (compact, token-efficient)
ctrlb-decompose --llm app.log

# JSON output
ctrlb-decompose --json app.log

# Top 10 patterns with 3 example lines each
ctrlb-decompose --top 10 --context 3 app.log
```

### Options

```
ctrlb-decompose [OPTIONS] [FILE]

Arguments:
  [FILE]  Log file path (reads stdin if omitted or "-")

Options:
      --human        Human-readable output with colors (default)
      --llm          LLM-optimized compact markdown
      --json         Structured JSON output
      --top <N>      Show top N patterns (default: 20)
      --context <N>  Example lines per pattern (default: 0)
      --no-color     Disable ANSI colors
      --no-banner    Suppress header/footer
  -q, --quiet        Suppress progress messages
  -h, --help         Show help
  -V, --version      Show version
```

---

## Output Formats

| Format | Flag | Best for |
|--------|------|----------|
| **Human** | `--human` (default) | Terminal investigation: colored, visual bars |
| **LLM** | `--llm` | Feeding into LLMs: compact, token-efficient markdown |
| **JSON** | `--json` | Programmatic consumption: structured, machine-readable |

---

## License

[MIT](LICENSE)