Commit d153006

update README
1 parent 044e1cb commit d153006

1 file changed, +211 -0 lines changed

README.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# ctrlb-decompose

**Compress raw log lines into structural patterns with statistics, anomalies, and correlations.**

Turn millions of noisy log lines into a handful of actionable patterns, each with typed variables, quantile stats, anomaly flags, and severity scoring. Runs as a CLI, in the browser via WASM, or as a Rust library.

```
$ cat server.log | ctrlb-decompose

┌────────────────────────────────────────────────────────────────────┐
│ ctrlb-decompose: 1,247,831 lines -> 43 patterns (99.9% reduction)  │
└────────────────────────────────────────────────────────────────────┘

#1 [ERROR] ██████████████████████ 18,402 (1.5%)
   <TS> ERROR [<*>] Connection to <ip> timed out after <duration>

   ip        IPv4      unique=12  top: 10.0.1.15 (34%), 10.0.1.22 (28%)
   duration  Duration  p50=120ms  p99=4.8s

#2 [INFO] ████████████████████ 904,221 (72.5%)
   <TS> INFO [<*>] Request from <ip> completed in <duration> status=<status>

   ip        IPv4      unique=1,847  top: 10.0.1.15 (12%), 10.0.1.22 (8%)
   duration  Duration  p50=23ms  p99=312ms
   status    Enum      unique=3  values: 200 (91%), 404 (6%), 500 (3%)
```

> Website coming soon.

---

## How It Works

ctrlb-decompose uses a **two-stage normalization and clustering pipeline** that processes logs in a single streaming pass with a small memory footprint.

```
┌──────────────────────────────────────────────┐
│           ctrlb-decompose pipeline           │
└──────────────────────────────────────────────┘

Raw Log Lines


┌──────────────┐  Strip & parse timestamps (ISO 8601, Apache,
│ Timestamp    │  syslog, Unix epoch, etc.) into normalized
│ Extraction   │  <TS> markers with DateTime values.
└──────┬───────┘


┌──────────────┐  Replace integers, floats, IPs, and strings
│ CLP          │  with compact placeholder bytes. Structurally
│ Encoding     │  identical lines now produce the same "logtype."
└──────┬───────┘


┌──────────────┐  Tree-based similarity clustering (Drain3) groups
│ Drain3       │  logtypes into patterns. Differing tokens become
│ Clustering   │  <*> wildcards. Incremental: no second pass needed.
└──────┬───────┘


┌──────────────┐  Merge CLP-decoded values with Drain3 wildcard
│ Variable     │  positions. Classify each variable into semantic
│ Extraction   │  types: IPv4, UUID, Duration, Enum, Integer, etc.
│ & Typing     │
└──────┬───────┘


┌──────────────┐  DDSketch quantiles (p50/p99), HyperLogLog
│ Statistics   │  cardinality estimation, top-k values, temporal
│ Accumulation │  bucketing, and reservoir-sampled example lines.
└──────┬───────┘


┌──────────────┐  Frequency spikes, error cascades, low-cardinality
│ Anomaly      │  flags, bimodal distributions, and clustered
│ Detection    │  numeric values.
└──────┬───────┘


┌──────────────┐  Keyword-based severity (ERROR > WARN > INFO > DEBUG),
│ Scoring      │  temporal co-occurrence, shared variable correlation,
│ & Correlation│  and error cascade detection across patterns.
└──────┬───────┘


┌──────────────┐
│ Output       │──── Human (ANSI terminal) / LLM (compact markdown) / JSON
└──────────────┘
```

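The keyword-based severity ordering from the scoring stage (ERROR > WARN > INFO > DEBUG) can be illustrated with a small Rust sketch. The names `Severity` and `severity_of` are hypothetical, and the case-insensitive substring match is an assumption; the actual scoring code may work differently.

```rust
/// Severity levels, declared in ascending order so the derived `Ord`
/// gives `Error > Warn > Info > Debug`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Debug,
    Info,
    Warn,
    Error,
}

/// Keywords listed from most to least severe, so the first hit wins.
const KEYWORDS: [(&str, Severity); 4] = [
    ("ERROR", Severity::Error),
    ("WARN", Severity::Warn),
    ("INFO", Severity::Info),
    ("DEBUG", Severity::Debug),
];

/// Returns the highest-ranked severity keyword found in a pattern template.
/// Assumption: a plain case-insensitive substring search is good enough.
pub fn severity_of(template: &str) -> Option<Severity> {
    let upper = template.to_uppercase();
    for (keyword, severity) in KEYWORDS.iter() {
        if upper.contains(*keyword) {
            return Some(*severity);
        }
    }
    None
}
```

Because the enum derives `Ord`, patterns could then be sorted by severity with an ordinary comparison, for example `severity_of(a) > severity_of(b)`.
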
### Stage 1: CLP Encoding

[CLP (Compressed Log Processor)](https://www.cs.toronto.edu/~zzhao/clp/) encoding normalizes variable tokens into typed placeholders, so structurally identical lines produce identical logtypes regardless of the actual values:

```
Input:   "Request from 10.0.1.15 completed in 45ms status=200"
Logtype: "Request from <dict> completed in <float>ms status=<int>"
```

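To make the idea concrete, here is a minimal Rust sketch of placeholder encoding. It is not the encoder used by ctrlb-decompose: `encode_logtype` and `encode_value` are hypothetical names, and the rules (whitespace tokenization, `key=value` splitting, a numeric prefix followed by an alphabetic unit) are simplifying assumptions. Real CLP emits compact placeholder bytes rather than readable tags.

```rust
/// Encodes one value token, e.g. "200" -> "<int>", "3.2s" -> "<float>s".
fn encode_value(token: &str) -> String {
    // Split into a leading run of digits/dots and whatever follows (a unit such as "ms").
    let split = token
        .find(|c: char| !(c.is_ascii_digit() || c == '.'))
        .unwrap_or(token.len());
    let (number, suffix) = token.split_at(split);
    let unit_ok = suffix.chars().all(|c| c.is_ascii_alphabetic());

    if unit_ok && number.parse::<i64>().is_ok() {
        format!("<int>{}", suffix)
    } else if unit_ok && number.contains('.') && number.parse::<f64>().is_ok() {
        format!("<float>{}", suffix)
    } else if token.chars().any(|c| c.is_ascii_digit()) {
        // Anything else that contains a digit (IPs, IDs) becomes a dictionary variable.
        "<dict>".to_string()
    } else {
        // Static text is kept verbatim and becomes part of the logtype.
        token.to_string()
    }
}

/// Turns a raw line into a logtype. Runs of whitespace are collapsed to one space.
pub fn encode_logtype(line: &str) -> String {
    line.split_whitespace()
        .map(|token| match token.split_once('=') {
            Some((key, value)) => format!("{}={}", key, encode_value(value)),
            None => encode_value(token),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Under these assumptions, two requests that differ only in IP, duration, and status code map to the same logtype, which is what lets the clustering stage treat them as one structure.
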
### Stage 2: Drain3 Clustering

The Drain algorithm builds a prefix tree over logtypes and groups them by token similarity (configurable threshold, default 0.4). Where the tokens of a matching logtype and its template diverge, the template gains a `<*>` wildcard at that position. Clustering is incremental: each line is processed once, with no second pass.

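The matching and merging step can be sketched in Rust as follows. This is a simplified illustration, not Drain3 itself: it keeps only the first level of the tree (grouping by token count) and scans each group linearly, and `Clusterer`, `add`, and `similarity` are hypothetical names.

```rust
use std::collections::HashMap;

const WILDCARD: &str = "<*>";

/// Templates grouped by token count (the first level of a Drain-style tree).
#[derive(Default)]
pub struct Clusterer {
    by_len: HashMap<usize, Vec<Vec<String>>>,
}

/// Fraction of positions where the template token equals the log token.
/// Wildcard positions do not count as matches.
fn similarity(template: &[String], tokens: &[&str]) -> f64 {
    if template.is_empty() {
        return 1.0;
    }
    let same = template
        .iter()
        .zip(tokens)
        .filter(|(t, tok)| t.as_str() == **tok)
        .count();
    same as f64 / template.len() as f64
}

impl Clusterer {
    /// Adds one logtype and returns the (possibly updated) template it joined.
    pub fn add(&mut self, logtype: &str, threshold: f64) -> String {
        let tokens: Vec<&str> = logtype.split_whitespace().collect();
        let bucket = self.by_len.entry(tokens.len()).or_default();

        // Find the most similar existing template with the same token count.
        let mut best: Option<(usize, f64)> = None;
        for (i, template) in bucket.iter().enumerate() {
            let sim = similarity(template, &tokens);
            if best.map_or(true, |(_, b)| sim > b) {
                best = Some((i, sim));
            }
        }

        match best {
            Some((i, sim)) if sim >= threshold => {
                // Merge: every diverging position becomes a wildcard.
                for (slot, tok) in bucket[i].iter_mut().zip(&tokens) {
                    if slot.as_str() != *tok {
                        *slot = WILDCARD.to_string();
                    }
                }
                bucket[i].join(" ")
            }
            _ => {
                // No template is similar enough: start a new cluster.
                bucket.push(tokens.iter().map(|t| t.to_string()).collect());
                tokens.join(" ")
            }
        }
    }
}
```

Calling `add` with `"user alice logged in"` and then `"user bob logged in"` at threshold 0.4 yields the template `user <*> logged in` on the second call, since three of the four tokens match.
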
### Variable Classification

Extracted variables are classified into semantic types for richer analysis:

| Type | Example | Detection |
|------|---------|-----------|
| `IPv4` / `IPv6` | `10.0.1.15` | IP address pattern match |
| `UUID` | `550e8400-e29b-...` | 8-4-4-4-12 hex format |
| `Duration` | `45ms`, `3.2s` | Numeric + time unit suffix |
| `HexID` | `0x1a2b3c` | 4+ hex digits |
| `Integer` | `200` | Parses as i64 |
| `Float` | `3.14` | Contains `.`, parses as f64 |
| `Enum` | `ERROR` | Low cardinality (<= 20 unique values, top 3 cover >= 80%) |
| `Timestamp` | `2024-01-15T14:22:01Z` | RFC 3339 pattern |
| `String` | anything else | Fallback |

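A rough Rust sketch of how the detection rules in the table could be applied is shown below. The names `VarType`, `classify`, and `is_enum` are hypothetical, and the check order, the accepted time units, and the optional `0x` prefix are assumptions. `Timestamp` detection is left out to keep the sketch dependency-free, and `Enum` is handled separately because it depends on the value distribution of a whole variable rather than on one value.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VarType { Ipv4, Ipv6, Uuid, Integer, Float, Duration, HexId, String }

const UUID_GROUPS: [usize; 5] = [8, 4, 4, 4, 12];
// Assumed set of time units; the real list may differ.
const TIME_UNITS: [&str; 6] = ["ns", "us", "ms", "s", "m", "h"];

fn is_uuid(s: &str) -> bool {
    let groups: Vec<&str> = s.split('-').collect();
    groups.len() == UUID_GROUPS.len()
        && groups
            .iter()
            .zip(UUID_GROUPS.iter())
            .all(|(g, &n)| g.len() == n && g.chars().all(|c| c.is_ascii_hexdigit()))
}

fn is_duration(s: &str) -> bool {
    // Numeric prefix followed by a known time unit, e.g. "45ms" or "3.2s".
    let split = s.find(|c: char| !(c.is_ascii_digit() || c == '.')).unwrap_or(s.len());
    let (number, unit) = s.split_at(split);
    number.parse::<f64>().is_ok() && TIME_UNITS.contains(&unit)
}

fn is_hex_id(s: &str) -> bool {
    let digits = s.strip_prefix("0x").unwrap_or(s);
    digits.len() >= 4 && digits.chars().all(|c| c.is_ascii_hexdigit())
}

/// Classifies a single value. Integers are checked before hex IDs so that
/// plain decimal numbers such as "12345" are not reported as HexId.
pub fn classify(value: &str) -> VarType {
    if value.parse::<Ipv4Addr>().is_ok() { VarType::Ipv4 }
    else if value.parse::<Ipv6Addr>().is_ok() { VarType::Ipv6 }
    else if is_uuid(value) { VarType::Uuid }
    else if value.parse::<i64>().is_ok() { VarType::Integer }
    else if value.contains('.') && value.parse::<f64>().is_ok() { VarType::Float }
    else if is_duration(value) { VarType::Duration }
    else if is_hex_id(value) { VarType::HexId }
    else { VarType::String }
}

/// Enum rule from the table: at most 20 unique values, and the three most
/// frequent values cover at least 80% of all occurrences.
/// `counts` holds one occurrence count per distinct value.
pub fn is_enum(counts: &[u64]) -> bool {
    let total: u64 = counts.iter().sum();
    if total == 0 || counts.len() > 20 {
        return false;
    }
    let mut sorted = counts.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    let top3: u64 = sorted.iter().take(3).sum();
    top3 * 100 >= total * 80
}
```

With the `status` distribution from the sample output (91%, 6%, 3%), `is_enum(&[91, 6, 3])` holds, while five equally frequent values fail the top-3 coverage rule.
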
### Memory Efficiency

- **Drain3 clusters**: O(k) memory for k clusters, with LRU eviction (default cap: 10k clusters)
- **Quantiles**: DDSketch, a fixed ~200 bytes per numeric slot with no raw value storage
- **Cardinality**: HyperLogLog++, ~200 bytes per high-cardinality variable
- **Examples**: Reservoir sampling into a bounded buffer per pattern

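As an illustration of why quantiles need no raw value storage, here is a minimal Rust sketch of the DDSketch idea: each positive value is mapped to a logarithmic bucket, and only per-bucket counters are kept. `QuantileSketch` is a hypothetical name, and this version omits the bucket collapsing that a real DDSketch uses to cap memory, so unlike the fixed-size slots described above its bucket map can grow with the value range.

```rust
use std::collections::BTreeMap;

/// Quantile sketch with relative accuracy `alpha` (e.g. 0.01 for 1%).
pub struct QuantileSketch {
    gamma: f64,                   // bucket growth factor: (1 + alpha) / (1 - alpha)
    buckets: BTreeMap<i32, u64>,  // bucket index -> number of values seen
    count: u64,
}

impl QuantileSketch {
    pub fn new(alpha: f64) -> Self {
        QuantileSketch {
            gamma: (1.0 + alpha) / (1.0 - alpha),
            buckets: BTreeMap::new(),
            count: 0,
        }
    }

    /// Records a positive value. Zero, negative, and NaN inputs are skipped
    /// in this sketch; a full implementation tracks them separately.
    pub fn add(&mut self, value: f64) {
        if !(value > 0.0) {
            return;
        }
        // Bucket i covers the interval (gamma^(i-1), gamma^i].
        let index = (value.ln() / self.gamma.ln()).ceil() as i32;
        *self.buckets.entry(index).or_insert(0) += 1;
        self.count += 1;
    }

    /// Estimates the q-quantile for q in 0.0..=1.0, or None when empty.
    pub fn quantile(&self, q: f64) -> Option<f64> {
        if self.count == 0 {
            return None;
        }
        let rank = (q * (self.count - 1) as f64).floor() as u64;
        let mut seen = 0u64;
        // Walk buckets in ascending order until the target rank is covered.
        for (&index, &n) in &self.buckets {
            seen += n;
            if seen > rank {
                // Representative value of the bucket, within alpha of any member.
                return Some(2.0 * self.gamma.powi(index) / (self.gamma + 1.0));
            }
        }
        None
    }
}
```

Feeding in the durations 1 through 1000 with `alpha = 0.01` gives p50 and p99 estimates within about 1% of 500 and 990, while storing only bucket counters.
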
---

## Installation

### macOS (Homebrew)

```bash
brew tap ctrlb-hq/tap
brew install ctrlb-decompose
```

### Debian / Ubuntu

```bash
curl -LO https://github.com/ctrlb-hq/ctrlb-decompose/releases/download/v0.1.0/ctrlb-decompose_0.1.0-1_amd64.deb
sudo dpkg -i ctrlb-decompose_0.1.0-1_amd64.deb
```

### Build from source

```bash
git clone https://github.com/ctrlb-hq/ctrlb-decompose.git
cd ctrlb-decompose
cargo build --release
# Binary at target/release/ctrlb-decompose
```

---

## Usage

```bash
# Pipe from stdin
cat /var/log/syslog | ctrlb-decompose

# Read from file
ctrlb-decompose server.log

# LLM-optimized output (compact, token-efficient)
ctrlb-decompose --llm app.log

# JSON output
ctrlb-decompose --json app.log

# Top 10 patterns with 3 example lines each
ctrlb-decompose --top 10 --context 3 app.log
```

### Options

```
ctrlb-decompose [OPTIONS] [FILE]

Arguments:
  [FILE]  Log file path (reads stdin if omitted or "-")

Options:
      --human          Human-readable output with colors (default)
      --llm            LLM-optimized compact markdown
      --json           Structured JSON output
      --top <N>        Show top N patterns (default: 20)
      --context <N>    Example lines per pattern (default: 0)
      --no-color       Disable ANSI colors
      --no-banner      Suppress header/footer
  -q, --quiet          Suppress progress messages
  -h, --help           Show help
  -V, --version        Show version
```

197+
---
198+
199+
## Output Formats
200+
201+
| Format | Flag | Best for |
202+
|--------|------|----------|
203+
| **Human** | `--human` (default) | Terminal investigation — colored, visual bars |
204+
| **LLM** | `--llm` | Feeding into LLMs — compact, token-efficient markdown |
205+
| **JSON** | `--json` | Programmatic consumption — structured, machine-readable |
206+
---

## License

[MIT](LICENSE)
