@@ -37,6 +37,7 @@ Documented command set:
 - `view`
 - `split`
 - `join`
+- `dedup`
 
 Current implementation status for release `0.1`:
 
@@ -155,6 +156,65 @@ Potential validation:
 - sequence continuity checks
 - timestamp monotonicity checks
 
+## `ljx dedup`
+
+Deduplicate log records by collapsing identical or structurally similar bodies.
+
+Three modes, each building on the previous:
+
+- `exact` -- collapse records with byte-identical bodies within the same bucket.
+- `hash2` (default) -- canonicalise bodies (normalise numbers, IDs, paths, timestamps),
+  then collapse records sharing the same canonical form.
+- `full` -- after `hash2`, run Drain3 template mining on remaining singletons to catch
+  near-duplicates that differ by alphabetic tokens.
+
+Records are partitioned into buckets by `(service.name, severity_number)` before any
+dedup. No stage ever merges records across buckets.
+
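The relationship between bucketing, `exact`, and `hash2` can be sketched in Python as below. This is an illustration only, not the `ljx` implementation: the flat-dict record layout, the regex rules, the `<TS>`/`<PATH>`/`<NUM>` placeholders, and the choice of SHA-256 for the signature are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical normalisation rules. The real canonicaliser's rules and
# placeholder names are not specified in this document.
_RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TS>"),
    (re.compile(r"(?:/[\w.\-]+)+"), "<PATH>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def canonicalise(body):
    """Replace variable-looking fragments with fixed placeholders."""
    for pattern, placeholder in _RULES:
        body = pattern.sub(placeholder, body)
    return body


def dedup(records, mode="hash2"):
    """Collapse records per bucket; returns one dict per group, first-seen order."""
    groups = {}  # plain dicts keep insertion order
    for rec in records:
        # Records in different buckets never share a key, so they never merge.
        bucket = (rec["service.name"], rec["severity_number"])
        form = rec["body"] if mode == "exact" else canonicalise(rec["body"])
        signature = hashlib.sha256(form.encode("utf-8")).hexdigest()
        key = (bucket, signature)
        if key not in groups:
            # Keep the body of the first-seen record for the group.
            groups[key] = {"bucket": bucket, "body": rec["body"], "count": 0}
        groups[key]["count"] += 1
    return list(groups.values())
```

Given two records in the same bucket whose bodies differ only in a number, `mode="exact"` keeps them as separate groups, while `mode="hash2"` collapses them into one group that carries the first-seen body.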
+Intended examples:
+
+```text
+ljx dedup telemetry.logjet -o deduped.logjet
+ljx dedup telemetry.logjet -o deduped.logjet --mode=exact
+ljx dedup telemetry.logjet -o deduped.logjet --mode=full
+ljx dedup telemetry.logjet -o deduped.logjet --bucket-by=scope
+ljx dedup telemetry.logjet -o deduped.logjet --mode=full --sim-th=0.8
+```
+
+Each output record represents a group of collapsed inputs. The body of the first-seen
+record is preserved unchanged. Dedup metadata is added as attributes:
+
+- `dedup.count` -- number of records collapsed into this group
+- `dedup.mode` -- which stage produced the group (`exact`, `hash2`, `full/canon`, `full/drain3`)
+- `dedup.signature` -- hex hash identifying the group
+- `dedup.canonical_body` -- normalised body form (`hash2` and `full` modes)
+- `dedup.body_shape` -- detected body type (`json`, `kv`, `prefixed`, `freetext`)
+- `dedup.first_seen_ns`, `dedup.last_seen_ns` -- timestamp range of the group
+- `dedup.time_span_ms` -- duration the pattern was active, in milliseconds
+- `dedup.exemplar_trace_ids`, `dedup.exemplar_span_ids` -- up to 3 trace/span IDs for root-cause analysis (RCA)
+- `dedup.drain3_template` -- Drain3 template with `<*>` wildcards (`full` mode only)
+- `dedup.drain3_cluster_id` -- Drain3 cluster ID (`full` mode only)
+
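As a rough illustration of how the count, timestamp, and exemplar attributes above could be derived for one group, consider the sketch below. The input field names `time_unix_nano` and `trace_id` are assumptions, as is computing `dedup.time_span_ms` as the difference between last and first timestamps truncated to whole milliseconds; the actual `ljx` behaviour may differ.

```python
def summarise_group(records, max_exemplars=3):
    """Build a subset of the dedup.* attributes for one group of records."""
    first = min(r["time_unix_nano"] for r in records)
    last = max(r["time_unix_nano"] for r in records)

    # Collect distinct trace IDs in arrival order, up to the exemplar limit.
    exemplars = []
    for r in records:
        if len(exemplars) >= max_exemplars:
            break
        trace_id = r.get("trace_id")
        if trace_id and trace_id not in exemplars:
            exemplars.append(trace_id)

    return {
        "dedup.count": len(records),
        "dedup.first_seen_ns": first,
        "dedup.last_seen_ns": last,
        # Assumed definition: nanosecond range truncated to milliseconds.
        "dedup.time_span_ms": (last - first) // 1_000_000,
        "dedup.exemplar_trace_ids": exemplars,
    }
```

Keeping only a few exemplar IDs bounds the size of each output record while still leaving a handful of concrete traces to follow up on.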
+Non-log records (metrics, traces) pass through unchanged.
+
+Bucket extensions via `--bucket-by`:
+
+- `scope` -- add `instrumentation_scope.name` to the bucket key
+- `source_line` -- add `code.filepath` and `code.lineno` to the bucket key
+
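One way to picture the effect of `--bucket-by` is as extra components appended to the bucket key. The sketch below is hypothetical: it assumes records are flat dicts keyed by attribute name and that extensions can be combined, neither of which is stated in this document.

```python
def bucket_key(record, bucket_by=()):
    """Return the bucket key for a record, widened by optional extensions."""
    # Base key: always present, regardless of --bucket-by.
    key = [record.get("service.name"), record.get("severity_number")]
    if "scope" in bucket_by:
        key.append(record.get("instrumentation_scope.name"))
    if "source_line" in bucket_by:
        key.append(record.get("code.filepath"))
        key.append(record.get("code.lineno"))
    return tuple(key)
```

A wider key can only split buckets further, so adding an extension never causes records to merge that would otherwise have stayed apart.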
+Drain3-specific options (`full` mode only):
+
+- `--sim-th` -- similarity threshold, 0.0 to 1.0 (default 0.7)
+- `--drain-depth` -- prefix tree depth (default 3)
+- `--extra-delimiters` -- comma-separated extra token delimiters
+
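To give a feel for what the similarity threshold and extra delimiters control, here is a simplified stand-in for Drain-style matching. It is not Drain3's actual code: it ignores the prefix tree that `--drain-depth` governs and does no special handling of wildcard tokens, and all function names are made up for this example.

```python
def tokenise(body, extra_delimiters=()):
    """Split on whitespace after turning each extra delimiter into a space."""
    for delimiter in extra_delimiters:
        body = body.replace(delimiter, " ")
    return body.split()


def similarity(template_tokens, message_tokens):
    """Fraction of positions holding the same token; 0.0 if lengths differ."""
    if len(template_tokens) != len(message_tokens):
        return 0.0
    if not template_tokens:
        return 1.0
    same = sum(1 for t, m in zip(template_tokens, message_tokens) if t == m)
    return same / len(template_tokens)


def matches(template_tokens, message_tokens, sim_th=0.7):
    """A message joins a template when similarity reaches the threshold."""
    return similarity(template_tokens, message_tokens) >= sim_th


def merge(template_tokens, message_tokens):
    """Replace differing positions with the <*> wildcard."""
    return [t if t == m else "<*>" for t, m in zip(template_tokens, message_tokens)]
```

Under this simplified measure, `connection to primary lost` and `connection to replica lost` share 3 of 4 tokens (0.75). They would merge into `connection to <*> lost` at the default threshold of 0.7 but stay separate at `--sim-th=0.8`.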
+Expected properties:
+
+- output is valid `.logjet`
+- non-log records are preserved in their original order
+- `exact` and `hash2` modes are deterministic (same input, same output)
+- `full` mode is order-dependent (Drain3 can produce different templates for different input orders)
+
 ## Implementation Notes
 
 The simplest useful internal shape for `ljx` is: