Commit 306aa40
authored
Phase 08: Adaptive High-Cardinality Indexing (#19)
* test(08-01): add failing test for adaptive hot-term promotion
- Locks adaptive config defaults for high-cardinality paths
- Expects adaptive path metadata and promoted exact terms
- Verifies hot terms outrank long-tail values by RG coverage
* feat(08-01): add adaptive finalize-time string promotion
- Add adaptive string path mode, config knobs, and summary metadata
- Build promoted exact terms and deterministic bucket fallbacks at finalize time
- Cover hot-term promotion with a focused regression test
* test(08-01): add failing tests for adaptive query fallback
- Lock bucket-backed positive lookups to no-false-negative supersets
- Require non-promoted adaptive negatives to stay conservative
- Preserve exact negative inversion for promoted hot terms
* feat(08-01): route adaptive string queries through exact and buckets
- Add shared adaptive lookup with explicit exact-vs-lossy results
- Use bucket-backed supersets for adaptive EQ and IN tail terms
- Keep adaptive NE and NIN conservative unless the match was exact
* test(08-01): update threshold property for three-mode strings
- Exercise exact, adaptive-hybrid, and bloom-only high-cardinality outcomes
- Assert threshold breaches may resolve to adaptive or bloom-only modes
- Keep positive EQ lookups false-negative-free across all three cases
* docs(08-01): complete adaptive-high-cardinality-indexing plan
Tasks completed: 3/3
- Add adaptive-hybrid path structures and finalize-time promotion
- Route string membership queries through exact promotion and bucket fallback safely
- Update property coverage for the three-mode threshold contract
SUMMARY: .planning/phases/08-adaptive-high-cardinality-indexing/08-01-SUMMARY.md
* test(08-02): add failing tests for adaptive serialization
- Covers adaptive config round-trip expectations
- Covers adaptive path metadata and bitmap round-trip
- Locks oversized adaptive bucket section rejection
* feat(08-02): persist adaptive serialization metadata
- Bump the wire format to version 5 for adaptive sections
- Serialize adaptive config knobs and per-path adaptive indexes explicitly
- Reject malformed adaptive path, term, and bucket counts on decode
* fix(08-03): restore benchmark baseline from stray 08-02 tests
- remove incomplete adaptive serialization RED tests from serialize_security_test.go
- realign verification to the requested 424aceb content boundary
* test(08-03): add skewed adaptive benchmark fixture
- add deterministic skewed head-tail fixture generation for $.user_id
- define explicit exact, bloom-only, and adaptive-hybrid benchmark configs
- add BenchmarkAdaptiveHighCardinality with mode= and shape= naming
* test(08-02): add failing tests for CLI adaptive info output
- Locks exact, bloom-only, and adaptive-hybrid mode labels
- Verifies adaptive summary counters in formatter output
- Uses local-only index fixtures without S3 dependencies
* feat(08-02): expose adaptive mode in CLI info output
- Extract index info rendering to a reusable writer helper
- Report exact, bloom-only, and adaptive-hybrid modes per path
- Include compact adaptive promoted, bucket, threshold, and cap counters
* docs(08-02): document adaptive high-cardinality behavior
- Describe exact, adaptive-hybrid, and bloom-only string path modes
- Document adaptive config defaults and tuning knobs by name
- Replace bloom-only-only language in comparisons and examples
* fix(08-03): remove stray 08-02 cli red tests
- drop unfinished adaptive CLI info tests from cmd/gin-index/main_test.go
- keep 08-03 verification scoped to the requested 424aceb baseline
* test(08-03): report adaptive pruning and size metrics
- report candidate_rgs and encoded_bytes for hot and tail probes
- assert adaptive hot probe prunes better than bloom-only before timing runs
- reuse shared skewed fixtures across exact, bloom-only, and adaptive modes
* docs(08-02): complete adaptive serialization and CLI info plan
- Tasks completed: 3/3
- SUMMARY: .planning/phases/08-adaptive-high-cardinality-indexing/08-02-SUMMARY.md
- Shared state artifacts intentionally left untouched for orchestrator ownership
* docs(08-03): complete adaptive benchmark evidence plan
- record HCARD-05 benchmark metrics and verification evidence
- document blocker fixes required to restore the requested 424aceb baseline
- leave STATE.md and ROADMAP.md untouched per execution request
* fix(08-02): restore adaptive serialization and cli tests after wave overlap
* fix(08): close review regressions in builder, cli, and decode
* fix(08): harden decode limits and align CLI/docs
* docs(phase-08): complete phase execution
* docs(phase-08): evolve PROJECT.md after phase completion
* docs(phase-08): add security threat verification
* fix: harden adaptive path mode invariants
* fix: address post-fix review hardening
* fix: tighten round-3 adaptive hardening
* docs(phase-08): clean up comments and add godoc for adaptive surface
- expand Version constant comment with migration note and v4/v5/v6 history
- dedupe PathEntry.AdaptivePromotedTerms/AdaptiveBucketCount comments into one
grouped godoc that documents the "derived, never persisted" contract
- justify the three serialize.go max-constants (maxDecodedIndexSize,
maxHeaderRowGroups, maxHeaderDocs, maxAdaptivePaths) with the reasoning
future maintainers need to tune them
- add field godoc on AdaptiveStringIndex explaining the sorted-terms +
lossy-bucket contract
- document WithAdaptive* option semantics including the disable-via-zero
conventions on PromotedTermCap and BucketCount
- rewrite the *WithIO shim comment to match the actual asymmetric pattern
(only build/extract have wrappers; query/info are *WithIO-only)
- simplify adaptivePathSummary: the fallback to PathEntry counters is
unreachable through both Decode and Finalize, so treat missing section as
invariant-violation-return-zero instead of silently reading stale data
- add cmd/gin-index/ errcheck exclusion in .golangci.yml; the CLI uses
fmt.Fprintf to stdout/stderr for user-facing output where ignoring write
errors is standard practice
* feat(phase-08): add PathMode.IsValid and decode-side enforcement
PathMode was validated only structurally (through validatePathReferences)
after the whole directory was read. Any byte >= 3 would flow through as
an "unknown" mode until the downstream switch caught it, which risked
diverging behavior if future code paths forgot the guard.
- PathMode.IsValid() reports whether the value is one of the declared
constants
- readPathDirectory now rejects unknown mode bytes immediately with
ErrInvalidFormat, matching the fail-closed posture of the other
per-field bounds checks
- cover the new helper with a table test and the decoder with a
targeted corruption test that flips a single mode byte in a valid
encoded payload to 99
* feat(phase-08): export SetAdaptiveInvariantLogger for library consumers
Adaptive invariant violations (path flagged PathModeAdaptiveHybrid with no
matching section) were always logged to log.Default() through a package-
private variable. Host processes couldn't redirect into structured logging
pipelines without poking internals, and concurrent writes to the variable
weren't safe.
- SetAdaptiveInvariantLogger swaps the logger under sync.RWMutex; nil
silences violations entirely for embedders that prefer not to have any
stderr chatter
- currentAdaptiveInvariantLogger reads the current logger with the read
lock so Evaluate paths don't serialize on logger access
- migrate the existing TestAdaptiveInvariantViolationLogs to the exported
API and add TestSetAdaptiveInvariantLoggerNilSilences covering the nil
case
* feat(phase-08): poison builder after partial merge failure
mergeStagedPaths mutates pathData, bloom, presentRGs, and nullRGs path-by-path
in a sorted loop. A mid-loop failure leaves earlier paths merged for the
current document while later ones are untouched - subsequent AddDocument
calls would compound the corruption.
validateStagedPaths' preview catches every naturally-occurring trigger today
(mixed numeric promotion), so this is defensive: if a future code path
lands a merge-time failure that bypasses the preview, the builder now
flags poisonErr and refuses further AddDocument calls with a wrapped error
pointing at the original failure. Finalize's contract is unchanged - callers
who honor AddDocument's error never reach a corrupt Finalize regardless.
* test(phase-08): cover remaining validate arm and non-EQ adaptive probes
- add the "adaptive path missing adaptive section" row to
TestValidatePathReferencesRejectsModeMismatches. All 6 arms of the mode
switch at gin.go:587-608 now have direct unit coverage instead of the
previous 5-arm subset + 1 transitive EncodeWithLevel assertion.
- expand BenchmarkAdaptiveHighCardinality beyond EQ: add NE, IN, NIN, and
Contains probes alongside the existing hot/tail EQ probes. Each probe
ReportMetric's candidate_rgs and encoded_bytes per mode so regressions in
adaptive pruning show up as metric shifts in bench diffs. The hot-EQ
soundness guard that asserts adaptive < bloom-only is preserved
unchanged; no new strict assertions are layered over the new probes
because different operators have legitimately different pruning
semantics across modes.
* test(phase-08): add cross-mode soundness property
For the same fixture, EQ results must widen monotonically:
classic ⊆ adaptive ⊆ bloom-only. A violation means an adaptive-mode index
is strictly more aggressive than the exact index - i.e. it drops row
groups that really do match, which is unsound and silently loses data.
The existing per-mode no-false-negatives property catches missed ground-
truth matches against the fixture, but not the case where adaptive's
bucket layer under-reports compared to its own exact-index peer. 1000
generated cases over the existing cardinality threshold generator.
* refactor(cli): drop outer var err error shadows in *WithIO helpers
Each *WithIO helper declared a function-scope var err error that only
worked because one "idx, err = ..." assignment per function happened to
be in a branch without a block-local err declaration. Future edits that
added or removed err := inside inner blocks could silently read or write
the wrong err variable.
Restructure the 4 helpers so every err value lives in the smallest block
that uses it: the two branches that previously needed the outer err now
use "loaded, err := ...; idx = loaded" instead of "idx, err = ...". The
remaining idx, err = ... sites are inside blocks with their own err :=
and continue to work unchanged.
* style(benchmark): apply gci formatting to long probe func literals
* test: fix lint violations in test fixtures
* perf(builder): in-place merge for adaptive tail bucket bitmaps
Adds RGSet.UnionWith for in-place merges and uses it in
buildAdaptiveStringIndex to avoid one bitmap clone per non-promoted
term. On high-cardinality paths with many tail terms (the workload
adaptive mode targets) this removes O(tail) allocations from the
hot builder path. Addresses PR #19 review item #2.
* fix(serialize): produce deterministic byte output for all index sections
Only writeAdaptiveStringIndexes was using sortedPathIDs; every other
section writer iterated maps in non-deterministic Go order. Two
Encode() calls on the same in-memory index could therefore produce
different bytes, breaking content-addressable caching and making
encoded-byte test assertions fragile.
Fixed:
- writeStringIndexes, writeStringLengthIndexes, writeNumericIndexes,
writeNullIndexes, writeTrigramIndexes, writeHyperLogLogs now use
sortedPathIDs at the outer level
- writeTrigramIndexes now sorts its inner Trigrams map by trigram
- writeConfig now sorts transformerSpecs by path
Decode is order-tolerant so this is wire-compatible. Addresses PR #19
review item #3.
* refactor(query): tighten adaptive lookup paths and silence default logger
Three small adaptive-mode cleanups from PR #19 review:
- lookupAdaptiveStringMatch: drop the unreachable bucketID bounds
check and nil bucket guard. NewAdaptiveStringIndex (used by both
build and decode paths) already enforces a power-of-two bucket
count and non-nil bucket bitmaps, and adaptiveBucketIndex masks
with bucketCount-1, so neither guard can fire. Replace with an
invariant comment. Addresses item #1.
- evaluateNIN: when a non-string element forces non-adaptive
fallback, call evaluateINNonAdaptive(values) directly with the
already-extracted slice instead of routing through evaluateIN,
which would re-enter the adaptive path before falling back.
Addresses item #7.
- adaptiveInvariantLogger: default to nil instead of log.Default().
Per Go library convention, libraries should not write to stderr
unless the consumer opts in. Existing callers can install
log.Default() explicitly. Addresses item #6.
* docs(gin): clarify v5 history and AdaptiveBucketCount=0 disable sentinel
- Version comment now notes v5 was never released; the in-tree v5
shape was iterated before the wire format was finalised in v6
and v5 payloads are always rejected on decode. Addresses item #4.
- WithAdaptiveBucketCount godoc now explains the asymmetry between
the option (rejects 0) and the validate() path (accepts 0 as the
disable sentinel for struct-literal callers). validate() also
carries an inline comment explaining why 0 is permitted there.
Addresses item #5.1 parent 0a638a2 commit 306aa40
20 files changed
Lines changed: 4268 additions & 323 deletions
File tree
- .planning
- phases/08-adaptive-high-cardinality-indexing
- cmd/gin-index
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
39 | 39 | | |
40 | 40 | | |
41 | 41 | | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
42 | 47 | | |
43 | 48 | | |
44 | 49 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
| 24 | + | |
| 25 | + | |
24 | 26 | | |
25 | 27 | | |
26 | 28 | | |
27 | | - | |
28 | | - | |
29 | 29 | | |
30 | 30 | | |
31 | 31 | | |
| |||
41 | 41 | | |
42 | 42 | | |
43 | 43 | | |
44 | | - | |
45 | | - | |
| 44 | + | |
| 45 | + | |
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
| |||
80 | 80 | | |
81 | 81 | | |
82 | 82 | | |
83 | | - | |
| 83 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
13 | 13 | | |
14 | 14 | | |
15 | 15 | | |
16 | | - | |
| 16 | + | |
17 | 17 | | |
18 | 18 | | |
19 | 19 | | |
| |||
55 | 55 | | |
56 | 56 | | |
57 | 57 | | |
58 | | - | |
| 58 | + | |
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
| |||
88 | 88 | | |
89 | 89 | | |
90 | 90 | | |
91 | | - | |
| 91 | + | |
92 | 92 | | |
93 | 93 | | |
94 | 94 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
5 | | - | |
6 | | - | |
7 | | - | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
8 | 8 | | |
9 | 9 | | |
10 | 10 | | |
11 | | - | |
12 | | - | |
13 | | - | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
24 | | - | |
| 24 | + | |
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
30 | | - | |
| 30 | + | |
31 | 31 | | |
32 | 32 | | |
33 | 33 | | |
| |||
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
49 | | - | |
| 49 | + | |
50 | 50 | | |
51 | 51 | | |
52 | 52 | | |
| |||
75 | 75 | | |
76 | 76 | | |
77 | 77 | | |
78 | | - | |
79 | | - | |
80 | | - | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
Lines changed: 91 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
Lines changed: 93 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
0 commit comments