This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
GIN Index is a Generalized Inverted Index for JSON data, designed for row-group pruning in columnar storage (Parquet). It enables fast predicate evaluation to determine which row groups may contain matching documents.
```bash
# Build
go build ./...
# Run all tests
go test -v
# Run specific test
go test -v -run TestQueryEQ
# Run examples
go run ./examples/basic/main.go
```

- Builder (`builder.go`) - Ingests JSON documents via `AddDocument(rgID, jsonDoc)`, walks JSON structure, extracts paths/values
- Index (`gin.go`) - Final immutable index created by `Finalize()`, contains all index structures
- Query (`query.go`) - Evaluates predicates against index, returns `RGSet` bitmap of matching row groups
- Serialize (`serialize.go`) - Binary encoding with zstd compression via `Encode()`/`Decode()`
- StringIndex - Sorted terms with parallel RG bitmaps for exact match
- NumericIndex - Per-RG min/max stats for range query pruning
- NullIndex - Two bitmaps per path: null RGs and present RGs
- TrigramIndex - N-gram to RG bitmap mapping for CONTAINS queries
- GlobalBloom - Bloom filter for fast path=value rejection
- PathCardinality - HyperLogLog per path for cardinality estimation
- `RGSet` (`bitmap.go`) - Row group bitmap with Set/Intersect/Union operations
- `Predicate` - Query condition: Path + Operator + Value
- `GINConfig` - Builder configuration (bloom size, trigram settings, HLL precision)
Uses ojg/jp library. Only supports: $, $.field, $['field'], $[*]. Rejects array indices, recursive descent, slices, filters - see jsonpath.go.
EQ, NE, GT, GTE, LT, LTE, IN, NIN, IsNull, IsNotNull, Contains, Regex
The Regex operator uses trigram index for candidate row-group selection before pattern matching.
Files:
- `regex.go` - `ExtractLiterals()`, `AnalyzeRegex()`, literal extraction from regex patterns
- `query.go:289` - `evaluateRegex()` implementation
How it works:
- Parse regex using `regexp/syntax` with Perl mode
- Apply `Simplify()` (factors common prefixes: `Toyota|Tesla` → `T(oyota|esla)`)
- Extract combined literals via Cartesian product (e.g., `(error|warn)_msg` → `["error_msg", "warn_msg"]`)
- Query trigram index for each literal, union results
- Row groups not containing any literal are pruned
Key functions:
- `extractCombinedLiterals(re)` - Recursive literal extraction with Cartesian product for concatenation
- `extractConcatLiterals(subs)` - Handles `OpConcat` by building combined strings
- `hasUnboundedWildcard(re)` - Detects `.*` or `.+` patterns
Transform values into derived companion indexes via config options like WithISODateTransformer, WithToLowerTransformer, and WithCustomTransformer. Raw source values stay indexed on the original path; query a derived companion explicitly with As(alias, value). Internal companion paths are encoded as __derived:<source>#<alias> and are not part of the public query API. Use cases: date range queries, IP subnet filtering, version comparisons, case-insensitive search.
Types:
- `FieldTransformer` - `func(value any) (any, bool)` - returns transformed value and success flag; `ok=false` skips the derived companion for that document while raw indexing continues
- `RepresentationValue` - created with `As(alias, value)` to route a predicate to a derived companion index
Built-in transformers:
| Category | Transformer | Description | Example |
|---|---|---|---|
| Date | `ISODateToEpochMs` | RFC3339/ISO8601 to epoch ms | `2024-01-15T10:30:00Z` → `1705314600000` |
| Date | `DateToEpochMs` | YYYY-MM-DD to epoch ms | `2024-01-15` → `1705276800000` |
| Date | `CustomDateToEpochMs(layout)` | Custom format to epoch ms | Layout: `2006/01/02 15:04` |
| String | `ToLower` | Lowercase normalization | `Alice@Example.COM` → `alice@example.com` |
| String | `EmailDomain` | Extract domain from email | `alice@example.com` → `example.com` |
| String | `URLHost` | Extract host from URL | `https://api.example.com/v1` → `api.example.com` |
| String | `RegexExtract(pattern, group)` | Extract via regex capture | Pattern: `ERROR\[(\w+)\]:`, group 1 |
| Numeric | `RegexExtractInt(pattern, group)` | Extract + convert to float64 | `order-12345` → `12345` |
| Numeric | `IPv4ToInt` | IPv4 to uint32 for ranges | `192.168.1.1` → `3232235777` |
| Helper | `CIDRToRange(cidr)` | Parse CIDR to start/end float64 | `192.168.1.0/24` → (start, end) |
| Helper | `InSubnet(path, cidr)` | Returns `[]Predicate` using the default `ipv4_int` alias | `InSubnet("$.ip", "10.0.0.0/8")` |
| Helper | `InSubnetAs(path, alias, cidr)` | Returns `[]Predicate` for custom IPv4 companion aliases | `InSubnetAs("$.ip", "ip_num", "10.0.0.0/8")` |
| Numeric | `SemVerToInt` | Semver to int (major*1M + minor*1K + patch) | `v2.1.3` → `2001003` |
| Numeric | `DurationToMs` | Go duration to ms | `1h30m` → `5400000` |
| Numeric | `NumericBucket(size)` | Bucket values | `150` with size 100 → `100` |
| Boolean | `BoolNormalize` | Normalize boolean-like values | `"yes"`, `"1"`, `"on"` → `true` |
Files:
- `gin.go` - `FieldTransformer`, `RepresentationValue`, `As()`, config options, and derived representation metadata
- `transformers.go` - All built-in transformers
- `transformers_test.go` - Unit and integration tests
- `builder.go` - Derived companion staging in `stageCompanionRepresentations()` before `stageMaterializedValue()`
Constructors: Use functional options pattern with two-phase validation
```go
type FooOption func(*Foo) error // Options return errors

func WithBar(bar string) FooOption { // Option-level validation
	return func(f *Foo) error {
		if bar == "" { return errors.New("bar required") }
		f.bar = bar
		return nil
	}
}

func NewFoo(opts ...FooOption) (*Foo, error) {
	f := &Foo{}
	for _, opt := range opts { // Apply options, fail fast
		if err := opt(f); err != nil { return nil, err }
	}
	if err := validator.New().Struct(f); err != nil { // Struct validation
		return nil, err
	}
	return f, nil
}
```

Reference: `pkg/catalog/pg_catalog.go:86`. Note: `validator.New()` in constructors is fine; cache validators for hot paths.
Validation: Use github.com/go-playground/validator/v10 for all struct validation
- Register custom validators via `Validator.RegisterValidation()`
- Use struct tags: `validate:"required,at_least_one_host"`
Defaults: Use github.com/creasty/defaults for struct default values
- See `pkg/types/logservice_defaults.go` for examples
- Use struct tags: `default:"value"`
- Call `defaults.Set(&struct)` to apply
Error Handling: Use github.com/pkg/errors for all error creation and propagation
- `errors.New("message")` for new errors (captures stack trace)
- `errors.Errorf("format %s", val)` for formatted new errors (captures stack trace)
- `errors.Wrap(err, "context")` to wrap existing errors (captures stack trace at wrap point)
- `errors.Wrapf(err, "context %s", val)` for formatted wrap (captures stack trace at wrap point)
- DEPRECATED: `fmt.Errorf` with `%w` - migrate to `errors.Wrap`/`errors.Wrapf` (see #1670)
- Use `errors.Cause(err)` to get root cause, `errors.Is()`/`errors.As()` for comparison
- Reference: `pkg/catalog/pg_catalog.go` for usage patterns
Required targets: test, integration-test, lint, lint-fix, security-scan, clean, help
Learn more: Use tclr-makefile skill for target specifications, templates, and examples.
GIN Index — Open Source Readiness
GIN Index is a Generalized Inverted Index library for JSON data, designed for row-group pruning in columnar storage (Parquet). It enables fast predicate evaluation to determine which row groups may contain matching documents — filling a gap between full-scan and standing up a database. This project tracks the work needed to take it from a private repo to a credible public open-source release.
Core Value: A credible first impression: anyone who finds the repo can immediately understand, build, test, and contribute — with no internal artifacts leaking through.
- License: MIT — simple, permissive, compatible with all dependency licenses
- Module path: Must be `github.com/amikos-tech/ami-gin` to match the GitHub repo URL
- Go version: 1.25.5 (already current)
- No breaking API changes: Existing API surface is clean — preserve it through the OSS transition
- Go 1.25.5 - Entire codebase (library + CLI tool)
- None
- Go toolchain 1.25.5 (specified in `go.mod`)
- No `.go-version` or `.tool-versions` file detected
- Go modules (`go.mod` + `go.sum`)
- Lockfile: `go.sum` present (67KB)
- Module path: `github.com/amikos-tech/ami-gin`
- No web or application framework. This is a standalone Go library with a CLI tool.
- `testing` (stdlib) - Unit and benchmark tests
- `github.com/leanovate/gopter` v0.2.11 - Property-based testing (generators, properties, shrinking)
- `gotest.tools/gotestsum` (installed at test time via Makefile) - Test runner with JUnit output
- `go build` - Standard Go build
- `golangci-lint` - Linting (config: `.golangci.yml`)
- `make` - Build automation (`Makefile`)
- Linters enabled: `dupword`, `gocritic`, `mirror`
- Formatter: `gci` (import ordering)
- Import order: standard -> third-party -> `github.com/amikos-tech/ami-gin` -> blank -> dot
- `errcheck` suppressed for `_test.go` and `examples/`
- Timeout: 5m, concurrency: 4
- `github.com/cespare/xxhash/v2` v2.3.0 - Fast non-cryptographic hash function; used by BloomFilter (`bloom.go`) and HyperLogLog (`hyperloglog.go`) for hashing values
- `github.com/klauspost/compress` v1.18.3 - zstd compression/decompression for index serialization (`serialize.go`); supports configurable compression levels 0-19
- `github.com/RoaringBitmap/roaring/v2` v2.14.4 - Compressed bitmap data structure; underlies `RGSet` (`bitmap.go`) for row-group tracking with set operations (And/Or/AndNot)
- `github.com/ohler55/ojg` v1.28.0 - JSONPath parsing library; used in `jsonpath.go` for path validation via `ojg/jp` subpackage
- `github.com/parquet-go/parquet-go` v0.27.0 - Apache Parquet file reading/writing; used in `parquet.go` for building indexes from Parquet files and embedding indexes in Parquet metadata
- `github.com/aws/aws-sdk-go-v2` v1.41.1 + service packages - AWS S3 integration for remote Parquet file access (`s3.go`); includes `config`, `credentials`, `service/s3` subpackages
- `github.com/pkg/errors` v0.9.1 - Error wrapping with stack traces; used throughout for `errors.New()`, `errors.Wrap()`, `errors.Errorf()`
- `github.com/leanovate/gopter` v0.2.11 - Property-based testing framework; used in `generators_test.go`, `property_test.go`, `integration_property_test.go`
- `github.com/bits-and-blooms/bitset` v1.24.2 - Dependency of roaring bitmaps
- `github.com/mschoch/smat` v0.2.0 - Dependency of roaring bitmaps (state machine testing)
- `github.com/andybalholm/brotli` v1.1.1 - Compression codec for parquet-go
- `github.com/pierrec/lz4/v4` v4.1.21 - Compression codec for parquet-go
- `github.com/google/uuid` v1.6.0 - Dependency of parquet-go
- `github.com/parquet-go/bitpack` v1.0.0 - Bit-packing for parquet-go
- `github.com/parquet-go/jsonlite` v1.0.0 - JSON handling for parquet-go
- `github.com/twpayne/go-geom` v1.6.1 - Geometry types for parquet-go
- `golang.org/x/sys` v0.38.0 - System calls (dependency of compress, parquet)
- `google.golang.org/protobuf` v1.34.2 - Protocol buffers (dependency of parquet-go)
- `AWS_ENDPOINT_URL` or `AWS_S3_ENDPOINT` - Custom S3 endpoint (e.g., MinIO)
- `AWS_REGION` or `AWS_DEFAULT_REGION` - AWS region (defaults to `us-east-1`)
- `AWS_ACCESS_KEY_ID` - S3 access key
- `AWS_SECRET_ACCESS_KEY` - S3 secret key
- `AWS_S3_PATH_STYLE` - Set to `"true"` for path-style S3 access
- `GINConfig` struct in `gin.go` with functional options pattern (`ConfigOption`)
- Defaults: bloom 65536 bits / 5 hashes, trigrams enabled, HLL precision 12, prefix block 16, cardinality threshold 10000
- `go.mod` - Module definition and dependencies
- `.golangci.yml` - Linter configuration
- `Makefile` - Build targets: `build`, `test`, `lint`, `lint-fix`, `clean`, `help`
- Go 1.25.5+
- `golangci-lint` for linting
- `gotestsum` (auto-installed by `make test`)
- No CGo dependencies - pure Go
- Standalone Go binary (no runtime dependencies)
- AWS credentials needed only for S3 operations
- File system access for local Parquet/sidecar operations
- `build` - Build GIN index from Parquet file(s) (local or S3)
- `query` - Evaluate predicates against an index
- `info` - Display index metadata and path directory
- `extract` - Extract embedded index to sidecar file

- `.github/workflows/claude-code-review.yml` - Claude-based code review
- `.github/workflows/claude.yml` - Claude workflow

- `cmd/gin-index/main.go` - CLI entry point
- `examples/*/main.go` - Runnable examples (basic, full, fulltext, nested, null, parquet, range, regex, serialize, transformers, transformers-advanced)
- Single lowercase word per file: `bitmap.go`, `bloom.go`, `builder.go`, `query.go`
- Compound names use underscores: `transformer_registry.go`
- Test files co-located with matching `_test.go` suffix: `gin_test.go`, `regex_test.go`
- Specialized test files named by type: `property_test.go`, `benchmark_test.go`, `generators_test.go`, `integration_property_test.go`
- PascalCase: `GINIndex`, `GINBuilder`, `GINConfig`, `RGSet`, `BloomFilter`, `HyperLogLog`, `TrigramIndex`
- Acronyms stay uppercase: `GIN`, `RG`, `HLL`, `FTS`, `IP`
- Internal build types use camelCase: `pathBuildData`
- Type aliases use PascalCase: `DocID`, `Operator`, `FieldTransformer`, `TransformerID`
- Constructors: `NewXxx(...)` returning `(*Xxx, error)` - e.g., `NewBuilder()`, `NewBloomFilter()`, `NewRGSet()`
- Must-constructors: `MustNewXxx(...)` wrapping `NewXxx` with `panic` on error - e.g., `MustNewRGSet()`, `MustNewBloomFilter()`
- Predicate builders are top-level functions returning `Predicate`: `EQ()`, `GT()`, `IN()`, `Contains()`, `Regex()`, `IsNull()`
- Transformer functions are top-level: `ISODateToEpochMs()`, `ToLower()`, `IPv4ToInt()`, `SemVerToInt()`
- Helper functions that return closures: `CustomDateToEpochMs(layout)`, `RegexExtract(pattern, group)`, `NumericBucket(size)`
- camelCase for locals and unexported fields
- PascalCase for exported fields: `NumRGs`, `Trigrams`, `GlobalMin`
- Constants: PascalCase for exported (`MagicBytes`, `Version`), camelCase for unexported (`maxLiteralExpansion`, `maxConfigSize`)
- Iota enums: `Op`-prefixed for operators (`OpEQ`, `OpNE`, `OpGT`), `Type`-prefixed for types (`TypeString`, `TypeInt`), `Flag`-prefixed for flags (`FlagBloomOnly`, `FlagTrigramIndex`), `Transformer`-prefixed for IDs (`TransformerISODateToEpochMs`)
- Name by behavior: `DocIDCodec` (encodes/decodes DocIDs)
- Methods: `Encode()`, `Decode()`, `Name()`
- `WithISODateTransformer(path)`, `WithToLowerTransformer(path)`, `WithIPv4Transformer(path)` etc.
- These wrap `WithRegisteredTransformer()` which handles both runtime and serialization config.
- Linters enabled: `dupword`, `gocritic`, `mirror`
- `staticcheck` with all checks except `ST1000` (package comments), `ST1003` (naming), `ST1020` (exported comments)
- `errcheck` excluded for `_test.go` files and `examples/` directory
- Import ordering enforced by `gci` formatter: standard, default, then project prefix
- `bitmap.go` - RGSet type and operations
- `bloom.go` - BloomFilter type
- `builder.go` - GINBuilder (index construction)
- `gin.go` - GINIndex, GINConfig, types, options
- `query.go` - Query evaluation and predicate constructors
- `serialize.go` - Binary encoding/decoding with compression
- `regex.go` - Regex literal extraction for trigram optimization
- `transformers.go` - Built-in field transformer functions
- `transformer_registry.go` - Transformer serialization/reconstruction
- `parquet.go` - Parquet file integration
- `s3.go` - S3 storage integration
- `docid.go` - DocID codec abstraction
- `jsonpath.go` - JSONPath validation
- `hyperloglog.go` - HyperLogLog cardinality estimator
- `trigram.go` - N-gram index for CONTAINS queries
- `prefix.go` - Prefix compression for sorted strings
- Single flat Go package (`package gin`) -- all core types and logic live at the module root
- Builder pattern for index construction (mutable) producing an immutable index for querying
- Binary serialization with zstd compression for compact storage
- Multiple index structures (string, numeric, null, trigram, bloom, HLL) keyed by path ID
- Row-group-level granularity -- the index answers "which row groups MAY contain matching documents"
- Functional options pattern for configuration (`ConfigOption`, `BuilderOption`, `NGramOption`, etc.)
- Purpose: Define index build parameters and field transformers
- Location: `gin.go` (types + options), `transformers.go` (built-in transformers), `transformer_registry.go` (serializable transformer registry)
- Contains: `GINConfig`, `ConfigOption` functions (`WithCustomTransformer`, `WithISODateTransformer`, etc.), `FieldTransformer`, `As()`, and `TransformerSpec` for serialization
- Depends on: Nothing (leaf layer)
- Used by: Builder, Serialize

- Purpose: Ingest JSON documents, walk their structure, populate index data structures
- Location: `builder.go`
- Contains: `GINBuilder`, `pathBuildData`, `AddDocument()`, `walkJSON()`, `Finalize()`
- Depends on: Configuration, Data Structures (RGSet, BloomFilter, TrigramIndex, HyperLogLog)
- Used by: User code, Parquet integration, CLI

- Purpose: Hold the immutable, finalized index and its component structures
- Location: `gin.go` (types: `GINIndex`, `Header`, `PathEntry`, `StringIndex`, `NumericIndex`, `NullIndex`, `StringLengthIndex`)
- Contains: All index struct definitions, type constants, operator enum
- Depends on: Data Structures (RGSet)
- Used by: Query, Serialize, Parquet integration

- Purpose: Evaluate predicates against the index, return matching row groups
- Location: `query.go`
- Contains: `Evaluate()`, per-operator evaluation functions, predicate constructors (`EQ()`, `GT()`, `Contains()`, etc.)
- Depends on: Index Layer, Data Structures, Regex analysis
- Used by: User code, CLI

- Purpose: Binary encode/decode the full index with optional zstd compression
- Location: `serialize.go`
- Contains: `Encode()`, `EncodeWithLevel()`, `Decode()`, per-structure read/write functions
- Depends on: Index Layer, Data Structures, zstd, roaring bitmap serialization
- Used by: Parquet integration, CLI, S3 client

- Purpose: Probabilistic and bitmap data structures used across the index
- Location: `bitmap.go` (RGSet), `bloom.go` (BloomFilter), `trigram.go` (TrigramIndex), `hyperloglog.go` (HyperLogLog), `prefix.go` (PrefixCompressor)
- Contains: Self-contained data structure implementations
- Depends on: `roaring/v2` (RGSet), `xxhash/v2` (BloomFilter, HLL)
- Used by: Builder, Index, Query, Serialize

- Purpose: Read/write indexes from/to Parquet files (embedded metadata or sidecar `.gin` files)
- Location: `parquet.go`
- Contains: `BuildFromParquet()`, `WriteSidecar()`, `ReadSidecar()`, `EncodeToMetadata()`, `DecodeFromMetadata()`, `LoadIndex()`, `RebuildWithIndex()`, `ParquetIndexWriter`
- Depends on: Builder, Serialize, `parquet-go`
- Used by: CLI, S3 client

- Purpose: Read/write Parquet files and GIN indexes from AWS S3
- Location: `s3.go`
- Contains: `S3Client`, `S3Config`, `s3ReaderAt` (implements `io.ReaderAt` over S3 range requests)
- Depends on: Parquet integration, Serialize, AWS SDK v2
- Used by: CLI

- Purpose: JSONPath validation, regex literal extraction
- Location: `jsonpath.go` (path validation via `ojg/jp`), `regex.go` (regex analysis for trigram pruning), `docid.go` (document ID codec abstraction)
- Depends on: `ojg/jp`, `regexp/syntax`
- Used by: Query (regex), user code (path validation, DocID codec)

- Purpose: Command-line interface for index operations
- Location: `cmd/gin-index/main.go`
- Contains: `build`, `query`, `info`, `extract` subcommands, predicate parser
- Depends on: All other layers
- Used by: End users
- `GINBuilder` is mutable during construction, holding all intermediate data in maps
- `GINIndex` is effectively immutable after `Finalize()` -- all fields are populated and not modified
- No shared state or concurrency primitives -- single-threaded build and query

- Purpose: Bitmap representing which row groups match a condition
- File: `bitmap.go`
- Pattern: Wraps `roaring.Bitmap` with bounds checking and set operations
- Operations: `Set()`, `IsSet()`, `Intersect()`, `Union()`, `Invert()`, `Clone()`, `IsEmpty()`, `Count()`, `ToSlice()`
- Factory functions: `AllRGs(n)` (all bits set), `NoRGs(n)` (empty), `MustNewRGSet(n)`

- Purpose: Represents a single query condition
- File: `gin.go` (type), `query.go` (constructors + evaluation)
- Pattern: Simple value object with Path + Operator + Value
- Constructor functions: `EQ()`, `NE()`, `GT()`, `GTE()`, `LT()`, `LTE()`, `IN()`, `NIN()`, `IsNull()`, `IsNotNull()`, `Contains()`, `Regex()`

- Purpose: Transform field values before indexing (e.g., date strings to epoch milliseconds)
- File: `gin.go` (type definition), `transformers.go` (implementations), `transformer_registry.go` (serialization support)
- Pattern: `func(value any) (any, bool)` -- returns transformed value and success flag
- Registry pattern enables serialization: each transformer has a `TransformerID` and can be reconstructed from ID + params

- Purpose: Encode/decode composite document identifiers
- File: `docid.go`
- Pattern: Interface with `Encode(indices ...int) DocID` and `Decode(docID DocID) []int`
- Implementations: `IdentityCodec` (1:1 mapping), `RowGroupCodec` (fileIndex * rowGroupsPerFile + rgIndex)

- Purpose: Configurable index parameters with sensible defaults
- File: `gin.go`
- Pattern: Functional options returning `error` -- `type ConfigOption func(*GINConfig) error`
- `DefaultConfig()` provides production-ready defaults (bloom=65536, trigrams=enabled, HLL precision=12)
- Location: `builder.go:NewBuilder()`, `builder.go:AddDocument()`, `builder.go:Finalize()`
- Triggers: User code importing `github.com/amikos-tech/ami-gin`
- Responsibilities: Full build-query-serialize lifecycle

- Location: `cmd/gin-index/main.go`
- Triggers: `gin-index build|query|info|extract` commands
- Responsibilities: Parquet file operations (build indexes, query, inspect, extract)

- Location: `parquet.go:BuildFromParquet()`, `parquet.go:LoadIndex()`
- Triggers: When working with Parquet files directly
- Responsibilities: Read JSON column from Parquet, build index, store as sidecar or embedded metadata

- Location: `s3.go:NewS3Client()`, `s3.go:S3Client.BuildFromParquet()`
- Triggers: When files are on S3
- Responsibilities: Remote file access via range requests, S3 sidecar management
- All constructors return `(*T, error)` -- validation happens at construction time
- `Must*` variants (e.g., `MustNewRGSet`, `MustNewBloomFilter`) panic on error -- used only in builder internals where errors indicate programming bugs
- Errors are wrapped with context at each layer: `errors.Wrap(err, "create bloom filter")`
- `errors.Errorf()` for new errors with formatting
- No sentinel errors -- all errors are string-based
- The query layer does NOT return errors -- unknown paths or unsupported operations return `AllRGs()` (safe fallback = no pruning)
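The constructor convention above (a validating `NewXxx` plus a panicking `MustNewXxx`) can be sketched like this. The `rgSet` type and both functions are hypothetical stand-ins, and the sketch uses the standard `errors` package only to stay self-contained, whereas the repository uses `github.com/pkg/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// rgSet is a hypothetical stand-in for the library's RGSet.
type rgSet struct {
	numRGs int
}

// newRGSet validates its input at construction time and returns an error
// instead of producing an unusable value.
func newRGSet(numRGs int) (*rgSet, error) {
	if numRGs <= 0 {
		return nil, errors.New("numRGs must be greater than 0")
	}
	return &rgSet{numRGs: numRGs}, nil
}

// mustNewRGSet wraps newRGSet and panics on error. It is meant for call
// sites where a failure can only be a programming bug.
func mustNewRGSet(numRGs int) *rgSet {
	s, err := newRGSet(numRGs)
	if err != nil {
		panic(err)
	}
	return s
}

func main() {
	if _, err := newRGSet(0); err != nil {
		fmt.Println("rejected:", err)
	}
	fmt.Println("row groups:", mustNewRGSet(4).numRGs)
}
```

Keeping the panicking variant as a thin wrapper means the validation rules live in one place, so the two constructors cannot drift apart.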
- JSONPath validation via `jsonpath.go:ValidateJSONPath()` using `ojg/jp` parser
- Config validation at construction time via functional options returning errors
- Input validation in constructors (`numRGs > 0`, `precision 4-16`, etc.)
- Regex compile timeout (100ms) in transformer registry to prevent ReDoS
Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.
Use these entry points:
- `/gsd:quick` for small fixes, doc updates, and ad-hoc tasks
- `/gsd:debug` for investigation and bug fixing
- `/gsd:execute-phase` for planned phase work
Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.
Profile not yet configured. Run `/gsd:profile-user` to generate your developer profile. This section is managed by `generate-claude-profile` -- do not edit manually.