nixIndex - Fast Binary File Query System

Efficiently query large encoded binary files without storing decoded versions. Achieve sub-2-second searches on 100GB files.

Quick Start

Installation

No installation needed! Just Python 3.7+:

# Optional: Install brotli support
pip install brotli

Basic Usage

  1. Import a file:
./nixindex.py --import --file mydata.bin --encoding base64
  2. Search for a term:
./nixindex.py --search --term restaurant

That's it!

Features

  • ✨ Fast searches: < 2 seconds regardless of source file size
  • πŸ” Multiple encodings: base64, gzip, hex, brotli, ROT13, and more
  • πŸ’Ύ Space efficient: Database typically 5% of decoded file size
  • πŸš€ No required dependencies: All decoding uses the standard library (brotli is optional)
  • πŸ“Š Acuity filtering: Remove low-frequency tokens for better performance
  • πŸ” Position-based extraction: Direct record access via stored positions

Supported Encodings

  • Compression: gzip, bz2, zlib, brotli, zip, tar
  • Encoding: base64, ascii85, hex
  • Ciphers: ROT-N (e.g., ROT13), Caesar cipher
  • Legacy: uuencode, xxencode
  • Raw: none (for pre-decoded data)
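
As a concrete illustration of the decoder layer, the sketch below maps encoding names to standard-library decoders (plus the optional brotli package). The dispatch table and the rot_n helper are hypothetical, not nixindex's actual internals:

# Illustrative sketch only -- not nixindex's real decoder table.
import base64, binascii, bz2, gzip, zlib

def rot_n(data: bytes, n: int = 13) -> bytes:
    # Shift alphabetic bytes by n positions (ROT-N / Caesar).
    out = bytearray()
    for b in data:
        if 65 <= b <= 90:                      # A-Z
            out.append((b - 65 + n) % 26 + 65)
        elif 97 <= b <= 122:                   # a-z
            out.append((b - 97 + n) % 26 + 97)
        else:
            out.append(b)
    return bytes(out)

DECODERS = {
    "none":    lambda d: d,
    "base64":  base64.b64decode,
    "ascii85": base64.a85decode,
    "hex":     binascii.unhexlify,
    "gzip":    gzip.decompress,
    "bz2":     bz2.decompress,
    "zlib":    zlib.decompress,
    "rot13":   lambda d: rot_n(d, 13),
}

try:
    import brotli                              # optional dependency
    DECODERS["brotli"] = brotli.decompress
except ImportError:
    pass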

Command Reference

Import Command

./nixindex.py --import --file <path> [options]

Options:

  • --encoding <type>: Encoding to decode (default: none)
  • --separator <sep>: Record separator (default: \n)
  • --chunk <size>: Chunk size (e.g., 64, 1KB, 10MB; parsing sketched below)
  • --acuity <n>: Minimum token count (default: 5)
  • --db <path>: Database file (default: nixindex.db)

Examples:

# Import base64-encoded file
./nixindex.py --import --file data.b64 --encoding base64

# Import with custom separator
./nixindex.py --import --file data.txt --separator '\n\n'

# Import with acuity filter
./nixindex.py --import --file data.json --acuity 10

# Import from stdin
cat data.gz | ./nixindex.py --import --stdin --encoding gzip
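
The --chunk flag accepts human-readable sizes such as 64, 1KB, or 10MB. A parser for that format might look like the following; parse_size is a hypothetical helper, not part of nixindex's CLI:

# Hypothetical helper: turn "64", "1KB", "10MB" into a byte count.
import re

def parse_size(text: str) -> int:
    units = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
    m = re.fullmatch(r"(\d+)\s*([A-Za-z]*)", text.strip())
    if not m or m.group(2).upper() not in units:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(m.group(1)) * units[m.group(2).upper()]

assert parse_size("10MB") == 10 * 1024 * 1024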

Search Command

./nixindex.py --search --term <word>

Options:

  • --term <word>: Word to search for (required)
  • --file <path>: Source file (if different from import)
  • --db <path>: Database file (default: nixindex.db)

Examples:

# Basic search
./nixindex.py --search --term restaurant

# Search with custom database
./nixindex.py --search --term cafe --db mydata.db

Generate Command

./nixindex.py --generate --encoding <type> [options]

Options:

  • --encoding <type>: Encoding to apply (required)
  • --url <url>: URL to download (optional)
  • --target-size <size>: Target file size (default: 100GB)
  • --output <path>: Output file path (default: temp file)

Examples:

# Generate from Yelp dataset
./nixindex.py --generate \
  --url https://business.yelp.com/external-assets/files/Yelp-JSON.zip \
  --encoding base64 \
  --target-size 10GB \
  --output testdata.bin

# Generate random data
./nixindex.py --generate --encoding gzip --target-size 1GB
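
Conceptually, --generate takes seed data (downloaded or random), encodes it, and appends copies until the output reaches the target size. A rough sketch of that idea, assuming gzip encoding and random seed data (not nixindex's implementation):

# Rough sketch of the idea behind --generate.
import gzip, os

def generate(seed: bytes, target_size: int, path: str) -> None:
    # Append encoded copies of the seed until the file reaches target_size.
    with open(path, "wb") as out:
        while out.tell() < target_size:
            out.write(gzip.compress(seed))

generate(os.urandom(1 << 20), 10 * 1024 * 1024, "testdata.bin")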

How It Works

  1. Import Phase:

    • Read file in chunks
    • Decode using specified encoding
    • Split into records by separator
    • Extract tokens (alphanumeric words)
    • Store in indexed SQLite database
  2. Search Phase:

    • Look up token in database (fast B-tree index)
    • Get record positions
    • Read and decode source file
    • Extract matching records
    • Display results
  3. Why It's Fast:

    • Token lookup: O(log n) via indexes
    • No full file scan needed
    • Only matching records processed
    • Efficient position-based extraction (see the end-to-end sketch below)
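
A compact way to see both phases is the standard-library sketch below. The schema (token, byte position, record length) and function names are assumptions about the approach, not nixindex's actual code, and for brevity search() extracts from an in-memory copy instead of re-reading and decoding the source file:

# Illustrative sketch of the token index; not nixindex's real schema.
import re, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tokens (token TEXT, pos INTEGER, len INTEGER)")
con.execute("CREATE INDEX idx_token ON tokens (token)")   # B-tree index

def import_data(decoded: bytes, separator: bytes = b"\n") -> None:
    pos, rows = 0, []
    for record in decoded.split(separator):
        for token in set(re.findall(rb"[A-Za-z0-9]+", record)):
            rows.append((token.decode().lower(), pos, len(record)))
        pos += len(record) + len(separator)
    con.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)

def search(term: str, decoded: bytes) -> list:
    # Index lookup, then direct extraction by stored positions.
    cur = con.execute("SELECT pos, len FROM tokens WHERE token = ?",
                      (term.lower(),))
    return [decoded[p:p + n] for p, n in cur]

data = b"alpha beta\ngamma alpha\n"
import_data(data)
print(search("alpha", data))   # [b'alpha beta', b'gamma alpha']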

Performance

Expected Timings

| File Size | Import Time | Search Time | Database Size |
|-----------|-------------|-------------|---------------|
| 1 GB      | 1-3 min     | < 0.5s      | 50-150 MB     |
| 10 GB     | 5-15 min    | < 1s        | 500 MB-1.5 GB |
| 100 GB    | 30-60 min   | < 2s        | 2-5 GB        |

Times vary based on encoding type and CPU speed

Optimization Tips

  1. Use acuity filtering: Remove rare tokens for a smaller database (see the sketch below)
  2. Choose an efficient encoding: For encoded size, gzip beats base64, which beats hex
  3. Increase chunk size: Larger chunks = fewer I/O operations
  4. Use an appropriate separator: Match your data structure
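
The acuity filter from tip 1 amounts to dropping tokens whose total count falls below the threshold before they reach the database. A minimal sketch of that idea (names are illustrative):

# Minimal sketch of acuity filtering.
from collections import Counter

def acuity_filter(tokens, acuity=5):
    # Keep only tokens that occur at least `acuity` times.
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= acuity]

words = ["cafe"] * 6 + ["restaurant"] * 2
print(acuity_filter(words, acuity=5))   # only "cafe" survives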

Examples

Example 1: JSON Log Files

# Import JSON logs with newline separation
./nixindex.py --import --file logs.json --separator '\n'

# Search for error events
./nixindex.py --search --term error

# Search for specific service
./nixindex.py --search --term authentication

Example 2: Encoded Archives

# Import gzipped data
./nixindex.py --import --file archive.gz --encoding gzip

# Search for terms
./nixindex.py --search --term database

Example 3: Base64 Email Data

# Import base64-encoded emails
./nixindex.py --import --file emails.b64 --encoding base64 --separator '-----'

# Search for sender
./nixindex.py --search --term john

Testing

Run the comprehensive test suite:

./tests/test_nixindex.py

Tests include:

  • All encoding formats
  • Database operations
  • Full import/search workflow
  • Yelp dataset download and test
  • Performance verification (< 2s target)

Troubleshooting

"No results found"

  • Check token spelling (case-insensitive)
  • Verify data was imported correctly
  • Check if acuity filter removed term (reduce --acuity)

"Database is empty"

  • Run --import before --search
  • Check database file path matches

Slow imports

  • Increase --chunk size (e.g., --chunk 10MB)
  • Use faster encoding (avoid hex for large files)
  • Increase --acuity so fewer tokens are indexed

Out of memory

  • Reduce chunk size
  • Process files in smaller batches
  • Increase system swap space

File Locations

nixIndex/
β”œβ”€β”€ nixindex.py          # Main program
β”œβ”€β”€ nixindex.db          # Database (created after import)
β”œβ”€β”€ src/                 # Source modules
β”œβ”€β”€ tests/               # Test suite
β”œβ”€β”€ logs/                # Log files
└── backup/              # Code backups

Advanced Usage

Custom Database Location

# Use custom database
./nixindex.py --import --file data.bin --db /path/to/my.db
./nixindex.py --search --term word --db /path/to/my.db

Pipeline Processing

# Decode and import in pipeline
cat encoded.b64 | base64 -d | ./nixindex.py --import --stdin

Multiple Databases

# Keep separate databases for different datasets
./nixindex.py --import --file dataset1.bin --db data1.db
./nixindex.py --import --file dataset2.bin --db data2.db

./nixindex.py --search --term foo --db data1.db
./nixindex.py --search --term bar --db data2.db

Requirements

  • Python 3.7 or higher
  • Standard library modules only
  • Optional: brotli for Brotli compression

License

Built for the nixCraft Challenge.

Documentation

  • README.md - This file (user guide)
  • WARP.md - Technical architecture and implementation
  • ARCHITECTURE.md - Original specification

Support

  • For technical details, see WARP.md
  • For architecture, see ARCHITECTURE.md
