Efficiently query large encoded binary files without storing decoded versions. Achieve sub-2-second searches on 100GB files.
No installation needed! Just Python 3.7+:
```
# Optional: Install brotli support
pip install brotli
```

- Import a file:

```
./nixindex.py --import --file mydata.bin --encoding base64
```

- Search for a term:

```
./nixindex.py --search --term restaurant
```

That's it!
- Fast searches: < 2 seconds regardless of source file size
- Multiple encodings: base64, gzip, hex, brotli, ROT13, and more
- Space efficient: Database typically 5% of decoded file size
- No external dependencies: All decoding done inline
- Acuity filtering: Remove low-frequency tokens for better performance (see the sketch below)
- Position-based extraction: Direct record access via stored positions
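The acuity filter is simply a frequency cutoff: tokens that appear fewer times than the threshold are dropped before they reach the index. A minimal sketch of the idea, assuming a token-to-positions mapping (the function name and data layout here are illustrative, not the tool's actual internals):

```python
def apply_acuity(token_positions, acuity=5):
    """Keep only tokens seen at least `acuity` times (default matches --acuity 5).

    token_positions: dict mapping token -> list of record positions.
    Dropping rare tokens keeps the database small and lookups fast.
    """
    return {token: positions
            for token, positions in token_positions.items()
            if len(positions) >= acuity}
```

The encodings that can be decoded during import are listed next.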
- Compression: gzip, bz2, zlib, brotli, zip, tar
- Encoding: base64, ascii85, hex
- Ciphers: ROT-N (e.g., ROT13), Caesar cipher
- Legacy: uuencode, xxencode
- Raw: none (for pre-decoded data)
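Most of these formats are covered by the Python standard library, which is what makes a dependency-free decoder possible. A rough sketch of how such a decode step could be dispatched (illustrative only: the real tool works on streamed chunks and must handle padding and records that cross chunk boundaries; zip, tar, uuencode, and xxencode are omitted here for brevity):

```python
import base64, binascii, bz2, codecs, gzip, zlib

def decode(data: bytes, encoding: str) -> bytes:
    """Decode raw bytes according to the chosen encoding name."""
    if encoding == "none":
        return data
    if encoding == "base64":
        return base64.b64decode(data)
    if encoding == "ascii85":
        return base64.a85decode(data)
    if encoding == "hex":
        # assumes a single hex run without embedded whitespace
        return binascii.unhexlify(data.strip())
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "bz2":
        return bz2.decompress(data)
    if encoding == "zlib":
        return zlib.decompress(data)
    if encoding == "rot13":  # ROT-N with N=13
        return codecs.decode(data.decode("ascii"), "rot13").encode("ascii")
    if encoding == "brotli":
        import brotli  # optional third-party package
        return brotli.decompress(data)
    raise ValueError(f"unsupported encoding: {encoding}")
```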
```
./nixindex.py --import --file <path> [options]
```

Options:

- `--encoding <type>`: Encoding to decode (default: none)
- `--separator <sep>`: Record separator (default: `\n`)
- `--chunk <size>`: Chunk size (e.g., 64, 1KB, 10MB)
- `--acuity <n>`: Minimum token count (default: 5)
- `--db <path>`: Database file (default: nixindex.db)
Examples:
```
# Import base64-encoded file
./nixindex.py --import --file data.b64 --encoding base64

# Import with custom separator
./nixindex.py --import --file data.txt --separator '\n\n'

# Import with acuity filter
./nixindex.py --import --file data.json --acuity 10

# Import from stdin
cat data.gz | ./nixindex.py --import --stdin --encoding gzip
```

Search usage:

```
./nixindex.py --search --term <word>
```

Options:

- `--term <word>`: Word to search for (required)
- `--file <path>`: Source file (if different from import)
- `--db <path>`: Database file (default: nixindex.db)
Examples:
```
# Basic search
./nixindex.py --search --term restaurant

# Search with custom database
./nixindex.py --search --term cafe --db mydata.db
```

Generate usage:

```
./nixindex.py --generate --encoding <type> [options]
```

Options:

- `--encoding <type>`: Encoding to apply (required)
- `--url <url>`: URL to download (optional)
- `--target-size <size>`: Target file size (default: 100GB)
- `--output <path>`: Output file path (default: temp file)
Examples:
```
# Generate from Yelp dataset
./nixindex.py --generate \
  --url https://business.yelp.com/external-assets/files/Yelp-JSON.zip \
  --encoding base64 \
  --target-size 10GB \
  --output testdata.bin

# Generate random data
./nixindex.py --generate --encoding gzip --target-size 1GB
```
- Import Phase:
  - Read the file in chunks
  - Decode using the specified encoding
  - Split into records by separator
  - Extract tokens (alphanumeric words)
  - Store in an indexed SQLite database
- Search Phase:
  - Look up the token in the database (fast B-tree index)
  - Get record positions
  - Read and decode the source file
  - Extract matching records
  - Display results
- Why It's Fast (see the sketch after this list):
  - Token lookup: O(log n) via indexes
  - No full file scan needed
  - Only matching records processed
  - Efficient position-based extraction
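A minimal sketch of this flow, assuming a simple token table in SQLite (the schema, function names, and the use of a pre-decoded file are assumptions for illustration; nixIndex itself re-decodes the encoded source at search time rather than keeping a decoded copy):

```python
import re
import sqlite3

TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def build_index(db_path, records):
    """records: iterable of (pos, size, text) tuples, one per decoded record."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS tokens (token TEXT, pos INTEGER, size INTEGER)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_token ON tokens (token)")  # B-tree index -> O(log n) lookup
    for pos, size, text in records:
        for token in set(t.lower() for t in TOKEN_RE.findall(text)):
            con.execute("INSERT INTO tokens VALUES (?, ?, ?)", (token, pos, size))
    con.commit()
    return con

def search(con, term, decoded_path):
    """Fetch the stored positions for `term` and seek straight to the matching records."""
    rows = con.execute("SELECT pos, size FROM tokens WHERE token = ?",
                       (term.lower(),)).fetchall()
    with open(decoded_path, "rb") as f:
        for pos, size in rows:
            f.seek(pos)                      # no full scan: jump to the record
            yield f.read(size).decode("utf-8", errors="replace")
```

The index on `token` is what keeps lookups at O(log n), and the stored position and length are what let a search skip straight to the matching records instead of scanning the whole file.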
| File Size | Import Time | Search Time | Database Size |
|---|---|---|---|
| 1 GB | 1-3 min | < 0.5s | 50-150 MB |
| 10 GB | 5-15 min | < 1s | 500 MB-1.5 GB |
| 100 GB | 30-60 min | < 2s | 2-5 GB |
Times vary based on encoding type and CPU speed
- Use acuity filtering: Remove rare tokens for smaller database
- Choose efficient encoding: gzip > base64 > hex for size
- Increase chunk size: Larger chunks = fewer I/O operations
- Use appropriate separator: Match your data structure
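The chunk-size tip above matters because records can straddle chunk boundaries: whatever trails the last separator in one chunk has to be carried into the next read. A sketch of that pattern, assuming newline-style separators (illustrative, not the tool's actual reader):

```python
def iter_records(path, separator=b"\n", chunk_size=1024 * 1024):
    """Yield (byte_offset, record_bytes) pairs while reading the file in chunks."""
    carry = b""    # partial record left over from the previous chunk
    offset = 0     # byte offset of the first record held in `carry`
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # larger chunks mean fewer read() calls
            if not chunk:
                break
            carry += chunk
            parts = carry.split(separator)
            carry = parts.pop()          # last piece may be incomplete; keep it
            for record in parts:
                yield offset, record
                offset += len(record) + len(separator)
    if carry:                            # trailing record with no final separator
        yield offset, carry
```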
```
# Import JSON logs with newline separation
./nixindex.py --import --file logs.json --separator '\n'

# Search for error events
./nixindex.py --search --term error

# Search for specific service
./nixindex.py --search --term authentication
```

```
# Import gzipped data
./nixindex.py --import --file archive.gz --encoding gzip

# Search for terms
./nixindex.py --search --term database
```

```
# Import base64-encoded emails
./nixindex.py --import --file emails.b64 --encoding base64 --separator '-----'

# Search for sender
./nixindex.py --search --term john
```

Run the comprehensive test suite:

```
./tests/test_nixindex.py
```

Tests include:
- All encoding formats
- Database operations
- Full import/search workflow
- Yelp dataset download and test
- Performance verification (< 2s target)
- Check token spelling (search is case-insensitive)
- Verify the data was imported correctly
- Check whether the acuity filter removed the term (reduce `--acuity`)
- Run `--import` before `--search`
- Check that the database file path matches
- Increase the `--chunk` size (e.g., `--chunk 10MB`)
- Use a faster encoding (avoid hex for large files)
- Reduce `--acuity` to filter fewer tokens
- Reduce the chunk size
- Process files in smaller batches
- Increase system swap space
```
nixIndex/
├── nixindex.py      # Main program
├── nixindex.db      # Database (created after import)
├── src/             # Source modules
├── tests/           # Test suite
├── logs/            # Log files
└── backup/          # Code backups
```
```
# Use custom database
./nixindex.py --import --file data.bin --db /path/to/my.db
./nixindex.py --search --term word --db /path/to/my.db

# Decode and import in a pipeline
cat encoded.b64 | base64 -d | ./nixindex.py --import --stdin

# Keep separate databases for different datasets
./nixindex.py --import --file dataset1.bin --db data1.db
./nixindex.py --import --file dataset2.bin --db data2.db
./nixindex.py --search --term foo --db data1.db
./nixindex.py --search --term bar --db data2.db
```

- Python 3.7 or higher
- Standard library modules only
- Optional: `brotli` for Brotli compression
Built for the nixCraft Challenge.
- `README.md` - This file (user guide)
- `WARP.md` - Technical architecture and implementation
- `ARCHITECTURE.md` - Original specification
For technical details, see WARP.md
For architecture, see ARCHITECTURE.md