Context
Currently, shaha deduplicates by hash during build: if two different preimages produce the same hash, only one preimage is stored (with merged sources). This makes collision detection impossible.
This came out of a discussion about a Bitcoin security research portfolio: shaha could detect hash collisions if it stored all preimages, enabling queries like:
SELECT hash, COUNT(*), ARRAY_AGG(preimage)
FROM hashes
GROUP BY hash
HAVING COUNT(*) > 1
Problem
// Current behavior (build.rs deduplication)
hash("hello") = abc123 → save
hash("world") = abc123 → SKIP (duplicate), only merge sources
Because the second preimage is skipped, any real collision is silently lost during build.
Proposed Solution
Schema Change (Option A - recommended)
Change from single preimage to list:
Current:
| hash | preimage | algorithm | sources |
|------|----------|-----------|---------|
| Binary | Utf8 | Utf8 | List<Utf8> |
Proposed:
| hash | preimages | algorithm | sources |
|------|-----------|-----------|---------|
| Binary | List<Utf8> | Utf8 | List<Utf8> |
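A rough sketch of what the build step could look like under Option A, to make the schema change concrete. Everything here is an assumption for illustration: `HashRow`, `build_rows`, and `toy_hash` are hypothetical names, not shaha's actual code, and `toy_hash` is a deliberately weak 1-byte stand-in so that collisions are easy to trigger. A real build would call the configured digest instead.

```rust
use std::collections::BTreeMap;

/// One row in the proposed Option A layout (hypothetical struct, not shaha's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct HashRow {
    pub hash: Vec<u8>,
    pub preimages: Vec<String>,
    pub algorithm: String,
    pub sources: Vec<String>,
}

/// Toy 1-byte "hash" (sum of bytes mod 256). NOT a real digest; it only
/// exists so that collisions such as "ab" vs "ba" are easy to produce.
pub fn toy_hash(preimage: &str) -> Vec<u8> {
    let sum = preimage.bytes().fold(0u8, |acc, b| acc.wrapping_add(b));
    vec![sum]
}

/// Merge (preimage, source) pairs into one row per hash, keeping every
/// distinct preimage instead of skipping the second one.
pub fn build_rows(inputs: &[(&str, &str)], algorithm: &str) -> Vec<HashRow> {
    let mut rows: BTreeMap<Vec<u8>, HashRow> = BTreeMap::new();
    for &(preimage, source) in inputs {
        let hash = toy_hash(preimage);
        let row = rows.entry(hash.clone()).or_insert_with(|| HashRow {
            hash,
            preimages: Vec::new(),
            algorithm: algorithm.to_string(),
            sources: Vec::new(),
        });
        // Same preimage seen again (e.g. from another wordlist): not a collision.
        if !row.preimages.iter().any(|p| p == preimage) {
            row.preimages.push(preimage.to_string());
        }
        if !row.sources.iter().any(|s| s == source) {
            row.sources.push(source.to_string());
        }
    }
    rows.into_values().collect()
}
```

With this shape, a collision is simply any row whose `preimages` list has more than one entry. Note that `sources` stays row-level as in the current schema, so it no longer says which source contributed which preimage; if that matters, the list would need to hold (preimage, source) pairs instead.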
Alternative (Option B - simpler)
Remove deduplication, allow multiple rows per hash:
- Simpler implementation
- Larger storage footprint
- Query with GROUP BY hash HAVING COUNT(*) > 1
New CLI Command
shaha collisions [--algorithm sha256] [--min-count 2] [--limit 100]
# Output
Found 3 collisions:
HASH: a1b2c3d4... (sha256)
- "hello" (rockyou.txt)
- "world123" (custom.txt)
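A minimal sketch of the grouping logic the command could run, assuming rows arrive as (hash, preimage) pairs as in Option B. `Collision`, `find_collisions`, and `to_hex` are hypothetical names for illustration; `min_count` and `limit` mirror the proposed `--min-count` and `--limit` flags.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One reported collision: a hash with at least `min_count` distinct preimages.
#[derive(Debug, Clone, PartialEq)]
pub struct Collision {
    pub hash: Vec<u8>,
    pub preimages: Vec<String>,
}

/// Lowercase hex rendering of a hash, for the `HASH: ...` output line.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Group rows by hash, keep groups with at least `min_count` DISTINCT
/// preimages, and return at most `limit` of them (ordered by hash).
pub fn find_collisions(
    rows: &[(Vec<u8>, String)],
    min_count: usize,
    limit: usize,
) -> Vec<Collision> {
    let mut groups: BTreeMap<&[u8], BTreeSet<&str>> = BTreeMap::new();
    for (hash, preimage) in rows {
        groups
            .entry(hash.as_slice())
            .or_default()
            .insert(preimage.as_str());
    }
    groups
        .into_iter()
        .filter(|(_, preimages)| preimages.len() >= min_count)
        .take(limit)
        .map(|(hash, preimages)| Collision {
            hash: hash.to_vec(),
            preimages: preimages.into_iter().map(String::from).collect(),
        })
        .collect()
}
```

The sketch counts distinct preimages rather than rows. If deduplication is removed entirely, the same preimage appearing in two wordlists would otherwise show up as a false collision, so the SQL equivalent would probably want COUNT(DISTINCT preimage) rather than COUNT(*).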
Use Case
Peter Todd's hash collision bounties (SHA256, RIPEMD160, HASH160, HASH256): ~0.59 BTC, unclaimed since 2013. With collision detection, shaha becomes a tool for:
- Building large preimage databases
- Detecting accidental collisions during build
- Targeted collision search campaigns
Breaking Change
This requires a schema migration. Existing databases would need to be rebuilt.
Related
- boha: hash_collision collection tracks Peter Todd's bounties
- Birthday attack cost: about 2^80 hash evaluations for RIPEMD160, 2^128 for SHA256
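For a sense of scale, here is a small sketch of the birthday arithmetic behind those numbers. It uses the standard approximation p ≈ 1 - exp(-n^2 / 2^(bits+1)) for n uniformly random digests of `bits` bits; the function names are made up for illustration, and the model assumes an ideal hash with no structural weakness.

```rust
/// log2 of the approximate number of hash evaluations needed for a generic
/// birthday collision search on a `bits`-bit digest (about 2^(bits/2)).
pub fn birthday_work_log2(bits: u32) -> u32 {
    bits / 2
}

/// Approximate probability of at least one collision among 2^log2_n
/// uniformly random `bits`-bit digests: p ≈ 1 - exp(-n^2 / 2^(bits+1)).
/// Works in log2 space so that huge n and bits do not overflow.
pub fn collision_probability(log2_n: f64, bits: u32) -> f64 {
    let exponent = 2.0 * log2_n - (bits as f64 + 1.0);
    let x = exponent.exp2();
    // 1 - exp(-x), computed via exp_m1 to stay accurate for tiny x.
    -(-x).exp_m1()
}
```

Under this model, 2^80 preimages against a 160-bit digest give roughly a 39% chance of a collision, while a database of 2^40 preimages against SHA256 gives a probability far below 1e-50. So accidental collisions during an ordinary build are not a realistic expectation; the feature matters mainly for targeted search campaigns and for truncated or weaker digests.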