Open work items for this repo. Cross-cutting tracking lives in
../workspace/crossrepostatus.md;
items here are BAF-specific or are this repo's slice of a
cross-cutting initiative.
Implementation of BLOOM / HASHSET / TRUNCATED_LONG_64 is DONE
(see "Done" history below). The remaining open items in the
pluggable-persistence design plan:
- Add open-addressing primitive hash table backend (fastutil `Long2LongOpenHashMap` or hand-rolled over `long[]`) — would offer O(1) average vs. the sorted array's O(log n) per lookup with similar cache profile. Deferred because `TRUNCATED_LONG_64` is already the fastest backend in practice and the marginal speedup would not change the recommended default.
- Add standalone `BloomFilterPersistence` (Bloom-only / probabilistic mode without a backing `AddressLookup`). The current `BloomFilterAccelerator` returns `requiresBackend() == true`; a pure-Bloom variant would return `false` and accept `getAmount` semantics of "unsupported / always `Coin.ZERO`". Not yet needed by any caller; ship only when someone asks for it.
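For orientation, a hand-rolled open-addressing table over `long[]` could look roughly like the sketch below. It is not taken from the repo: the class name, the 50 % load limit and the mixing step are assumptions chosen for illustration, and it stores keys only (a presence set), not amounts.

```java
/** Hypothetical sketch: linear-probing presence set of 64-bit keys in one flat long[]. */
final class LongOpenAddressingSet {
    private static final long EMPTY = 0L;   // empty-slot marker; the key 0 is tracked separately
    private final long[] slots;
    private final int mask;
    private int size;
    private boolean containsZero;

    LongOpenAddressingSet(int expectedSize) {
        // Next power of two >= 2 * expectedSize, so the load factor stays at or below 50 %.
        int wanted = Math.max(2, expectedSize) * 2;
        int capacity = Integer.highestOneBit(wanted - 1) << 1;
        slots = new long[capacity];
        mask = capacity - 1;
    }

    private static int mix(long key) {
        // One multiply/xor-shift round so that clustered keys spread over the table.
        key ^= key >>> 33;
        key *= 0xff51afd7ed558ccdL;
        key ^= key >>> 33;
        return (int) key;
    }

    void add(long key) {
        if (key == EMPTY) { containsZero = true; return; }
        int i = mix(key) & mask;
        while (slots[i] != EMPTY) {
            if (slots[i] == key) return;      // already present
            i = (i + 1) & mask;               // linear probe to the next slot
        }
        if ((size + 1) * 2 > slots.length) {
            throw new IllegalStateException("table would exceed 50 % load");
        }
        slots[i] = key;
        size++;
    }

    boolean contains(long key) {
        if (key == EMPTY) return containsZero;
        int i = mix(key) & mask;
        while (slots[i] != EMPTY) {           // an empty slot ends the probe chain
            if (slots[i] == key) return true;
            i = (i + 1) & mask;
        }
        return false;
    }
}
```

The probe chain stays inside one contiguous array, which is where the "similar cache profile" expectation comes from; whether it actually beats the sorted array would still need a JMH run.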
Status: idea / not yet investigated to implementation depth. Goal: a newcomer runs BitcoinAddressFinder with nothing to download but the jar (and Java) — no hand-copied config, no
logbackConfiguration.xml, no multi-GB LMDB download before the first run. Three independently shippable phases; each needs its own design pass before coding. Captured here so the idea isn't lost.
- Idea: package `examples/config_*.json` (9 files today) into the jar and let the CLI accept a bare name that resolves to the local file if it exists, else to the jar/classpath copy (no silent shadowing of an edited local file).
- Investigate first:
  - `cli/Main.java` is filesystem-only today: `main()` → `loadConfiguration(Path)` (line ~177) → `readString` (`Files.readString`, line ~110) → `fromJson` / `fromYaml` (extension picks the parser, `.json`/`.js` vs `.yaml`/`.yml`). Refactor so a `(String content, String nameForExtension)` path shares the parser branch; keep `loadConfiguration(Path)` intact (used by `MainTest`).
  - The repo already loads classpath resources cleanly — mirror, don't invent: `getResourceAsStream` (`secret/BIP39Wordlist.java`) and Guava `Resources.getResource` (`opencl/OpenCLContext.java`).
  - Decide canonical home: move configs to `src/main/resources/config/` (packaged at `/config/...`), or copy at build time. `pom.xml` already copies `src/main/resources` into both the normal and `assembly` fat jar.
  - Guard test: add/extend a `ConfigFixturesParseTest` that loads each bundled config from the classpath and asserts it deserialises to a `CConfiguration` with the expected `command` / `finder` shape — this is what stops the shipped configs rotting.
  - REUSE/CI: `REUSE.toml` currently lists `examples/config_*.json`. Moving them means updating those annotation paths; every committed file needs SPDX metadata.
- Tests to run: `mvn test -Dtest='ConfigFixturesParseTest,MainTest'`.
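The "local file first, else classpath" resolution could be sketched as follows. This is only an illustration under assumptions: the class name `BundledConfigResolver` and the `config/` resource prefix are hypothetical, and the real refactor would hand the returned text to the existing `fromJson` / `fromYaml` branch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper; class name and resource prefix are assumptions, not repo code. */
final class BundledConfigResolver {
    private static final String RESOURCE_PREFIX = "config/";

    /** Returns the config text for a bare name: local file first, then the bundled copy. */
    static Optional<String> resolve(Path workingDir, String name) throws IOException {
        Path local = workingDir.resolve(name);
        if (Files.isRegularFile(local)) {
            // An edited local file always wins, so it is never silently shadowed.
            return Optional.of(Files.readString(local));
        }
        ClassLoader loader = BundledConfigResolver.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(RESOURCE_PREFIX + name)) {
            if (in == null) {
                return Optional.empty();      // neither a local file nor a bundled resource
            }
            return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Returning the content together with the original name keeps the extension-based parser choice unchanged for both sources.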
- Idea: embed a tiny (~1000-address) demo LMDB so a first `Find` run works end-to-end with no DB download. Decisions already taken: demo DB is generated during the build (no binary committed to git), and it activates as an auto-fallback when the configured `lmdbDirectory` is missing — plus a loud WARN every time it is used (this tool scans for real balances; it must never silently use the tiny set).
- Investigate first:
  - LMDB is memory-mapped and cannot be opened from inside a jar: `persistence/lmdb/LMDBPersistence.initReadOnly()` opens `new File(lmdbDirectory)` with `MDB_RDONLY_ENV | MDB_NOLOCK` → only `data.mdb` must be extracted to a temp dir (no `lock.mdb` needed). Need a new resource→temp extractor (Guava `Resources` + `Files.copy`, `deleteOnExit`) — none exists today.
  - Runtime seam for the fallback: `consumer/ConsumerJava.initLMDB()` (line ~301-303, `new LMDBPersistence(cfg, persistenceUtils); lmdb.init();`). Branch: if `lmdbDirectory` is empty / has no `data.mdb`, extract the embedded demo DB to temp, WARN, open read-only; else open the configured dir exactly as today. Build a read-only `CLMDBConfigurationReadOnly` for the temp dir.
  - Build a DB from keys: `command/AddressFilesToLMDB` writes hash160→coin via `putNewAmount(...)`; test path `LMDBBase.createAndFillAndOpenLMDB` → `TestAddressesLMDB.createTestLMDB`. Derivation via `model/PublicKeyBytes` + `Hash160`. A demo DB built from private keys `1..N` yields real hits when an incremental producer scans from `1`.
  - Build wiring: bind `exec-maven-plugin` (already present for JMH) to `process-classes` (after main compile, before `test`) to run a `demo/DemoLmdbGenerator` `main` writing into `target/classes/demo/lmdb/` → on the test classpath and packaged into the jar. `exec:java` runs in-process so it inherits `.mvn/jvm.config`'s lmdbjava `--add-opens`. Build outputs under `target/` are not REUSE-checked → a build-time demo DB needs no REUSE entry.
  - NullAway is error-level — annotate every new `@Nullable` / `@NonNull`.
  - Smoke test idea: with no `lmdb/` present, run a tiny incremental-from-`1` `Find` config and assert the demo fallback activates and a hit is produced; add `config_Find_Demo.json`.
- Tests to run: `mvn test -Dtest='DemoDatabaseSmokeTest,ConsumerJavaTest'`, then `mvn package -P assembly` followed by `unzip -l target/*-jar-with-dependencies.jar | grep demo/lmdb` to confirm packaging.
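A resource→temp extractor along the lines described above might look like this. It is a minimal sketch, not existing repo code: the class name is hypothetical, and it takes a plain `InputStream` (for example from `getResourceAsStream("demo/lmdb/data.mdb")`, an assumed resource path) instead of Guava `Resources` so the example has no dependencies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical extractor: copies an embedded data.mdb into a fresh temp directory. */
final class DemoLmdbExtractor {
    static Path extractTo(InputStream source, String tempPrefix) throws IOException {
        Path dir = Files.createTempDirectory(tempPrefix);
        Path dataFile = dir.resolve("data.mdb");   // read-only + no-lock open needs no lock.mdb
        Files.copy(source, dataFile, StandardCopyOption.REPLACE_EXISTING);
        // deleteOnExit deletes in reverse registration order: register the directory first
        // so that the file is removed before its (then empty) parent directory.
        dir.toFile().deleteOnExit();
        dataFile.toFile().deleteOnExit();
        return dir;
    }
}
```

The returned directory is what a read-only LMDB configuration for the fallback would point at; the WARN log belongs at the call site, so it fires on every use.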
- Idea: with configs + demo DB in the jar, launchers need only deliver the jar and a JDK 21; first run needs no other download.
- Investigate first:
  - `examples/baf.sh` / `baf.ps1` / `baf.cmd`: local jar first → else download the pinned fat jar from Maven Central + verify SHA-256 → enforce Java 21 → run with the two required runtime `--add-opens` (`java.base/java.nio`, `java.base/sun.nio.ch`) and `-Dlogback.configurationFile=...`.
  - JBang quickstart (`quickstart.sh` / `.ps1` + `jbang-catalog.json`): JBang brings JDK 21 and runs the GAV `net.ladenthin:bitcoinaddressfinder:<version>` directly.
  - Version-pin upkeep: pin the current release (now 1.6.0) + its fat-jar SHA-256 in `baf.*`, the GAV in catalog/quickstart, and the README; add a release version-bump checklist. Central vs GitHub-Release bytes are separate uploads — pin the checksum of whichever mirror the script fetches.
  - Verification: `shellcheck examples/*.sh`; PSScriptAnalyzer on `.ps1`; fresh-dir run downloads the jar once then runs offline.
- Sandbox caveat: this environment may block external installs (JBang/Central/light-DB download) and the GPU/OpenCL path — the demo (CPU incremental + tiny LMDB) is fully exercisable offline; live fetches must be validated in CI or on a dev box.
- Optional niceties (not required): a `lmdbDirectory: "classpath:demo"` sentinel to force the demo even when a real dir exists; a `--extract-config <name>` to drop a bundled config to disk for editing.
- Snapshot resolution (timestamped metadata) is out — Release + pinned checksum only.
All three GPU-acceleration TODOs below connect to a single architectural choice: where does the address-presence filter live?
| | Part 1 — CPU-side filter (current) | Part 2 — GPU-side filter (goal) |
|---|---|---|
| Filter lives in | CPU RAM | GPU VRAM |
| Filter checked by | CPU, after kernel returns | GPU, inside the kernel |
| What crosses PCIe per batch | every hash160 (all N work-items × 104 B) | only candidates that passed the filter (~0.4 % with Fuse-8) |
| CPU-side lookup cost | ~25–108 ns per candidate (depends on backend) | ~0 — only LMDB verification on the tiny trickle |
| GPU VRAM consumed by filter | 0 | ~172 MB (Light DB) / ~1.8 GB (Full DB) with Fuse-8 |
| Kernel complexity | unchanged | +3 hash calls + 3 array reads + 1 XOR per work-item |
| GPU needed for filter | no — CPU-only configs work identically | yes — filter must be uploaded to VRAM at startup |
| When it wins | always (no extra GPU work) | when PCIe bandwidth or CPU filter throughput is the bottleneck |
| When it regresses | never | when GPU was already at ~100 % ALU occupancy on ECC |
How "copy only on hit" works in the GPU-side filter (Part 2): There is no per-result "copy" flag. Instead the kernel uses an atomic counter + compact output buffer:

```c
// Kernel arguments:
__global volatile uint* hit_count;    // initialised to 0 before kernel launch
__global uchar*         hit_results;  // pre-allocated for MAX_HITS entries

// Inside each work-item, after computing hash160 and running the Fuse-8 check:
if (fuse_hit) {
    uint idx = atomic_add(hit_count, 1u);    // claim the next output slot
    if (idx < MAX_HITS) {
        for (uint b = 0; b < 20; b++) {       // copy the 20-byte hash160 into slot idx
            hit_results[idx * 20 + b] = my_hash160[b];
        }
    }
}
// Work-items that did NOT hit write nothing — they have no slot in hit_results.
```

After the kernel finishes:

1. CPU reads `hit_count` (4 bytes PCIe transfer) → learns K hits (clamped to `MAX_HITS`, since the counter keeps incrementing even when the buffer is full).
2. CPU reads only the first K entries from `hit_results` (K × 20 bytes).
3. CPU calls LMDB for those K candidates. Everything else was silently discarded on the GPU.
At 0.4 % FPR with a batch of 2048 work-items: K ≈ 8 hits per batch. PCIe transfer shrinks from 2048 × 104 B ≈ 213 KB to 4 B + 8 × 20 B = 164 B — a >1000× reduction in bus traffic. The CPU never sees or processes the ~2040 non-hit results.
The current "flags byte per work-item" approach described in the GPU-side-filter TODO below is a softer version that keeps all N results on the PCIe bus but lets the CPU skip LMDB for non-flagged entries. The compact-buffer approach above is stricter: non-hits never cross the bus at all. The compact-buffer design requires changing `OpenCLGridResult` from a fixed-stride layout (offset = work_item_index × CHUNK_SIZE) to a variable-length layout (offset = atomic_slot × 20), which is a larger refactor but removes all non-hit traffic from the PCIe bus.
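The atomic-slot compaction and the resulting transfer size can be simulated on the CPU. The sketch below is an illustration only: a parallel stream stands in for the work-items and an `AtomicInteger` for `atomic_add`; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

/** Hypothetical CPU simulation of the "claim a slot only on hit" output scheme. */
final class CompactHitBufferSketch {
    static final int HASH160_BYTES = 20;

    /** Every hitting "work-item" claims one slot; returns K, clamped to maxHits. */
    static int compact(byte[][] hash160s, boolean[] hit, byte[] hitResults, int maxHits) {
        AtomicInteger hitCount = new AtomicInteger(0);
        IntStream.range(0, hash160s.length).parallel().forEach(i -> {
            if (hit[i]) {
                int idx = hitCount.getAndIncrement();   // plays the role of atomic_add(hit_count, 1u)
                if (idx < maxHits) {
                    System.arraycopy(hash160s[i], 0, hitResults, idx * HASH160_BYTES, HASH160_BYTES);
                }
            }
        });
        // The counter may run past maxHits, so the reader clamps it.
        return Math.min(hitCount.get(), maxHits);
    }

    /** Bytes read back by the host: the 4-byte counter plus K compact entries. */
    static long compactTransferBytes(int k) {
        return 4L + (long) k * HASH160_BYTES;
    }
}
```

For the batch quoted above, `compactTransferBytes(8)` gives 164 bytes against 2048 × 104 = 212 992 bytes for the full-stride layout. Slot order is nondeterministic, which is why a compact layout that must recover the secret key also needs the work-item index per entry.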
Critical constraint — vanity scanning always requires all results on the CPU side.
The existing `ConsumerJava.processBatch()` runs two independent checks per work-item:

1. `containsAddress(hash160)` — the database filter (fast, CPU-side, subject to Part 2 optimization).
2. `vanityPattern.matcher(base58Address).matches()` — a Java regex against the base58-encoded address (`enableVanity = true`).

The vanity check runs on every work-item unconditionally — it has nothing to do with whether the address is in the database. A result could be a vanity match but a database miss, or a database hit but not a vanity match. Therefore:
- With `enableVanity = false` (pure database scanning): the compact-output-buffer approach works perfectly — non-hits never need to reach the CPU.
- With `enableVanity = true` (vanity scanning): all results must still cross PCIe, because the GPU has no base58 encoder or regex engine. The compact-buffer optimization cannot be applied to the vanity path.

This means the Part 2 GPU-side filter and the compact-output-buffer approach apply only to the enableVanity = false configuration. For vanity scanning the result struct stays full-width and every work-item still crosses the bus. A future vanity-on-GPU implementation would need an OpenCL base58 encoder + pattern matcher — tractable for simple prefix patterns (1Abc...), impractical for arbitrary Java regex.
- Pre-compute the `HASHSET`-backend lookup hash on the GPU. Targets the `HASHSET` backend (`AddressLookupBackend.HASHSET` → `persistence/inmemory/HashSetAddressPresence.java`), which today wraps each derived hash160 in a thread-local `ByteBuffer` (`ConsumerJava.java:367-371`) and then calls `Set<ByteBuffer>.contains(...)` (`HashSetAddressPresence.java:74-78`). The cost this TODO targets inside `contains(...)` is recomputing `ByteBuffer.hashCode()` per candidate — for a 20-byte hash160 this is 20 multiply-adds (`h = 31*h + b`) plus the `HashMap` spread (`h ^ (h >>> 16)`). The same arithmetic can be computed once on the GPU, returned alongside the hash160, and consumed CPU-side without re-hashing. Per README §"Lookup latency" the HASHSET path is ~85 ns/op; the JDK hash + spread is ~20–25 ns of that, so the headroom is ~25 % of the HASHSET lookup time per candidate.

  - Extend the kernel output struct. Today the kernel writes the layout described in `PublicKeyBytes.java:240-242` (X, Y, hash160 uncompressed, hash160 compressed = 104 B/work-item). Add a 4-byte `int hashCodeUncompressed` and a 4-byte `int hashCodeCompressed` field per work-item (112 B/work-item, +7.7 % per-candidate PCIe bandwidth). Reuse the existing `CHUNK_SIZE_*` offset machinery in `OpenCLGridResult.java:118-122` to lay the fields out without churn.
  - Reproduce `java.nio.HeapByteBuffer.hashCode()` byte-for-byte in OpenCL C. OpenJDK's implementation for a heap buffer with position 0 and limit 20 is:

    ```java
    int h = 1;
    for (int i = 19; i >= 0; i--)
        h = 31 * h + (int) (byte) get(i);
    return h;
    ```

    Two correctness traps: (a) the cast is `(int)(byte)` — the byte is sign-extended (e.g. `0xFF` ⇒ `-1`, not `255`); (b) the loop runs back-to-front (last byte first). A JMH benchmark must verify byte-equality against `ByteBuffer.wrap(hash160).hashCode()` over a randomised corpus before the GPU value is trusted in a `Set.contains` path.
  - Add a new persistence implementation that accepts a precomputed hash. `HashSet<ByteBuffer>.contains(o)` unconditionally calls `o.hashCode()` — there is no JDK hook to pass in an external hash. So the optimization requires bypassing `java.util.HashSet` entirely. Add `HashSetPrecomputedHashAddressPresence` next to `HashSetAddressPresence` (`persistence/inmemory/`) with a custom open-addressing hash table keyed by the precomputed int hash (collisions resolved by `Arrays.equals(byte[], byte[])` against the stored hash160). Expose a new API `boolean containsAddress(byte[] hash160, int precomputedHash)` on `AddressPresence` (or a sibling interface) so `ConsumerJava` can forward the GPU-precomputed value without rewrapping in a `ByteBuffer`. Document in the class Javadoc that the int hash is reproduced from a frozen OpenJDK formula and a future JDK change would silently corrupt lookups — pin a JMH equality test that fails the build if the JDK ever drifts.
  - Wire the configuration toggle. Add `HASHSET_GPU_HASH` to `AddressLookupBackend` (preserve `HASHSET` as the JDK-`HashSet<ByteBuffer>` path) so both implementations live side-by-side and the JMH harness can A/B them on the same workload. Default stays `TRUNCATED_LONG_64` per the README recommendation; this is opt-in for HASHSET deployments only.
  - Cost breakdown of one `Set<ByteBuffer>.contains(buf)` call:

    | Step inside `contains(...)` | Approximate cost (warm L3 table) | GPU pre-compute helps? |
    |---|---|---|
    | `ByteBuffer.hashCode()` — 20 sign-extending multiply-adds (`h = 31*h + (int)(byte)b`) | ~20 ns — long dependency chain, ILP-limited | ✅ eliminated |
    | `HashMap.spread(h)` — `h ^ (h >>> 16)` | ~1 ns | ✅ eliminated |
    | Bucket index `(n-1) & h` + array load `tab[i]` | ~5–80 ns (cache-state-dependent) | ❌ no — pre-computed hash doesn't fix cache miss |
    | Walk node chain (low load factor ⇒ usually 1 step) | ~3–10 ns | ❌ no |
    | `ByteBuffer.equals(other)` — content compare on hit (or first chain node) | ~15–25 ns (another byte loop) | ❌ no |
    | Total per call (warm table) | ~50–85 ns, matches the README's "85 ns HASHSET" | ~25 % is the hash chain |

    At Full-DB scale the bucket-array load becomes an L3/DRAM miss (50–100 ns), and total per-call cost rises to 130–180 ns; the hash chain's fraction of total time drops but its absolute cost (~20 ns) stays constant.
  - Throughput math when 32 cores are saturated on `.contains()` (the real-world scenario). `ConsumerJava` issues two `.contains()` calls per candidate (compressed + uncompressed hash160), so the per-candidate CPU cost is 2 × 85 ns = 170 ns on the warm-table path.

    | Configuration | Per-candidate CPU time | Throughput on 32 saturated cores |
    |---|---|---|
    | Without GPU hash (today) | 170 ns | ~188 M candidates/sec |
    | With GPU pre-computed hash | 130 ns | ~246 M candidates/sec |
    | Delta | −40 ns (~23 % faster per candidate) | +~30 % throughput, or equivalently ~7 of 32 cores freed |

    That is not a marginal improvement when cores are saturated. Under CPU-bound saturation the +7.7 % PCIe-bandwidth cost is also not a real concern.
  - Realistic ceiling — where this TODO sits relative to bigger wins.

    | Optimization | Expected throughput gain when 32 cores are saturated on `.contains()` |
    |---|---|
    | GPU pre-computed hash (this TODO) | ~30 % — frees ~7 of 32 cores; ship-worthy on its own |
    | Pack hash160 into `(long, long, int)` and key the table on `long` (i.e. use the existing `TRUNCATED_LONG_64` approach, not `HashSet<ByteBuffer>`) | ~2-3× — eliminates both the hash loop and the 20-byte equality byte loop. Already implemented as a separate backend; the cheapest "fix" is to stop using HASHSET. |
    | GPU-side presence check (the "Push the `TRUNCATED_LONG_64` presence check into OpenCL" TODO below) | ~10–100× on the CPU lookup step in isolation, but end-to-end pipeline gain is GPU-headroom-dependent — the kernel grows by 256 phases of cooperative tile loading plus per-phase binary searches; if the GPU is already near saturation on ECC the kernel slows enough that net throughput can regress. See the throughput-trade-off sub-bullet under that TODO for the measurement plan. |
    | Batched lookups with software prefetch (issue 8 candidate hashes, `__builtin_prefetch` their bucket addresses, then check) | ~2× on cold tables; smaller on warm. Orthogonal to GPU-hash precompute. |

    Honest read: if `.contains()` saturation is the bottleneck today, this TODO is worth shipping for the 30 % it gives; but for the same investigation cycle it's worth measuring whether simply switching the active backend from HASHSET to TRUNCATED_LONG_64 (2-3×) or doing the GPU-presence-check work (10-100×) gives more and supersedes the need for this TODO at all.
  - What pre-computed hash does not help. Cache misses on the bucket array at scale (the table is 8× L3 at Light DB and out-of-cache entirely at Full DB); `ByteBuffer.equals(other)` byte compare on the matched node (~15-25 ns); GC pressure if `ConsumerJava.java:367-371`'s "thread-local reusable ByteBuffer" turns out to allocate per call rather than reuse (verify before benchmarking — at 188 M ops/sec a per-call `ByteBuffer.wrap()` would be ~9 GB/sec of allocation pressure).
  - What needs to be designed first (before any kernel changes): the canonical reference of `HeapByteBuffer.hashCode()` semantics that the JMH guard will pin against (capture the bytecode of `java.nio.HeapByteBuffer#hashCode` for the running JDK and assert it matches a known-good copy at build time, so a JDK upgrade can't silently corrupt the GPU formula); whether `ConsumerJava` carries the precomputed hash through `AbstractProducer` / `AbstractKeyProducerQueueBuffered` as a parallel `int[]` / `IntBuffer` next to the existing hash160 buffers, or extends the per-candidate result struct in place; whether the `HashSet_PrecomputedHash` map should fall back to JDK `HashSet<ByteBuffer>` semantics on CPU-only paths (e.g. `ProducerJava` producers that don't go through OpenCL) by computing the same hash on the CPU side using the same reference formula — yes, for consistency across producers.
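The equality guard for the hash formula can be prototyped in plain Java before any kernel work. The sketch below re-implements the back-to-front, sign-extending loop for a `byte[]` and is meant to be compared against `ByteBuffer.wrap(hash160).hashCode()`; the class and method names are hypothetical.

```java
/** Hypothetical reference re-implementation of the hash the GPU would have to reproduce. */
final class Hash160HashCodeSketch {
    /** Same result as ByteBuffer.wrap(hash160).hashCode() for a buffer at position 0. */
    static int byteBufferHashCode(byte[] hash160) {
        int h = 1;
        for (int i = hash160.length - 1; i >= 0; i--) {   // back-to-front: last byte first
            h = 31 * h + hash160[i];                      // byte widens with sign: 0xFF adds -1
        }
        return h;
    }

    /** The HashMap bit spreader applied on top of hashCode(). */
    static int spread(int h) {
        return h ^ (h >>> 16);
    }
}
```

A randomised comparison against the JDK, as the test attached to this sketch does, is the cheap version of the "fail the build if the JDK drifts" guard.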
- Implement `BinaryFuse8AddressPresence` and `BinaryFuse16AddressPresence` — CPU-side Binary Fuse Filters for hash160 lookups. ✅ DONE (`c603963`, `627b696`). Before this work, no XOR / fuse filter backend existed in the codebase. The JMH benchmark (`AddressLookupBenchmark`) already covers the four existing backends via `@Param({"LMDB_ONLY","BLOOM","HASHSET","TRUNCATED_LONG_64"})`; adding two more entries is the whole benchmark change. This is the right first step before the GPU-side filter — both variants are purely Java, prove the FPR/memory trade-off with real JMH numbers, and their construction logic (`populateFrom(LMDB)`) produces the same flat array the GPU will upload. Implementing both 8-bit and 16-bit now is deliberate: each serves a distinct use case (VRAM-constrained vs precision-required), and both must exist in the benchmark to let JMH pick the winner.

  What is a Binary Fuse Filter? A static probabilistic membership filter (no inserts after construction). Lookup: 3 array reads + XOR of fingerprints — branchless, no division. Memory: ~1.14 bytes/entry (8-bit) or ~2.28 bytes/entry (16-bit). No false negatives by construction — every stored key is always found. FPR ≈ 1/256 ≈ 0.4 % (8-bit) or 1/65536 ≈ 0.0015 % (16-bit). This is exactly the no-FN / FP-ok contract the GPU filter requires. Reference: "Binary Fuse Filters: Fast and Smaller Than Xor Filters" (Graf & Lemire, 2022).

  Memory comparison for this project:

  | Backend | Bytes/entry | Light DB (~132 M) | Full DB (~1.4 B) | FPR |
  |---|---|---|---|---|
  | `HASHSET` | ~80 | ~10.5 GB | ~112 GB | exact |
  | `TRUNCATED_LONG_64` | 8 | ~1.1 GB | ~11 GB | ~7.5 × 10⁻¹¹ |
  | `BINARY_FUSE_8` (new) | 1.14 | ~150 MB | ~1.6 GB | ~0.4 % |
  | `BINARY_FUSE_16` (new) | 2.28 | ~300 MB | ~3.2 GB | ~0.0015 % |

  For the Full DB: `BINARY_FUSE_8` fits in the RAM of any modern workstation; `TRUNCATED_LONG_64` requires ~11 GB. The 0.4 % FPR means ~2 M false-positive LMDB verifications per 500 M candidates — at 108 ns each, and at a rate of 500 M candidates per second, that is ~216 ms/s of LMDB overhead, small compared to the key-derivation cost. `BINARY_FUSE_16` trades ~2× more RAM for a 256× lower FPR (1/65536 vs 1/256), which matters when LMDB I/O is the bottleneck.

  OpenCL portability design. Both variants must use a hash function that translates directly to OpenCL C without 128-bit arithmetic. The chosen kernel:
  ```java
  // Java (identical logic to the OpenCL kernel below)
  long h  = murmur64(key ^ seed);                        // Murmur3 finalizer (2 multiplies + 3 xor-shifts)
  byte fp = (byte) (h ^ (h >>> 32));                     // 8-bit fingerprint
  int  h0 = reduce((int) h, seg);                        // reduce(x,m) = (uint)(x * (ulong)m >>> 32)
  // Rotate rather than shift: (int)(h >> 42) would keep only 22 significant bits,
  // which starves the 32-bit multiply-shift reduce and collapses h2 onto a few slots.
  int  h1 = reduce((int) Long.rotateLeft(h, 21), seg) + seg;
  int  h2 = reduce((int) Long.rotateLeft(h, 42), seg) + 2 * seg;
  return (table[h0] ^ table[h1] ^ table[h2]) == fp;
  ```

  The same lines translate directly to OpenCL C with `ulong` / `uchar` / `ushort` (the built-in `rotate` replaces `Long.rotateLeft`). The construction algorithm (Java-only) generates the `table[]` array that the GPU will receive as a `__global uchar[]` or `__global ushort[]` buffer.

  Implementation plan — all steps are purely Java, no native code:
  - Add `BinaryFuse8AddressPresence` and `BinaryFuse16AddressPresence` to `persistence/inmemory/`. Implement each algorithm inline (~250 lines each); do NOT add an external library dependency (BAF's `dependencyConvergence` + `bannedDependencies` enforcement makes transitive deps expensive, and the algorithm is compact enough to own). Both classes implement `AddressPresence` and mirror the static-factory pattern of `TruncatedLong64SortedArrayPresence`. The only difference between the two implementations is the fingerprint array type (`byte[]` vs `short[]`) and the comparison width.

    Construction uses the standard iterative-peeling XOR-filter algorithm: populate count and XOR-accumulator arrays, extract singleton positions into a peeling queue, record the reverse topological order, then walk back assigning fingerprints.
  - Add `BINARY_FUSE_8` and `BINARY_FUSE_16` to the `AddressLookupBackend` enum. Add Javadoc entries describing bytes/entry, FPR, no-FN guarantee, LMDB closed after population.
  - Wire into `ConsumerJava` dispatch. Add `case BINARY_FUSE_8` and `case BINARY_FUSE_16` in the same switch that handles `TRUNCATED_LONG_64`; no other `ConsumerJava` changes needed.
  - Add to `AddressLookupBenchmark`. Add `"BINARY_FUSE_8"` and `"BINARY_FUSE_16"` to the `@Param({…})` list and the matching `case` arms in `buildLookup()`.
  - Unit tests + mutation coverage.
    - `BinaryFuse8AddressPresenceTest` and `BinaryFuse16AddressPresenceTest`: populate from a small fixed set; verify every member returns `true` (no-FN); verify FPR over a large random miss set is within expected bounds; verify `requiresBackend() == false`; verify buffer is not mutated by `containsAddress`; verify wrong-length buffer returns false.
    - Add both classes to the PIT `<targetClasses>` list once 100 % mutation coverage is reached.

  After this TODO is done: the GPU-side filter (see next TODO) reuses the same `murmur64` / `reduce` / fingerprint formula verbatim in OpenCL C. The Java and GPU lookup paths are verifiable against each other with identical test inputs.
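The peeling construction and the three-read lookup can be tried out in a few dozen lines. The sketch below is an assumption-laden illustration, not the repo's `BinaryFuse8AddressPresence`: it uses the simpler three-block XOR-filter layout (not the segmented binary fuse layout), sizes the table with the classic `1.23 n + 32` rule, and derives the second and third slot with 64-bit rotations by 21 and 42, because a plain right shift by 42 would leave only 22 significant bits for the 32-bit `reduce`. All names are hypothetical.

```java
import java.util.ArrayDeque;

/** Hypothetical three-block XOR-filter sketch with 8-bit fingerprints. */
final class XorFilter8Sketch {
    static long murmur64(long h) {                       // MurmurHash3 64-bit finalizer
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Maps an unsigned 32-bit value onto [0, m) with a multiply-shift, no division. */
    static int reduce(int x, int m) {
        return (int) (((x & 0xFFFFFFFFL) * (m & 0xFFFFFFFFL)) >>> 32);
    }

    /** Slot of hash h in block i (0, 1 or 2); each block has seg slots. */
    static int slot(long h, int i, int seg) {
        return reduce((int) Long.rotateLeft(h, 21 * i), seg) + i * seg;
    }

    static byte fingerprint(long h) {
        return (byte) (h ^ (h >>> 32));
    }

    static boolean contains(byte[] table, long key, long seed, int seg) {
        long h = murmur64(key ^ seed);
        int x = table[slot(h, 0, seg)] ^ table[slot(h, 1, seg)] ^ table[slot(h, 2, seg)];
        return (byte) x == fingerprint(h);
    }

    /** Peeling construction; returns null if this seed does not peel (caller retries). */
    static byte[] build(long[] keys, long seed, int seg) {
        int slots = 3 * seg;
        int[] count = new int[slots];
        long[] xorOfHashes = new long[slots];
        for (long key : keys) {
            long h = murmur64(key ^ seed);
            for (int i = 0; i < 3; i++) {
                int p = slot(h, i, seg);
                count[p]++;
                xorOfHashes[p] ^= h;
            }
        }
        ArrayDeque<Integer> singles = new ArrayDeque<>();
        for (int p = 0; p < slots; p++) if (count[p] == 1) singles.add(p);
        int[] order = new int[keys.length];
        long[] orderHash = new long[keys.length];
        int peeled = 0;
        while (!singles.isEmpty()) {
            int p = singles.poll();
            if (count[p] != 1) continue;                 // stale: its key was peeled elsewhere
            long h = xorOfHashes[p];                     // the one key hash left at slot p
            order[peeled] = p;
            orderHash[peeled] = h;
            peeled++;
            for (int i = 0; i < 3; i++) {                // remove the key from all three slots
                int q = slot(h, i, seg);
                count[q]--;
                xorOfHashes[q] ^= h;
                if (count[q] == 1) singles.add(q);
            }
        }
        if (peeled != keys.length) return null;
        byte[] table = new byte[slots];
        for (int k = peeled - 1; k >= 0; k--) {          // reverse peel order
            long h = orderHash[k];
            int x = fingerprint(h) ^ table[slot(h, 0, seg)] ^ table[slot(h, 1, seg)] ^ table[slot(h, 2, seg)];
            table[order[k]] = (byte) x;                  // slot order[k] is still 0 at this point
        }
        return table;
    }
}
```

A caller would loop over seeds until `build` returns a table, then check that every member is found and that the miss rate over random non-members stays near 1/256.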
- GPU-side Binary Fuse 8 filter — Part 2 implementation plan (atomic steps). Uses the CPU-side `BinaryFuse8AddressPresence` (✅ done) uploaded to GPU VRAM so the kernel checks the filter inline and transmits only hits over PCIe. Controlled by two new config flags: `enableGpuFilter` (default `false`) and `transferAll` (default `false`; forced `true` automatically when `ConsumerJava.enableVanity = true`, since vanity scanning requires all results on the CPU). Each of the 9 steps below is independently committable and must compile + pass existing tests before the next step begins.

  Unified output buffer format (ONE physical layout, two write modes). There is a single physical output layout, used unchanged by every kernel launch — there is not a separate full-transfer stride. This is deliberate: two different strides (104 vs 108) would make the destination-buffer sizing bound (`MAXIMUM_CHUNK_ELEMENTS` / `BIT_COUNT_FOR_MAX_CHUNKS_ARRAY`) depend on which format is active. With one stride the bound is computed off a single true entry size and the buffer is always `OUTPUT_HEADER_SIZE_BYTES + N × OUTPUT_ENTRY_SIZE_BYTES`, which is also exactly compact mode's worst case (every candidate is a filter hit), so one allocation safely covers both modes.

  Layout:

  - Byte 0: a 4-byte unsigned count word.
  - Byte `4 + j × 108`: entry `j`, always `[work_item_index:4][X:32][Y:32][hash160_u:20][hash160_c:20]` (108 bytes). The `work_item_index` is present in every entry. It is redundant in full-transfer mode (entry `i` is written at slot `i`), but carrying it keeps the stride identical to compact mode. The +4 bytes/entry costs ≈ 3.8 % PCIe in full-transfer mode only — the rare vanity/regex path, where the CPU regex dominates and the GPU is not the bottleneck.

  The count word selects how the reader walks the entries:

  - Count = `0xFFFFFFFF` (`OUTPUT_COUNT_FULL_TRANSFER_SENTINEL`): full-transfer mode. Every work-item is present; entry `i` is written densely at slot `i` with `work_item_index = i` (no atomics). The reader walks exactly `workSize` entries. Used when the GPU filter is disabled, or when `transfer_all` is forced (vanity/regex scanning needs every derived address on the CPU).
  - Count = K (any other value): compact mode. Only the work-items whose hash160 passed the GPU Binary Fuse 8 filter wrote an entry, each claiming its slot via `atomic_add`, so the K entries appear in nondeterministic order and `work_item_index` is essential. The reader walks exactly K entries. K cannot collide with the sentinel: the grid is capped at `2^BIT_COUNT_FOR_MAX_CHUNKS_ARRAY` (= 2²⁴) work-items, far below `0xFFFFFFFF`, so even an all-hit batch (K == `workSize`) stays under it.
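The sizing arithmetic behind the single-stride layout can be written down directly. The sketch below is illustrative only: the constant names follow the ones used in this plan, while the class and helper methods are hypothetical.

```java
/** Hypothetical sizing helper for the unified 4-byte-header + 108-byte-entry layout. */
final class UnifiedOutputLayoutSketch {
    static final int OUTPUT_HEADER_SIZE_BYTES = 4;                   // the count word
    static final int OUTPUT_ENTRY_SIZE_BYTES = 4 + 32 + 32 + 20 + 20; // index, X, Y, two hash160s
    static final int BIT_COUNT_FOR_MAX_CHUNKS_ARRAY = 24;

    /** Byte offset of entry j; identical in full-transfer and compact mode. */
    static int entryOffset(int j) {
        return OUTPUT_HEADER_SIZE_BYTES + j * OUTPUT_ENTRY_SIZE_BYTES;
    }

    /** Allocation that covers both modes: header plus one entry per work-item. */
    static long bufferSizeBytes(int workSize) {
        return OUTPUT_HEADER_SIZE_BYTES + (long) workSize * OUTPUT_ENTRY_SIZE_BYTES;
    }

    /** Largest entry count whose buffer still fits a Java int-indexed ByteBuffer. */
    static int maximumChunkElements() {
        return (Integer.MAX_VALUE - OUTPUT_HEADER_SIZE_BYTES) / OUTPUT_ENTRY_SIZE_BYTES;
    }
}
```

With these numbers `maximumChunkElements()` is 19 884 107, which is above the 2²⁴ = 16 777 216 grid cap, so the worst-case buffer (1 811 939 332 bytes) still fits below `Integer.MAX_VALUE`.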
  Both modes share the entry parser; they differ only in the loop bound (`workSize` vs K) and in how the count is produced (constant sentinel vs atomic counter). CPU reconstructs `secret = KeyUtility.calculateSecretKey(secretBase, work_item_index, USE_OR)` for every entry in both modes. PCIe saving at Fuse8 FPR 0.4 % (compact mode): ≈ 99.6 % bandwidth reduction. The mode flag `transfer_all` is a uniform kernel argument → zero branch divergence from mode selection; only the `if (hit)` branch inside compact mode may diverge (≈ 0.4 % of work-items enter it).

  Step A — Layout constants ✅ (`constants/OpenClKernelConstants.java`) Added `OUTPUT_HEADER_SIZE_BYTES = 4`, `OUTPUT_COUNT_FULL_TRANSFER_SENTINEL = 0xFFFF_FFFF`, the unified-entry offsets (`OUTPUT_ENTRY_INDEX_BYTE_OFFSET = 0`, `_X_BYTE_OFFSET = 4`, `_Y_BYTE_OFFSET = 36`, `_HASH160_UNCOMPRESSED_BYTE_OFFSET = 68`, `_HASH160_COMPRESSED_BYTE_OFFSET = 88`) and `OUTPUT_ENTRY_SIZE_BYTES = OUTPUT_HEADER_SIZE_BYTES + CHUNK_SIZE_NUM_BYTES = 108`. `MAXIMUM_CHUNK_ELEMENTS` is derived from the unified stride (`(Integer.MAX_VALUE − 4) / 108 = 19 884 107`); `BIT_COUNT_FOR_MAX_CHUNKS_ARRAY` is unchanged at 24. (Design note: the unified single-stride layout — full transfer also carries `work_item_index` — replaces the original dual-stride 104/108 plan, so the capacity bound has a single true entry size.) Tests: `mvn test -Dtest=BitcoinAddressFinderArchitectureTest` (constants-only change).

  Step B — Config flags ✅ (`configuration/CProducerOpenCL.java`) Add `boolean enableGpuFilter = false` and `boolean transferAll = false` with Javadoc. Tests: JSON round-trip test in `CProducerOpenCLTest` verifying both fields default to `false` and survive a Jackson serialise/deserialise cycle.

  Step C — BinaryFuse8 getter exposure ✅ (`persistence/inmemory/BinaryFuse8AddressPresence.java`) Add package-private getters: `getFingerprints()`, `getSeed()`, `getSegmentLength()`, `getSegmentLengthMask()`, `getSegmentCountLength()`. Tests (new, no GPU needed):

  - `getSeed_returnsInitialSeedForFirstSuccessfulBuild` — build a small filter, verify `getSeed()` is non-zero and matches the value `containsAddress` uses internally (cross-check by building an equivalent key manually).
  - `getSegmentLengthMask_isSegmentLengthMinusOne` — assert `getSegmentLengthMask() == getSegmentLength() - 1`.
  - `getFingerprints_lengthEqualsSlotCount` — assert `getFingerprints().length == slotCount()`.
  - `getters_doNotMutateFilter` — call all getters, then verify `containsAddress` still returns the same answers.

  Step D — GPU VRAM upload ✅ (`opencl/OpenCLContext.java`) Architecture note (revised wiring). The enforced layered-architecture test forbids `consumer → opencl` and `opencl → persistence`. So the upload is routed cleanly through the producer rather than the consumer: `OpenCLContext.uploadGpuFilter` takes primitives only (`byte[] fingerprints, int seedLo, int seedHi, int segLen, int segLenMask, int segCountLen`) — the OpenCL layer never references the persistence filter type. `BinaryFuse8AddressPresence.toGpuFilterData()` (public) returns a `BinaryFuse8GpuFilterData` record (a persistence-package carrier) that the engine (`Finder`, which may access both persistence and producer) reads and decomposes into those primitives, handing them to `ProducerOpenCL`, which uploads after its `OpenCLContext.init()` (Step H). `uploadGpuFilter` allocates two `cl_mem` buffers with `CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR` (the fingerprint slot array + a 5-int metadata buffer `[seedLo, seedHi, segLen, segLenMask, segCountLen]`) and retains them for kernel binding; an empty filter pads to a 1-byte fingerprint buffer (zero-size device buffers are invalid) and is detected via `segCountLen == 0`. `close()` releases both. Adds `isInitialized()`. Tests (skip if no OpenCL device):

  - `uploadGpuFilter_andClose_doesNotThrow` — build a 5-entry filter, upload its payload, call `close()`, assert no exception.
  - `isInitialized_falseBeforeInit_trueAfterInit_falseAfterRelease` — lifecycle state test including the filter upload path.
  - Plus non-GPU `toGpuFilterData_*` tests pinning that the payload mirrors the getters, and the empty-filter case.

  Step E — Unified output buffer: count header + per-entry work_item_index ✅ (kernel `.cl` + `OpenClTask` + `OpenCLGridResult`) Migrate the existing kernel output to the single unified 108-byte entry layout. `getDstSizeInBytes()` becomes `OUTPUT_HEADER_SIZE_BYTES + OUTPUT_ENTRY_SIZE_BYTES × overallWorkSize`. Work-item 0 writes `0xFFFFFFFFu` (the full-transfer sentinel) to `output[0..3]`. Each work-item writes its `work_item_index` at entry offset 0, then X/Y/hash160s at the unified entry offsets (shifted by the 4-byte index field). `getPublicKeyFromByteBufferXY` reads from `OUTPUT_HEADER_SIZE_BYTES + entry × OUTPUT_ENTRY_SIZE_BYTES` using the unified `OUTPUT_ENTRY_*` offsets. `getPublicKeyBytes()` reads the count word and asserts the sentinel (compact path not yet live). Tests (no GPU needed — hand-crafted `ByteBuffer`):

  - `getPublicKeyBytes_sentinelCount_dispatchesToFullTransfer` — ByteBuffer with `0xFFFFFFFF` at offset 0 followed by correct 108-byte unified entries; assert correct `PublicKeyBytes` array returned.
  - `getPublicKeyFromByteBufferXY_offsetShiftedByHeaderAndIndex` — verify the first key's X bytes are at byte `4 + 4 = 8` (header + index field), not byte 0.

Step F — In-kernel Fuse8 check + compact output path ✅ (kernel
.clfile +OpenClTask+OpenCLContext) Add 3 new kernel arguments (the 5 metadata ints are passed as one buffer rather than 5 scalars, matching the Step D upload):fuse8_fp(fingerprint buffer),fuse8_meta([seedLo, seedHi, segLen, segLenMask, segCountLen]buffer),transfer_all(uint). The physical entry layout is unchanged from Step E (the unified 108-byte entry); only the write decision changes: in compact mode (transfer_all == 0) a work-item writes its entry only if its uncompressed OR compressed hash160 passes the filter, claiming the slot viaatomic_addon the count word (which starts at 0). In full-transfer mode (transfer_all != 0) every work-item writes its entry at its own index and work-item 0 stamps the sentinel, exactly as Step E. The kernel's Fuse8 primitives (fuse8_hash64/reduce/rotl/fingerprint/contains) are a byte-exact port of the Java helpers, andfuse8_key_from_ripemdbyte-swaps the two LE RIPEMD words to reproduceByteBuffer.getLong(0).OpenClTask.executeKernelbinds the args, zero-initialises the count word viaclEnqueueWriteBufferbefore launch, and reads back only the bytes the count word reports (full →workSizeentries; compact → K).OpenCLContextallocates a dummy empty filter oninit()so the args are always bindable, tracksgpuFilterUploaded, and computestransfer_all = transferAll || !gpuFilterUploaded. Test:Fuse8GpuHashParityTest(no GPU) pins the key-extraction + hash formula againstBinaryFuse8AddressPresence.containsAddressover 1 000 distinct hash160 inputs (members + non-members), so any kernel drift fails in pure Java. The kernel build + compact execution are exercised end-to-end in Step I (GPU-gated).Hash strategy — MurmurHash3 finalizer (standard for XOR/fuse filters, ~20 lines total):

    static ulong murmur64(ulong h) {   // 5 lines — the entire hash primitive
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdUL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53UL;
        h ^= h >> 33;
        return h;
    }
    // Key extraction: first 8 bytes of hash160 as big-endian uint64 — matches Java ByteBuffer.getLong(pos)
    // h0/h1/h2 via reduce = (uint)(((ulong)(uint)ph * (ulong)seg) >> 32)   [no division]
    // seed rotations: rotl(seed,21) and rotl(seed,42) for h1/h2 independence
    // fingerprint: (uchar)(ph ^ (ph >> 32))
    // XOR invariant: fp[h0]^fp[h1]^fp[h2] == fingerprint → hit

Critical: key extraction in the kernel (first 8 bytes of hash160 as big-endian `ulong`) must match `BinaryFuse8AddressPresence.containsAddress()` exactly (`hash160.getLong(hash160.position())`). Any mismatch → false negatives (missed balance hits).

Tests:
- `Fuse8GpuHashParityTest` (no GPU needed, pure Java): reimplements the GPU hash logic in Java (`key = ByteBuffer.wrap(h160).getLong(0); ph = hash64(key, seed); h0 = reduce((int)ph, seg); h1 = reduce((int)hash64(key, rotl(seed,21)), seg) + seg; ...`). For 1 000 distinct hash160 inputs, asserts that this Java reimplementation agrees with `BinaryFuse8AddressPresence.containsAddress()` on every hit/miss answer. This pins the exact key-extraction + hash formula in a runnable test before any OpenCL code is written; if the formulas drift, this test fails.
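The primitives named in the kernel comments above can be sketched in plain Java as follows. This is a minimal illustration, not the repo's helper class: the class name `Fuse8HashSketch` and its method names are hypothetical, and how `hash64` mixes the seed into the key is deliberately left out because the text does not specify it.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the hash primitives described above (names are assumptions).
final class Fuse8HashSketch {

    // MurmurHash3 64-bit finalizer; Java needs the unsigned shift >>> where OpenCL uses >> on ulong.
    static long murmur64(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // First 8 bytes of the hash160, read big-endian (ByteBuffer's default byte order).
    static long keyFromHash160(byte[] hash160) {
        return ByteBuffer.wrap(hash160).getLong(0);
    }

    // Division-free range reduction: maps a 32-bit hash into [0, n).
    static int reduce(int hash, int n) {
        return (int) (((hash & 0xFFFFFFFFL) * (n & 0xFFFFFFFFL)) >>> 32);
    }

    // 8-bit fingerprint: fold the upper half onto the lower half, keep the low byte.
    static byte fingerprint(long ph) {
        return (byte) (ph ^ (ph >>> 32));
    }
}
```

The two places where a port between OpenCL C and Java most easily drifts are the shift operator (`>>` on a signed Java `long` sign-extends) and the masking in `reduce` (a Java `int` must be widened as unsigned before the multiply).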
Step G — Java compact-mode reader ✅ (`opencl/OpenCLGridResult.java`)

`getPublicKeyBytes()` dispatches on the count word: `0xFFFFFFFF` → `readFullTransfer()` (walk `workSize` entries), otherwise → `readCompact(count)` (walk K entries). Because the entry layout is unified, both paths share the same per-entry parser: read `work_item_index` (entry offset 0), derive the secret via `KeyUtility.calculateSecretKey(secretBase, index, USE_OR)`, read X/Y/hash160u/hash160c at the unified `OUTPUT_ENTRY_*` offsets, assemble `PublicKeyBytes` via `PublicKeyBytes.assembleUncompressedPublicKey(x, y)`.

Tests (no GPU needed, hand-crafted `ByteBuffer`):
- `readCompact_countZero_returnsEmptyArray`
- `readCompact_countTwo_returnsCorrectSecretsAndHashes`: encode two known compact entries; assert both `PublicKeyBytes` have the expected secrets, uncompressed keys, and hash160s.
- `readCompact_invalidSecretZero_returnsInvalidKeyOne`: compact entry with a `work_item_index` that makes `secret = 0`; assert the result is `PublicKeyBytes.INVALID_KEY_ONE`.
- `getPublicKeyBytes_sentinelDispatch_doesNotCallCompactPath`: the sentinel count must not be treated as a compact entry count.
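The count-word dispatch and the shared per-entry parser could look roughly like the sketch below. It is an illustration only: the class `CompactReaderSketch`, its constants, and the `Entry` type are hypothetical stand-ins for the real `OUTPUT_*` constants and `PublicKeyBytes`; the offsets are inferred from the 108-byte layout described above (4-byte index, then X, Y, and the two hash160s); and little-endian integers are an assumption about the device.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names, offsets and byte order are assumptions, not the repo's API.
final class CompactReaderSketch {
    static final int HEADER_SIZE = 4;          // count word
    static final int ENTRY_SIZE = 108;         // unified entry
    static final int OFF_INDEX = 0, OFF_X = 4, OFF_Y = 36, OFF_H160_U = 68, OFF_H160_C = 88;
    static final int SENTINEL = 0xFFFFFFFF;    // full-transfer marker

    static final class Entry {
        final int workItemIndex;
        final byte[] x, y, hash160Uncompressed, hash160Compressed;

        Entry(int workItemIndex, byte[] x, byte[] y, byte[] h160u, byte[] h160c) {
            this.workItemIndex = workItemIndex;
            this.x = x;
            this.y = y;
            this.hash160Uncompressed = h160u;
            this.hash160Compressed = h160c;
        }
    }

    // Full transfer walks workSize entries; compact mode walks only the reported count.
    static List<Entry> read(ByteBuffer raw, int workSize) {
        ByteBuffer buf = raw.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int countWord = buf.getInt(0);
        int entries = (countWord == SENTINEL) ? workSize : countWord;
        List<Entry> out = new ArrayList<>(entries);
        for (int i = 0; i < entries; i++) {
            int base = HEADER_SIZE + i * ENTRY_SIZE;
            out.add(new Entry(
                    buf.getInt(base + OFF_INDEX),
                    slice(buf, base + OFF_X, 32),
                    slice(buf, base + OFF_Y, 32),
                    slice(buf, base + OFF_H160_U, 20),
                    slice(buf, base + OFF_H160_C, 20)));
        }
        return out;
    }

    private static byte[] slice(ByteBuffer buf, int offset, int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = buf.get(offset + i);
        }
        return bytes;
    }
}
```

Because both modes go through the same loop, the only mode-specific logic is how many entries to walk, which is what makes the hand-crafted `ByteBuffer` tests above cheap to write.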
Step H — Integration wiring ✅ (`consumer/ConsumerJava.java`, `producer/ProducerOpenCL.java`, `engine/Finder.java`)

Architecture note (revised wiring). Routed through the producer instead of the consumer (the layered architecture forbids `consumer → opencl`). `ConsumerJava.getGpuFilterData()` returns the `BinaryFuse8GpuFilterData` payload when the backend is `BINARY_FUSE_8`. `Finder.configureProducer()` then (1) calls `applyVanityFullTransferOverride()`, which forces `transferAll = true` on every OpenCL producer config when `consumerJava.enableVanity = true` (with a warning that compact mode is disabled), and (2) calls `uploadGpuFilterToProducers()`, which for each `enableGpuFilter` producer not in full-transfer mode reads the consumer payload, decomposes it into primitives, and calls `ProducerOpenCL.setGpuFilter(...)`. `ProducerOpenCL.initProducer()` uploads the staged filter to VRAM right after `OpenCLContext.init()`.

Tests:
- `FinderTest`: `applyVanityFullTransferOverride` forces `transferAll = true` under vanity and leaves it `false` otherwise.
- `ConsumerJavaTest` (LMDB-gated): `getGpuFilterData()` returns a payload for the `BINARY_FUSE_8` backend and empty for `LMDB_ONLY`.
Step I — End-to-end integration test + example config ✅

Added `OpenCLCompactOutputIntegrationTest`. All assertions are exact (no ± tolerances) because the test derives every candidate on the CPU and controls exactly which are inserted into the filter. It runs on any OpenCL 2.0+ device (the grid size is capped to the device's safe range, so a CPU OpenCL runtime such as pocl exercises the same compact path on a small batch and a GPU runs the full 256-wide batch); it self-skips when no OpenCL 2.0+ device is present. Three sub-tests: full-batch (all 2N hash160s inserted → `count == N` exactly, every returned key passes `runtimePublicKeyCalculationCheck()`); partial-batch (K=3 uncompressed hashes at indices 0, N/2, N−1 inserted → `count >= K` from the no-FN guarantee and `count < N`, with each inserted index present among the returned work-items); empty-filter (`count == 0`). Added `examples/config_Find_GPUFilterCompact.json` (BINARY_FUSE_8 backend + `enableGpuFilter: true`, `transferAll: false`).

Test setup (shared): choose a fixed `secretBase` and `workSize = N` (e.g. N = 256). CPU-side, derive all N secrets `secretBase + i` for `i = 0 .. N-1` and compute their `hash160_uncompressed` + `hash160_compressed` via `KeyUtility`.

Full-batch test — count_out == N exactly:
- Populate `BinaryFuse8AddressPresence` with all 2N hash160s (both variants for every key in the range).
- Upload filter to GPU. Run kernel with `secretBase`, `workSize = N`, `transfer_all = 0`.
- Assert `compact_count_out == N`: zero misses in the batch → zero false positives possible → count is provably exact, not approximate.
- Assert all N returned `PublicKeyBytes` pass `runtimePublicKeyCalculationCheck()`.
Partial-batch test — count_out is in [K, N):
- Choose a small K (e.g. K = 3) and populate the filter with only those K hash160_uncompressed values from the range (indices 0, N/2, N-1, spread across the batch).
- Upload filter. Run same batch (`workSize = N`).
- Assert `compact_count_out >= K` (no-FN guarantee: every inserted address must be found).
- Assert `compact_count_out < N` (the batch is mostly misses). No tighter upper bound is asserted, because filter false positives on the non-inserted hash160s can add a few extra entries.
- The K returned entries whose `work_item_index` matches an inserted key must each pass `runtimePublicKeyCalculationCheck()`.
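A back-of-the-envelope estimate shows why the partial-batch assertions use `>= K` and `< N` rather than an exact count. It assumes the usual false-positive rate of about 2⁻⁸ for a filter with 8-bit fingerprints and treats the two lookups per work-item as independent; both are assumptions for illustration, not measured values for this kernel.

```latex
\varepsilon \approx 2^{-8}, \qquad
P(\text{non-inserted work-item is written}) = 1 - (1 - \varepsilon)^2 \approx 2\varepsilon

\mathbb{E}[\text{count}] \approx K + (N - K)\cdot 2\varepsilon
  = 3 + 253 \cdot \tfrac{2}{256} \approx 5 \qquad (N = 256,\ K = 3)
```

Under these assumptions a typical run returns a couple of entries beyond the K inserted ones, which is consistent with a lower bound of K and an upper bound far below N.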
Empty-filter test — count_out == 0:
- Upload an empty filter (0 entries). Run the batch.
- Assert `compact_count_out == 0`.

Add `examples/config_Find_GPUFilterCompact.json`. Mark each step ✅ in this TODO as it lands.
- GPU-side no-false-negatives address filter with per-variant flag bitmask. Today the GPU kernel returns raw hash160 bytes for every candidate key and the CPU calls `lookup.containsAddress(...)` twice per candidate (once for the uncompressed address, once for compressed). For large workgroup sizes this serialises the most expensive part of the pipeline on a single CPU thread. The goal is to move the address-presence filter onto the GPU so that the CPU only receives, and only queries LMDB for, the tiny subset of candidates that the GPU marked as "possibly found". The filter must satisfy the no-false-negatives invariant: if an address IS in the database, the GPU must always set its flag bit. False positives are acceptable: the CPU verifies any flagged candidate against LMDB and discards false positives there. This is exactly the bloom-filter contract: zero false negatives, bounded false positives.
- Upload the snapshot once at startup. Right after `TruncatedLong64SortedArrayPresence.populateFrom(lmdb)` builds the 256 sorted `long[]` buckets in host RAM, copy each bucket into device global memory (`cl_mem` buffer per bucket, plus a small offset/length index). At ~8 B/entry this fits comfortably in modern GPU VRAM for any practical database size (~1.1 GB for the Light DB, ~11 GB for the Full DB; the latter may need streaming on smaller cards).
- Filter semantics and VRAM trade-offs. The TRUNCATED_LONG_64 snapshot (sorted 64-bit truncated values, binary search per work-item) gives a near-zero false-positive rate (~7.5 × 10⁻¹¹ per query for the Full DB) and satisfies no-FN by construction: no entry is ever omitted. This is stronger than the filter contract strictly requires. For VRAM-constrained GPUs (8 GB cards with Full DB), alternative probabilistic structures use less memory at the cost of a higher but still acceptable FP rate:
- Bloom filter on GPU: hash160 is already a strong 160-bit uniform hash, so different bit-window slices of the hash160 serve directly as the k independent probe positions — no separate hash computation. For 132 M entries (Light DB) a 512 MB filter gives ~32 bits/element and FP ≈ 10⁻⁹; for 1.4 B entries (Full DB) the same 512 MB gives ~3 bits/element and FP ≈ 10 % — unacceptable, requiring ~4–8 GB for a useful FP rate. GPU bloom filters are well-studied; the access pattern (k random bit reads) suits global memory with reasonable L2 hit rates for small-to-medium filters.
- Binary Fuse filter (XOR filter family): uses ~1.23 × log₂(1/ε) bits/element (vs ~1.44 for bloom). Lookup is 3 array accesses + XOR of 8-bit or 16-bit fingerprints derived from the hash160. For 132 M entries at FP ≈ 10⁻⁹: ~370 MB. For 1.4 B entries at the same rate: ~3.9 GB. Construction is offline (once, before upload); GPU-side lookup is simple and branch-free. Worth evaluating if VRAM is tight.
- Recommendation: use TRUNCATED_LONG_64 (already implemented CPU-side) for the first GPU implementation. Switch to a binary fuse filter only if VRAM headroom on the target GPU makes TRUNCATED_LONG_64 impractical for the DB size in use.
- Output a flags byte, not raw hashes. The kernel's per-work-item result struct gains a `uint8_t flags` field. The GPU runs the filter lookup twice (once for the uncompressed hash160 and once for the compressed hash160) and encodes both results into separate bits:

      bit 0 (0x01): uncompressed hash160 lookup positive — no false negatives guaranteed
      bit 1 (0x02): compressed hash160 lookup positive — no false negatives guaranteed
      bits 2–7:     reserved for future address-type variants (P2SH, bech32, …)

  CPU-side `ConsumerJava` iterates the result batch and only forwards work-items where `flags != 0` to the LMDB verification step. Work-items with `flags == 0` are discarded without any LMDB access. For a well-calibrated GPU filter (TRUNCATED_LONG_64 or a low-FP binary fuse filter) the number of LMDB queries per batch collapses from `batch_size × 2` to essentially zero except on real hits. The no-FN invariant ensures real hits are never discarded.

  Updated per-work-item result layout (extends the existing `PublicKeyBytes.java:240-242` struct):

      [0–31]    X coordinate          (32 bytes, unchanged)
      [32–63]   Y coordinate          (32 bytes, unchanged)
      [64–83]   hash160 uncompressed  (20 bytes, unchanged)
      [84–103]  hash160 compressed    (20 bytes, unchanged)
      [104]     flags                 (uint8_t)
      [105–107] padding to 4-byte alignment
      = 108 bytes/work-item (+4 bytes vs today's 104, +3.8 % PCIe bandwidth)

  `OpenCLGridResult` offset constants (`CHUNK_SIZE_*`) are updated accordingly; no other Java change is needed for `ConsumerJava` to read the new field.
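The CPU-side use of the flags byte can be sketched as below, assuming the 108-byte layout listed above with the flags byte at offset 104. The class `FlagsFilterSketch` and its constants are hypothetical names chosen for illustration; the real code would use the `CHUNK_SIZE_*` constants.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the flags check; names and constants are assumptions.
final class FlagsFilterSketch {
    static final int ENTRY_SIZE = 108;
    static final int FLAGS_OFFSET = 104;
    static final int FLAG_UNCOMPRESSED = 0x01; // bit 0
    static final int FLAG_COMPRESSED = 0x02;   // bit 1

    // Returns the work-item indices whose flags byte is non-zero; only these need an LMDB lookup.
    static List<Integer> flaggedWorkItems(ByteBuffer results, int workSize) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < workSize; i++) {
            int flags = results.get(i * ENTRY_SIZE + FLAGS_OFFSET) & 0xFF;
            if (flags != 0) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    static boolean uncompressedPossiblyFound(int flags) {
        return (flags & FLAG_UNCOMPRESSED) != 0;
    }

    static boolean compressedPossiblyFound(int flags) {
        return (flags & FLAG_COMPRESSED) != 0;
    }
}
```

Keeping the two variants on separate bits lets the verifier query LMDB only for the variant that was flagged instead of for both.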
- Phased bucket processing inside the workgroup. OpenCL local (shared) memory per workgroup is constrained (typically 32-64 KB); a full bucket (~43 MB at Full DB scale) does not fit. So the workgroup processes its candidates in 256 phases, one per first-byte bucket, in lockstep across all workgroups:
  - Each thread in the workgroup derives its candidate hash160 once and stores `(firstByte, longKey)` to private memory.
  - For each phase `b` ∈ [0, 255]: cooperatively load bucket `b` from global memory in tiles that DO fit into local memory; every thread whose `firstByte == b` runs a branchless binary search of its `longKey` against the loaded tile; tiles are streamed through local memory until the bucket is exhausted; threads whose `firstByte != b` participate in the cooperative load (memory bandwidth) but skip the search.
  - After phase 255 every thread has its `flags` bit set or cleared and the result struct is written to global memory.

  Alternative phase boundary: if the per-thread `firstByte` distribution is too skewed within a workgroup, sort candidates within the workgroup by `firstByte` first (one parallel radix pass over 256 keys) so the per-phase active-thread mask is large enough to be worth the cooperative load.
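The phase/tile structure described above can be simulated on the CPU as in the sketch below. It is a single-threaded model for reasoning about the control flow, not kernel code: the class `PhasedBucketSearchSketch`, the tile size parameter, and the signed ordering of the `long` keys (matching `Arrays.sort`) are assumptions, and "threads" are just array slots.

```java
import java.util.Arrays;

// Hypothetical CPU simulation of the 256-phase, tile-streamed lookup; names are assumptions.
final class PhasedBucketSearchSketch {

    // Fixed-shape binary search: the loop count depends only on the tile length,
    // never on the key, so every thread in a workgroup runs the same number of steps.
    static boolean containsBranchless(long[] sorted, long key) {
        int base = 0;
        int len = sorted.length;
        while (len > 1) {
            int half = len >>> 1;
            base += (sorted[base + half - 1] < key) ? half : 0;
            len -= half;
        }
        return len == 1 && sorted[base] == key;
    }

    // One flag per simulated thread: true if its longKey is present in bucket firstByte.
    static boolean[] runPhases(long[][] buckets, int[] firstByte, long[] longKey, int tileSize) {
        boolean[] flags = new boolean[firstByte.length];
        for (int b = 0; b < 256; b++) {                        // one phase per first-byte bucket
            long[] bucket = buckets[b];
            for (int start = 0; start < bucket.length; start += tileSize) {
                // Stands in for the cooperative load of one tile into local memory.
                long[] tile = Arrays.copyOfRange(bucket, start, Math.min(bucket.length, start + tileSize));
                for (int t = 0; t < firstByte.length; t++) {
                    if (firstByte[t] == b) {                   // inactive threads skip the search
                        flags[t] |= containsBranchless(tile, longKey[t]);
                    }
                }
            }
        }
        return flags;
    }
}
```

The model makes the cost structure visible: every phase touches the whole bucket regardless of how many threads are active in it, which is the motivation for the "sort by `firstByte` first" alternative.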
- What this buys, and the throughput trade-off that has to be measured first. The naive framing is "the slow path (~108 ns/op CPU per `containsAddress`) collapses into the GPU's existing parallelism budget; the CPU only sees and follow-up-verifies the small set of 'probably present' candidates." That is only true if the GPU has spare compute and bandwidth headroom; adding work-per-thread is not free. Failure modes that can turn this from a win into a regression:
  - Compute saturation. The current kernel is dominated by secp256k1 scalar multiplication (compute-bound on the ECC inner loop). Each work-item that also runs 256 phases of cooperative tile loading + per-phase binary search lengthens the work-item; if the GPU was already at ~100 % ALU occupancy on ECC, the presence check serialises behind that and the wall-clock per work-item grows almost linearly with the added work.
  - Memory-bandwidth competition. ECC is compute-bound; the presence-check phase is memory-bound (~1.1 GB Light DB or ~11 GB Full DB streamed through global memory each scan). On a GPU whose global-memory bandwidth is already a constraint (compact consumer cards in particular), the presence-check phase steals bandwidth from kernels that may need it later, and the cooperative-load tile fills can stall warps that ECC was not stalling.
  - VRAM displacement. The 1.1–11 GB snapshot competes for VRAM with the workgroup's own state and any other kernel resources. On 8 GB cards the Full DB snapshot doesn't fit; on 12–16 GB cards it fits but leaves little headroom for batch growth.
  - The crossover question that decides whether this TODO is worth doing at all:

    | Scenario | CPU `.contains()` saturated? | Adding presence check to GPU is… |
    |---|---|---|
    | CPU has spare cores (consumer feed is the bottleneck) | No | Likely a regression: GPU slows, CPU lookups remain easy. Don't ship. |
    | CPU saturated, GPU has headroom | Yes | Likely a clear win: even a 10–30 % per-work-item slowdown on the GPU is dominated by the relief on the CPU side. |
    | CPU saturated and GPU near-saturated on ECC | Yes | Maybe a wash: the kernel gets slower in proportion to the CPU relief. Needs measurement. |

    Decide on end-to-end pipeline throughput (candidates verified per second), not on CPU `containsAddress` latency in isolation. The README's 108 ns/op figure is the right input for the CPU-side ceiling, but the GPU-side cost has to be measured directly because it depends on the specific GPU model, the snapshot size, and the workgroup configuration.
  - The measurement plan that has to precede any kernel changes:
    1. Baseline today: record candidates/sec, GPU kernel time per launch, and CPU `containsAddress` time per call at the configurations of interest (Light DB + TRUNCATED_LONG_64; Full DB + TRUNCATED_LONG_64; both with `keysPerWorkItem` matching production).
    2. CPU headroom probe: saturate the CPU `containsAddress` path artificially (run more producer threads than `containsAddress` can handle) to verify the CPU side actually can become the bottleneck under realistic load. If it never saturates, this TODO targets a non-bottleneck and should be deprioritised.
    3. Kernel-cost simulation: without writing the full presence-check kernel, ship a stub that adds a deterministic, comparable-cost workload to each work-item (e.g. a fixed number of dummy reads from a 1 GB device buffer) so the kernel-slowdown side of the trade-off is quantified before the real implementation.
    4. Decision threshold: proceed with the real implementation only if step 2 confirms the CPU is the bottleneck and step 3 shows kernel slowdown ≤ the CPU relief at the workgroup size of interest. If either condition fails, the right answer is to fix the CPU side (switch backend, batch-prefetch, or do the smaller "GPU pre-computed hash" TODO above) instead.
- What this buys when the trade-off is favourable. With the CPU saturated and the GPU having ≥ 20–30 % compute headroom: the slow path (~108 ns/op CPU) effectively disappears from the pipeline and the bottleneck shifts elsewhere (typically PCIe upload of the producer keystream or LMDB verification of the now-tiny "probably present" subset). For a typical workgroup of 256 threads at 10 M candidates/s under those conditions, the difference is roughly 1 s of CPU lookup time per second of scanning for each hash160 variant (10 M lookups × 108 ns ≈ 1.08 s) vs negligible. Under the unfavourable trade-off, expect net regression: kernel time grows by more than the CPU saves, throughput drops, and the only thing gained is a more complex pipeline.
- What needs to be designed first (before any OpenCL code is written): the cooperative-load tile size (function of GPU local memory + bucket size + workgroup size); how the 108-byte result struct (with the new `flags` byte) flows through the existing `OpenClTask` + `OpenCLGridResult` types, specifically the `CHUNK_SIZE_*` offset constants and the Java-side extraction loop; the upload path (one-shot at `OpenCLContext.init()` or per-kernel-launch); whether the snapshot is rebuilt on the GPU after each LMDB update or kept immutable for the run (current expectation: immutable per scan session, matching the CPU `populateFrom` contract); and the filter-choice gate in configuration (new `addressLookupBackend` value `GPU_TRUNCATED_LONG_64`, with `GPU_BINARY_FUSE` reserved for the alternative filter path).
- Long-term vision: end-to-end address scan on the GPU; CPU is a thin verifier only. The "GPU-side no-false-negatives filter" TODO above is the first concrete step; this entry captures the broader end-state it leads to. North star: a single scan invocation is one GPU pipeline that emits only the small set of "possibly found" candidates back to the CPU; the CPU's only remaining job is to verify those few candidates against LMDB. Everything currently sitting between GPU output and LMDB (`BLOOM`, `HASHSET`, `TRUNCATED_LONG_64` on the CPU path) becomes optional / disable-able because the GPU's own filter has already done the work.
  - Two-phase per-launch pipeline on the GPU:
    - Address generation phase (already exists): every thread derives both hash160 variants (uncompressed + compressed) and stores them to private memory.
    - Address lookup phase (new): the workgroups consult the pre-loaded snapshot, run the filter for both hash160 variants, and write back a `{flags, hash160_uncompressed, hash160_compressed}` record per thread. `flags` bit 0 = uncompressed possibly found; bit 1 = compressed possibly found. The CPU only acts on records where `flags != 0`.
  - Snapshot lives in GPU global memory and is loaded once per session. At startup the JVM builds the `TRUNCATED_LONG_64` snapshot (256 sorted `long[]` buckets, ~1.1 GB at Light DB scale, ~11 GB at Full DB) and uploads it to device global memory. Modern consumer GPUs have 8-24 GB of VRAM; the Light DB fits comfortably, the Full DB needs higher-end cards or out-of-core streaming. The upload is one-shot: the snapshot does not change during a scan session.
  - Per-workgroup local-memory tiles. Workgroup local (shared) memory is typically 32-64 KB on consumer GPUs and 96-100 KB on workstation cards, far smaller than even a single full bucket at Full DB scale (~43 MB). The lookup phase streams each bucket through local memory in tiles sized to the workgroup's local-memory budget; threads cooperatively load each tile, then every thread whose `firstByte` matches the bucket index runs a branchless binary search against the loaded tile before the workgroup advances to the next tile.
  - All workgroups process the same bucket at the same time. The 256 phases (one per first-byte bucket) advance in lockstep across all workgroups so that the streamed bucket data is read once from global memory per phase rather than per workgroup. Threads whose `firstByte` does not match the active bucket participate in the cooperative load (memory-bandwidth contribution) but skip the search.
  - The "looser-but-larger candidate set is OK" trade-off. A GPU-side filter does not need to be as exact as the CPU-side TRUNCATED_LONG_64. As long as the rate of "probably present" flags reaching the CPU stays low enough that CPU-side LMDB verification keeps up with the GPU's throughput, a coarser GPU filter is acceptable. Example: a smaller stored value per entry (4 bytes / 32 bits instead of 8) cuts VRAM cost in half and the corresponding ~N/2^32 false-positive rate is still low enough that the CPU handles the resulting "hits" without becoming the bottleneck.
  - Configuration shift implied. Today `addressLookupBackend` selects the in-RAM CPU accelerator. After this work it should add a value like `GPU_ONLY` (or be replaced by a richer `consumerJava.lookupChain: [...]` config) where the operator declares "the GPU filter is the front-line; the CPU pipeline only verifies the flagged candidates against LMDB". `BLOOM` / `HASHSET` / `TRUNCATED_LONG_64` remain available for CPU-only setups or as a fallback while the GPU pipeline is still being commissioned.
  - Why this is the right end-state. The CPU's address-check budget today (~108 ns/op for `TRUNCATED_LONG_64`) is the only synchronisation point between the GPU's parallelism and the verification path. Moving that check to the GPU collapses the CPU contribution to "follow up on the small flagged subset"; at typical false-positive rates this is a few candidates per million derivations, well below any conceivable CPU bottleneck. The CPU then truly becomes the orchestrator (start kernels, read flagged results, verify against LMDB, log hits) rather than the rate-limiting step.

Today BAF couples key generation and address checking in one process. The idea here is the inverse: a dedicated checker service that owns the LMDB database, exposes a network endpoint (socket / WebSocket / ZeroMQ), and receives batches of raw private keys (or hash160 values) from external producers — which may run on a different machine, a cluster of GPUs, or a completely separate codebase.
This is architecturally the mirror image of the existing KeyProducerJavaSocket / KeyProducerJavaWebSocket / KeyProducerJavaZmq inputs, which let external sources feed keys into BAF. What is missing is a mode where BAF acts purely as a checking endpoint — no key generation at all, just receive → derive addresses → LMDB lookup → report hits.
Why it makes sense:
- Separation of concerns: key generation (GPU-heavy) and address checking (DB-heavy) have different hardware profiles. Running them on separate machines lets each be optimised independently.
- Horizontal scaling: multiple GPU nodes can feed a single checker, or a single GPU node can fan out to multiple checker replicas with sharded databases.
- Interoperability: third-party key generators (hashcat, custom FPGA tooling, cloud spot instances) can contribute to a scan without being aware of BAF internals — they just push key bytes to a socket.
Sketch of the design:
- New `CCommand` value `CheckerService` (or `ConsumerService`) that starts a `ConsumerJava` wired to a network-facing key receiver instead of a local producer queue.
- The network receiver implements `Producer` (or a new `KeySource` abstraction) and enqueues received key batches into the existing `LinkedBlockingQueue<byte[]>`. From `ConsumerJava`'s perspective nothing changes.
- Protocol: raw binary frames of packed private keys (same layout as the existing queue entries) over TCP/WebSocket/ZMQ; no new serialisation format needed.
- Hit reporting: existing log output is sufficient for a first version; a structured hit-callback endpoint (HTTP POST, ZMQ PUB) can be added later.
- Configuration: `CProducerCheckerService` (mirrors the `CProducerJavaSocket` shape); the checker service is just another producer config entry with `listenAddress` + `listenPort` + protocol selector.
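A receiver along the lines of the design sketch above might read fixed-size frames from a stream and hand them to the queue unchanged. The sketch below is illustrative only: the class `KeyFrameReceiverSketch` is hypothetical, and the 32-byte frame size and big-endian unsigned interpretation are assumptions about the wire format that the prerequisite below still has to pin down.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.util.Queue;

// Hypothetical sketch of a network key receiver; frame size and names are assumptions.
final class KeyFrameReceiverSketch {
    static final int KEY_BYTES = 32;

    // Reads back-to-back 32-byte frames until end of stream and enqueues each one.
    // A trailing partial frame is dropped. Returns the number of complete frames.
    static int drain(InputStream in, Queue<byte[]> queue) {
        DataInputStream data = new DataInputStream(in);
        int received = 0;
        try {
            while (true) {
                byte[] frame = new byte[KEY_BYTES];
                data.readFully(frame);          // throws EOFException on a short read
                queue.add(frame);
                received++;
            }
        } catch (EOFException endOfStream) {
            return received;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Interprets a frame as an unsigned big-endian private key.
    static BigInteger toSecret(byte[] frame) {
        return new BigInteger(1, frame);
    }
}
```

In the checker service the `Queue<byte[]>` argument would be the existing `LinkedBlockingQueue<byte[]>`, so the consumer side stays untouched.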
What this is NOT: this is not the GPU-side presence-check optimisation (pushing TRUNCATED_LONG_64 into the OpenCL kernel). That optimisation keeps everything in one process and moves the lookup to the GPU. The checker service idea moves the lookup to a separate process/machine, which is the opposite trade-off — more network overhead, but full decoupling of key generation hardware from database hardware.
Prerequisite before implementing: define the wire format for key batches (size, byte order, compressed vs uncompressed flag) so that non-BAF producers can implement it without reading BAF source. Document it in docs/wire-protocol.md.
Decouple "producing" from "finding" — a proper network API (finding service + result feedback + web UI)
Vision. Split the project's two concerns — producing candidate keys and finding (derive address → check DB → report) — into cleanly separated, independently runnable parts, within this existing project. Today they are coupled in one Finder process. Shrink and separate them so the finding side becomes a self-contained service reachable over the network that accepts keys, key ranges, or hash160s from any source; so a lot more producers can be added (in-process or fully external) without touching the finder; and so operation/monitoring can move to a web UI. This is the umbrella the Standalone consumer (checker) service entry above is the first slice of, and it feeds the Cross-platform GUI entry below.
Three network key-input transports already exist, all inbound ingestion only, all feeding the shared LinkedBlockingQueue<byte[]> through AbstractKeyProducerQueueBuffered:
| Transport | Class | Mode(s) | Framing today |
|---|---|---|---|
| TCP | `KeyProducerJavaSocket` | SERVER (accept) / CLIENT (connect) | reads fixed 32-byte frames back-to-back, no delimiter/header |
| WebSocket | `KeyProducerJavaWebSocket` | embedded server | one binary message == 32 bytes per key (else rejected/logged) |
| ZeroMQ | `KeyProducerJavaZmq` | PULL, BIND / CONNECT | one message == 32 bytes per key |
The "wire format" is therefore implicit and minimal: the unit is exactly 32 bytes = one raw big-endian unsigned private key (decoded by keyUtility.bigIntegerFromUnsignedByteArray; length must equal PRIVATE_KEY_MAX_NUM_BYTES). What is missing (the gap to fill):
- No protocol version / handshake, no message types, no batch header (one key per message on WS/ZMQ is very chatty).
- No key-range submission. Only the local `KeyProducerJavaIncremental` enumerates ranges; the network path cannot say "scan start..start+N". (Earlier drafts of this TODO wrongly implied ranges over sockets; they do not exist.)
- No way to submit hash160s directly (needed for the pure checker mode).
- No acknowledgements / errors back to the sender (it can't tell if keys were accepted, invalid, or dropped).
- No result feedback at all: hits/vanity are `ConsumerJava` log-only (`HIT_PREFIX` / `VANITY_HIT_PREFIX`); a remote sender gets nothing and cannot correlate a hit to what it sent.
- No backpressure (the queue is unbounded → memory risk under a fast sender), no auth/TLS (it moves private keys in the clear).
Topology + ownership (confirmed). The server (e.g. one PC with a GPU) runs with its own configured settings and owns them — clients never configure it. In the normal case the two client roles are different machines: a key-sender client feeds keys to be checked (→ server's receiver), and a completely separate result-listener client only subscribes to events (→ server's sender/publisher). A web UI is the special case that does both. Every event is broadcast to all subscribers and is self-describing (carries its key/address), so a listener that did not submit still knows what each event refers to — no per-client correlation required.
Define one small, versioned, typed, framed message schema (document it in docs/wire-protocol.md + a machine schema), split by direction:
Client → server (ingestion / receiver):
- `HELLO`: capability + version negotiation (key formats, compression, max batch, supported coins).
- `SUBMIT_KEYS`: a batch of N×32-byte keys in one frame. The minimal baseline is "just a private key to check"; batching kills the one-message-per-key overhead.
- `SUBMIT_RANGE`: `{startKey:32, count, increment}` (+ optional compressed/coin flags). Orders of magnitude less bandwidth than enumerating keys; the finder expands it (reuse `KeyProducerJavaIncremental`). The single highest-value ingestion capability. (optional extension)
- `SUBMIT_HASH160`: pre-derived 20-byte hashes for the pure checker mode (skip EC derivation). (optional extension)
- `ACK` / `ERROR`: accepted count, or rejection with a code (bad length, out-of-range, backpressure, unknown type).
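To make the bandwidth argument for ranges and batches concrete, the sketch below expands a `{startKey, count, increment}` triple into 32-byte keys and packs a batch into one frame. Everything about the frame header is hypothetical (a 1-byte version, a 1-byte type code, a 4-byte big-endian count); the real layout is exactly what `docs/wire-protocol.md` still has to define, and the class name `SubmitFramesSketch` is made up for this example.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing sketch; header fields and type codes are assumptions, not a spec.
final class SubmitFramesSketch {
    static final int KEY_BYTES = 32;
    static final byte PROTOCOL_VERSION = 1;    // assumed
    static final byte TYPE_SUBMIT_KEYS = 0x02; // assumed

    // Encodes a non-negative secret as exactly 32 big-endian bytes (left-padded with zeros).
    static byte[] toKeyBytes(BigInteger secret) {
        if (secret.signum() < 0 || secret.bitLength() > KEY_BYTES * 8) {
            throw new IllegalArgumentException("secret does not fit in 32 unsigned bytes");
        }
        byte[] raw = secret.toByteArray();     // may carry a leading sign byte
        byte[] out = new byte[KEY_BYTES];
        int copy = Math.min(raw.length, KEY_BYTES);
        System.arraycopy(raw, raw.length - copy, out, KEY_BYTES - copy, copy);
        return out;
    }

    // What the finder would do with a SUBMIT_RANGE: enumerate start, start+inc, ...
    static List<byte[]> expandRange(BigInteger start, int count, BigInteger increment) {
        List<byte[]> keys = new ArrayList<>(count);
        BigInteger current = start;
        for (int i = 0; i < count; i++) {
            keys.add(toKeyBytes(current));
            current = current.add(increment);
        }
        return keys;
    }

    // One SUBMIT_KEYS frame: [version:1][type:1][count:4][count x 32-byte keys].
    static ByteBuffer encodeSubmitKeys(List<byte[]> keys) {
        ByteBuffer frame = ByteBuffer.allocate(1 + 1 + 4 + keys.size() * KEY_BYTES);
        frame.put(PROTOCOL_VERSION).put(TYPE_SUBMIT_KEYS).putInt(keys.size());
        for (byte[] key : keys) {
            if (key.length != KEY_BYTES) {
                throw new IllegalArgumentException("key must be 32 bytes");
            }
            frame.put(key);
        }
        frame.flip();
        return frame;
    }
}
```

With this assumed header, a batch of 1 000 keys costs 32 006 bytes in one message instead of 1 000 separate messages, while the equivalent range request would only need the start key, a count, and an increment.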
Server → subscribers (notification / sender) — queue/job granularity + configured hits ONLY:
- `CONFIG`: sent once on connect; a snapshot of the server's running configuration so a new subscriber knows the setup (device, backend, coins, vanity pattern, …).
- `KEY_QUEUED`: "the service received a key/range to scan; it is now on the queue." Emitted per submitted unit (sender-controlled volume).
- `GENERATION_STARTED`: emitted per key/key-range taken from the queue (scanning/generation began for that unit).
- `VANITY_HIT`: a configured vanity-pattern match.
- `HIT`: a database/GPU found: secret + address + variant (uncompressed/compressed/coin).
- `PROGRESS` (optional): periodic aggregate for a range scan (position, keys/sec, ETA) for the UI.
- `PING` / `PONG`, `BYE`.
- Hard non-goal: never emit an event per produced key; emit only the configured results of interest (vanity hits / DB founds) plus the queue-lifecycle events above. Streaming every generated key would be orders of magnitude too slow and defeats the purpose.
Flow control: credit-based backpressure on ingestion (server advertises how many it can accept) replacing the unbounded queue; on the notification side, bounded per-subscriber queues with explicit drop (for lossy PROGRESS) rather than silent loss, while founds stay on the durable log sink.
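One simple way to model the ingestion-side credits is a counting semaphore: the server starts with as many credits as it has queue capacity, a submission is accepted only if enough credits are free, and credits return when the consumer has processed the keys. The class `CreditGateSketch` below is a hypothetical illustration of that idea, not a proposal for the final protocol fields.

```java
import java.util.concurrent.Semaphore;

// Hypothetical credit gate; one permit represents capacity for one queued key.
final class CreditGateSketch {
    private final Semaphore credits;

    CreditGateSketch(int capacity) {
        this.credits = new Semaphore(capacity);
    }

    // Non-blocking: either the whole batch fits, or it is rejected (ERROR with a backpressure code).
    boolean tryAccept(int keys) {
        return credits.tryAcquire(keys);
    }

    // Called after the consumer has drained keys from the queue.
    void consumed(int keys) {
        credits.release(keys);
    }

    // The number the server could advertise to senders.
    int available() {
        return credits.availablePermits();
    }
}
```

Rejecting a whole batch rather than accepting part of it keeps the `ACK` / `ERROR` semantics simple: the sender either retries the same frame later or splits it.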
Transport recommendation (honest trade-offs):
- WebSocket = primary. Already a dependency (`Java-WebSocket`); bidirectional and self-framing (message boundaries built in, no manual length-prefix); carries binary (bulk keys) and text/JSON (control); and is browser-native, so the same endpoint feeds the web UI (ingest + feedback + monitoring over one connection). Best single choice for "a very good network API + the planned UI."
- gRPC (bidirectional streaming) = optional, for typed multi-language producers. Protobuf gives a language-neutral contract with codegen (hashcat/FPGA/cloud producers in any language), built-in flow control and deadlines. Cost: adds `grpc-java` (heavier; friction with the repo's `dependencyConvergence` + `bannedDependencies`), and needs grpc-web for browsers. (Note: protobuf is already a transitive dep via bitcoinj, so the schema language is familiar.)
- ZeroMQ = optional, for scale fan-out. Great for many-producers→one-finder or one-producer→many-checkers, but no built-in request/response correlation and not browser-friendly. Keep for the horizontal-scaling story.
- Raw TCP = keep as a minimal/low-overhead option, but versioned + length-prefixed (the current "read 32 bytes forever" is fragile and un-versioned).
Recommendation: implement the schema once, ship WebSocket first (covers the API and the UI), add gRPC/ZMQ later only if a concrete scale/interop need appears.
The current transports are one-directional and not n-to-n: TCP KeyProducerJavaSocket in SERVER mode calls serverSocket.accept() exactly once and then reads 32-byte frames from that single client forever (one connection per producer instance — not even n-to-1); WebSocket's server is multi-client but only ingests; ZMQ uses a single PULL socket. None of them ever send anything back. So "async notification of results/progress" is genuinely new wiring, and the natural shape is two logical channels with opposite fan patterns:
- Ingestion channel — fan-in (n producers → 1 finder): n key producers submit keys/ranges/hash160s.
- Notification channel — fan-out (1 finder → n listeners): the finder publishes `HIT` / `VANITY_HIT` / `PROGRESS` to all subscribers.
Is a separate (secondary) results socket good? — Yes, as the default, with a single-socket option per transport. Trade-off:
- Separate results channel (recommended baseline). Keys and results have different fan patterns (fan-in vs fan-out) and different audiences — a pure monitoring listener (e.g. the web UI) is not a key producer and should not need submit rights; independent lifecycle, auth scope, and backpressure; idiomatic for ZMQ. No mapping / no correlation by design: every event is broadcast to every subscriber and is self-describing (carries its key/address), so the server keeps no per-client routing state and a listener that never submitted still sees everything. Cost is just two endpoints to configure.
- Single bidirectional connection (per-transport option). For WebSocket / gRPC streaming the client can submit and subscribe on the same connection (one endpoint) — convenient for the web-UI special case. It still broadcasts all events to all subscribers (no per-connection filtering), and a pure listener that never submitted still needs the subscribe role + the shared subscriber registry.
Recommended model: two roles, not two hard-coded sockets. A client may take a submit role, a subscribe role, or both. Each transport maps the roles idiomatically:
| Transport | Ingestion (fan-in) | Notification (fan-out) |
|---|---|---|
| ZeroMQ | PULL (bind; n PUSH connect) | separate PUB socket (n SUB connect); the idiomatic "secondary socket" |
| WebSocket | one server, many clients in submit role | same server broadcasts to all clients in subscribe role (role negotiated per connection or via path /submit vs /events) |
| Raw TCP | accept-loop many clients (fix the single-accept() limitation) + a connection registry | either a second port for events, or write events back on each subscriber's socket from the registry |
Concrete answer — "one server, many subscribers, broadcast status to all": this maps directly onto WebSocket and is the recommended primary. WebSocketServer already accepts many clients on one port, tracks them (getConnections()), and has built-in broadcast(...); reuse the same server — onMessage ingests keys, an outbound broadcast(status) fans results/progress out to every connected subscriber (optionally a dedicated events-only server on its own port). Note it is not a literal "same class, send instead of listen" for the other transports: raw TCP needs a new broadcaster (accept-loop + connection registry + length-prefixed framing + per-client write isolation, because the existing class accepts a single client and only reads fixed 32-byte frames), and ZMQ uses a separate PUB socket (socket types are fixed — a PULL cannot send). Distinguish submit vs subscribe roles per connection (HELLO message or /submit vs /events path) with separate auth scopes — a pure status listener must not be able to submit keys.
Backpressure + QoS (important for fan-out): a slow listener must never stall the finding hot path. Split QoS by event kind:
- PROGRESS is lossy-OK: best-effort broadcast, drop for slow subscribers (ZMQ PUB with a send HWM does this natively; WebSocket: bounded per-subscriber queue, drop-oldest).
- RESULT (founds) is precious: keep the existing durable log/file sink as the source of truth so a found key is never lost to a dropped notification, and optionally offer an at-least-once acked delivery for subscribers that need it.
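The "bounded per-subscriber queue, drop-oldest" policy for lossy events could look roughly like the sketch below. `DropOldestOutbox` is a hypothetical name, and the sketch assumes one outbox per subscriber that is drained by that subscriber's own writer thread.

```java
import java.util.concurrent.ArrayBlockingQueue;

/** Hypothetical per-subscriber outbox: bounded, and on overflow the oldest lossy event is discarded. */
final class DropOldestOutbox<E> {
    private final ArrayBlockingQueue<E> queue;
    private long dropped;

    DropOldestOutbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks the publisher; returns false if an older event had to be dropped to make room. */
    synchronized boolean publish(E event) {
        boolean lossless = true;
        while (!queue.offer(event)) {          // queue is full
            if (queue.poll() != null) {        // evict the oldest pending event
                dropped++;
                lossless = false;
            }
        }
        return lossless;
    }

    /** Called by the subscriber's writer thread; returns null when nothing is pending. */
    E poll() {
        return queue.poll();
    }

    synchronized long droppedCount() {
        return dropped;
    }
}
```

This is only appropriate for PROGRESS-style events. A RESULT would still go to the durable log/file sink first, so a drop here can lose a notification but never a found key.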
Implementation seam — a connection package with two sub-packages (named from the server's perspective):
- connection.receiver: accepts incoming private keys (a KeySource that feeds the existing queue). Today's socket key producers are this role.
- connection.sender: publishes CONFIG-on-connect + the event stream to all subscribers (a ResultSink/event-bus the consumer publishes to, plus a subscriber registry and per-subscriber bounded queues).
Each transport provides the impls it needs in each sub-package (see the adapter matrix below). The consumer publishes to the transport-agnostic ResultSink; producers and listeners are independent clients — a listener need not be a producer and vice versa.
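A rough shape for the transport-agnostic sender side, as a sketch only: the `ResultSink` name comes from this plan, while `EventKind`, `FinderEvent`, `BroadcastResultSink` and the string payload are assumptions made up for the example.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical event kinds, mirroring the names used in this plan. */
enum EventKind { CONFIG, KEY_QUEUED, GENERATION_STARTED, VANITY_HIT, HIT, PROGRESS }

/** Self-describing event; a real schema would carry typed fields instead of a string. */
record FinderEvent(EventKind kind, String payload) {}

/** Outbound seam (connection.sender): the consumer only ever calls publish. */
interface ResultSink {
    void publish(FinderEvent event);
}

/** In-memory fan-out: every event goes to every subscriber, with no per-client routing state. */
final class BroadcastResultSink implements ResultSink {
    private final List<Consumer<FinderEvent>> subscribers = new CopyOnWriteArrayList<>();
    private final FinderEvent configOnConnect;

    BroadcastResultSink(FinderEvent configOnConnect) {
        this.configOnConnect = configOnConnect;
    }

    void subscribe(Consumer<FinderEvent> subscriber) {
        subscriber.accept(configOnConnect); // CONFIG-on-connect, before any live event
        subscribers.add(subscriber);
    }

    @Override
    public void publish(FinderEvent event) {
        for (Consumer<FinderEvent> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }
}
```

In a real transport adapter each subscriber callback would enqueue into a bounded per-subscriber queue rather than write to a socket directly, so `publish` stays cheap for the consumer.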
Adapter matrix (transport × direction) — do we implement a sender per receiver? Conceptually yes: each transport can have an ingestion adapter (KeySource, today's receiver key producers already are this) and a notification adapter (ResultSink). But these are two different roles, not a receiver subclassed into a sender, and the directions are not symmetric per transport — so it is not 6 mandatory classes:
| Transport | Ingestion (KeySource) | Notification (ResultSink) | Notes |
|---|---|---|---|
| WebSocket | onMessage | broadcast(...) | a single dual-role class can do both (bidirectional, multi-client, broadcast built-in) |
| ZeroMQ | PULL adapter | separate PUB adapter | two thin adapters; socket types are fixed |
| Raw TCP | existing single-accept reader | new broadcaster (accept-loop + registry + length-prefixed framing + per-client write isolation) | sender is much heavier than the receiver; not a flip |
Fill the matrix by value, not parity: WebSocket both (API + web UI) first; ZMQ both (PULL+PUB) for cluster scale; raw TCP KeySource now, ResultSink optional/last (most work, least unique benefit). What makes the senders cheap is a shared protocol codec + the shared ResultSink event model, so each transport adapter is thin byte-moving plumbing.
Current state (researched). Runtime metrics live in two places, both written from many threads:
- RuntimeStatistics (shared; created by Finder, injected into producers + consumer): batchesByProducer (per-producer label → AtomicLong, incremented by each producer on every dispatched batch) and a late-bound runningProducersGauge (computed from producer RUNNING states).
- ConsumerJava AtomicLong counters, mutated on the consumer hot path: checkedKeys (per address check), checkedKeysSumOfTimeToCheckContains, hits, vanityHits, consumerReadyCount, producerBlockedCount; plus derived keysQueue.size(), runningConsumerCount(), startTime.
- These are surfaced pull/poll-only: ConsumerJava.startStatisticsTimer() runs a ScheduledExecutorService every printStatisticsEveryNSeconds (default 60s), reads everything, renders a string via Statistics.createStatisticsMessage(...), and logs it via LOGGER.info. The log is the only sink.
Proposed. Add a thread-independent StatisticsPublisher that pushes updates to subscribers via the ResultSink (connection.sender), without touching the hot path:
- Extract the per-tick aggregation into an immutable StatisticsSnapshot (uptime, keys, keys/sec, avg contains-time, batches-by-producer, producers running, consumers running, ready, blocked, queue size, hits, vanity hits). Feed the same snapshot to both the existing log line and the publisher: one source of truth, log behaviour unchanged.
- The publisher owns its own thread/scheduler, reads the lock-free atomics, and broadcasts. Producers/consumer keep doing only atomic increments as today; a slow subscriber can never stall them (bounded per-subscriber queues, drop the lossy snapshot).
Critical: "fire on every change" must be coalesced, not literal. checkedKeys increments millions of times/sec — publishing per-increment is the same firehose the per-produced-key non-goal forbids. So split by rate:
- Discrete, low-rate changes → fire immediately on change: a HIT, a VANITY_HIT, a producer started/stopped, KEY_QUEUED, GENERATION_STARTED. (These are exactly the discrete events in the schema above.)
- High-frequency aggregate counters (checkedKeys/keys-per-sec, queue size, avg-time) → publish a coalesced StatisticsSnapshot on a tick (the configured cadence, or faster, e.g. 1 s, for a responsive UI), never per-increment. A dirty flag skips ticks where nothing changed (idle); optionally emit only the fields that changed.
So the publisher's two outputs map onto the existing event set: discrete changes = HIT/VANITY_HIT/KEY_QUEUED/GENERATION_STARTED; the coalesced snapshot = PROGRESS/STATS. (The separate address-file-import path has its own ReadStatistic progress, which could publish the same way if a UI ever drives imports.)
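The coalescing rule can be sketched as below. The snapshot here carries only three of the planned fields, and instead of a separate dirty flag the sketch simply compares against the last published snapshot; both choices are simplifications for illustration, and the constructor signature is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.IntSupplier;

/** Reduced, hypothetical snapshot; the planned one has more fields (uptime, keys/sec, ...). */
record StatisticsSnapshot(long checkedKeys, long hits, int queueSize) {}

/** Reads the lock-free counters once per tick and publishes only when something changed. */
final class StatisticsPublisher {
    private final AtomicLong checkedKeys;
    private final AtomicLong hits;
    private final IntSupplier queueSize;
    private final Consumer<StatisticsSnapshot> sink;
    private StatisticsSnapshot last; // touched only by the publisher's own thread

    StatisticsPublisher(AtomicLong checkedKeys, AtomicLong hits, IntSupplier queueSize,
                        Consumer<StatisticsSnapshot> sink) {
        this.checkedKeys = checkedKeys;
        this.hits = hits;
        this.queueSize = queueSize;
        this.sink = sink;
    }

    /** One tick; returns true if a snapshot was published, false for an idle tick. */
    boolean tick() {
        StatisticsSnapshot now = new StatisticsSnapshot(checkedKeys.get(), hits.get(), queueSize.getAsInt());
        if (now.equals(last)) {
            return false; // nothing changed since the last publish
        }
        last = now;
        sink.accept(now);
        return true;
    }
}
```

The hot path is untouched: producers and the consumer keep incrementing their `AtomicLong`s, and `tick()` would be driven by the publisher's own scheduler (for example `scheduleAtFixedRate` on a `ScheduledExecutorService`).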
- Finding service that accepts keys / ranges / hash160s (do first). New run mode (CCommand, e.g. FindService) wiring the consumer to a network key source instead of an in-process producer queue. Introduce a KeySource abstraction feeding the existing queue; producers (ProducerJava, ProducerOpenCL, third-party) become clients of the protocol. Add SUBMIT_RANGE (reuse the incremental expander) and SUBMIT_HASH160 (checker mode).
- Notification / result feedback channel (second): connection.sender. A structured outbound event stream published from the consumer via a transport-agnostic ResultSink/event-emitter: CONFIG on connect, then KEY_QUEUED, GENERATION_STARTED, VANITY_HIT, HIT (founds), and optional aggregate PROGRESS; configured results of interest only, never per produced key. Optional and backpressure-safe (never blocks the finding hot path). Broadcast to all subscribers over WebSocket / ZMQ PUB / TCP. This is what lets a separate listener or web UI watch a scan live instead of tailing logs.
- Web UI (third; ties into the Cross-platform GUI entry below). Configure/start/stop a scan and monitor it (live keys/sec, range progress, hits, vanity) by subscribing to the Deliverable-2 stream over the same WebSocket. JSON config stays as the headless interface.
- Preserve the layered architecture: the network source and feedback emitter are new edges; introduce them as their own small packages/interfaces (KeySource → existing LinkedBlockingQueue; ResultSink ← consumer events), not by threading sockets through ConsumerJava.
- The feedback channel publishes events; UI / CLI / remote producer are just subscribers. Finding must not depend on any UI.
- Security first: ingestion and feedback move private keys and hits. Bind to localhost by default; require auth (token / mTLS) and TLS before any remote-exposed default; document the exposure model in docs/wire-protocol.md. Add bounded queues so a hostile/fast sender cannot OOM the finder.
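For the ingestion side, "bounded queues so a hostile/fast sender cannot OOM the finder" could be as small as the sketch below. `KeySource` is the name used in this plan, but its single-method shape, `BoundedQueueKeySource`, and the reject-when-full policy are assumptions for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical inbound seam: a transport hands over raw keys, the finder drains the queue. */
interface KeySource {
    /** Returns false when the key was rejected (wrong size or queue full). */
    boolean submit(byte[] privateKey);
}

final class BoundedQueueKeySource implements KeySource {
    static final int PRIVATE_KEY_BYTES = 32;
    private final BlockingQueue<byte[]> queue;

    BoundedQueueKeySource(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity); // the capacity is the memory bound
    }

    @Override
    public boolean submit(byte[] privateKey) {
        if (privateKey == null || privateKey.length != PRIVATE_KEY_BYTES) {
            return false; // malformed input never reaches the queue
        }
        return queue.offer(privateKey.clone()); // non-blocking: a full queue rejects instead of growing
    }

    BlockingQueue<byte[]> queue() {
        return queue;
    }
}
```

Whether a full queue should reject (as here), block the sending connection, or answer with an explicit protocol-level "busy" is a design decision this sketch does not settle.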
BAF is currently a CLI + JSON-config tool. Adding a minimal GUI would make it accessible to users who are not comfortable with JSON configuration files. The target experience: download one file, double-click, pick a GPU or CPU, point at the Light DB, click Start — and see a live counter of keys/second and any hits.
Scope for a first version (desktop only):
- 1-GPU or 1-CPU random key generator (single producerOpenCL or producerJava with keyProducerJavaRandom)
- Light DB read-only check (TRUNCATED_LONG_64 or BLOOM backend)
- Live statistics panel: keys/sec, total scanned, uptime, hits
- Start / Pause / Stop controls
- No LMDB import/export UI — the CLI covers that; the GUI is scan-only
Cross-platform UI framework investigation (must settle this before implementing):
The choice of UI toolkit determines whether the same codebase can later reach Android.
| Framework | Desktop (Win/Mac/Linux) | Android | Notes |
|---|---|---|---|
| JavaFX (OpenJFX) | ✅ | ❌ (not natively) | Standard Java desktop toolkit; good styling via CSS; ships as Maven dep (org.openjfx); well-documented |
| JavaFX + Gluon Mobile | ✅ | ✅ | Gluon's client-maven-plugin cross-compiles JavaFX to Android/iOS via GraalVM Native Image; complex build matrix but same Java/JavaFX codebase throughout |
| Compose Multiplatform (JetBrains) | ✅ | ✅ | Kotlin-first; supports Desktop + Android + iOS + Web from one codebase; modern declarative UI; BAF's Java core would be called from a Kotlin UI module — interop is clean |
| Swing | ✅ | ❌ | Still works; ugly by default; no path to mobile; not recommended for new work |
| SWT | ✅ | ❌ | Eclipse toolkit; native widgets; no mobile path |
Recommended investigation order:
- JavaFX (desktop only first) — lowest friction: pure Java, Maven dep, no Kotlin, no native build. Delivers the desktop goal immediately. Prototype this first.
- Compose Multiplatform — if Android is a real goal, evaluate whether the Kotlin UI layer calling the Java BAF core (via a thin adapter module) is maintainable. The BAF library itself stays Java 21; only the UI module is Kotlin. This is the cleanest path to a single-source desktop + Android app.
- Gluon Mobile — only investigate if JavaFX is chosen for desktop and Android is required without rewriting the UI in Kotlin. The GraalVM Native Image build for Android is heavyweight to maintain.
Android considerations:
- BAF uses Java 21 features (records, sealed types, text blocks, pattern matching) which are not available on all Android versions via d8/r8. The GPU pipeline (JOCL) has no Android equivalent, so the Android version would be CPU-only (KeyProducerJavaRandom + ConsumerJava with TRUNCATED_LONG_64).
- A pragmatic split: ship the desktop GUI as a standalone JavaFX module; ship the Android app as a separate Kotlin + Compose module that uses a stripped-down BAF core (CPU-only, no JOCL dependency, Java 8-compatible subset or a dedicated baf-core-android artifact).
- The checker-service TODO above would let the Android app act as a remote viewer: the heavy scanning runs on a desktop/server and the Android app displays live stats and hits over a WebSocket, without running the scan itself.
Module layout when implemented:
BitcoinAddressFinder/
├── baf-core/ # existing library code (producer/consumer/persistence/…)
├── baf-cli/ # existing Main.java entry point
├── baf-gui-desktop/ # JavaFX desktop app (new)
└── baf-gui-android/ # Android / Compose module (new, later)
Until the investigation settles on a toolkit, no UI code should be added to the existing modules. Record the toolkit decision and its rationale in docs/gui-toolkit-decision.md before starting implementation.
- Pluggable OpenCL backend behind a small device/library abstraction (migrated from GitHub issue #22 "Can jogamp be used to improve OpenCL handling?"). Today the GPU layer is bound directly to JOCL (org.jocl, jocl 2.0.6): opencl/OpenCLBuilder enumerates platforms/devices via raw org.jocl.CL calls, opencl/OpenCLContext compiles/runs the kernel, opencl/OpenClTask sets kernel args, and opencl/OpenCLDevice is a hand-written value type. The goal is a tiny internal API (interface) for "list platforms/devices" + "build a context / run the kernel grid" so the OpenCL implementation can be switched between backends without touching producers/config.
  - Step 1: define the abstraction over the existing JOCL impl. Extract a minimal interface set (e.g. an OpenClBackend / device-enumeration + context-and-grid-execution contract) and make the current JOCL code the first implementation behind it. No behaviour change; pure introduction of the seam. Keep jocl confined to the opencl package (already enforced by the joclConfinedToOpencl ArchUnit rule).
  - Step 2: wire a second backend behind the same API. Add JogAmp JOCL (com.jogamp.opencl, OO CLPlatform/CLDevice/CLContext) as an alternative implementation selectable at runtime/config, so the two bindings can be A/B'd (device enumeration, kernel build/run, native-lib packaging). Decide based on results whether JogAmp simplifies the layer enough to become the default or stays optional.
  - Open questions to settle when picked up: where the backend is selected (new configuration field vs auto-detect), how kernel source/args map across the two APIs, and the native-library/packaging impact of adding JogAmp.
- Add a test exercising two OpenCL devices simultaneously (migrated from GitHub issue #6). Current OpenCL coverage drives a single device (ProbeAddressesOpenCLTest, gated by OpenCLPlatformAssume). Add a test that runs two producerOpenCL instances concurrently (the multi-device path the project supports via multiple producerOpenCL entries) and asserts both produce and feed the consumer correctly at the same time.
  - Scope: two physical OpenCL devices, e.g. a machine with two GPUs, or one GPU plus a CPU that exposes an OpenCL device. (Not two logical handles to the same device.) Each producerOpenCL targets a distinct (platformIndex, deviceIndex).
  - Availability gate: the test must self-skip unless ≥ 2 distinct physical OpenCL devices are enumerated (extend the OpenCLPlatformAssume pattern). Most CI has 0–1 device, so it will usually skip, like the existing OpenCL tests; it is meant to run on a real dual-device host.
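For the pluggable OpenCL backend item above, Step 1 could start from a seam roughly like this. Everything here is hypothetical: `DeviceInfo`, the `id()` selector, `OpenClBackends` and the `StaticBackend` test double are invented for the sketch, and the context/grid-execution half of the contract is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical backend-neutral device descriptor (the repo's own value type is opencl/OpenCLDevice). */
record DeviceInfo(int platformIndex, int deviceIndex, String name) {}

/** Enumeration half of the seam; a real contract would also build a context and run the kernel grid. */
interface OpenClBackend {
    String id();                    // e.g. "jocl" or "jogamp" (assumed config values)
    List<DeviceInfo> listDevices();
}

/** Test double that returns a fixed device list; stands in for the JOCL and JogAmp implementations. */
record StaticBackend(String id, List<DeviceInfo> listDevices) implements OpenClBackend {}

/** Registry that resolves a configured backend id to an implementation. */
final class OpenClBackends {
    private final Map<String, OpenClBackend> byId = new LinkedHashMap<>();

    OpenClBackends(List<OpenClBackend> backends) {
        for (OpenClBackend backend : backends) {
            byId.put(backend.id(), backend);
        }
    }

    OpenClBackend select(String configuredId) {
        OpenClBackend backend = byId.get(configuredId);
        if (backend == null) {
            throw new IllegalArgumentException(
                    "unknown OpenCL backend '" + configuredId + "', available: " + byId.keySet());
        }
        return backend;
    }
}
```

Selecting by an explicit id corresponds to the "new configuration field" option in the open questions; auto-detection would replace `select` with a probe over the registered backends.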
- jqwik pin policy: see ../workspace/policies/jqwik-prompt-injection.md. jqwik.version ≤ 1.9.3 is mandatory.
- @VisibleForTesting audit. 10 sites remaining (down from 19; see workspace crossrepostatus.md for the site-by-site audit). All 10 are legitimate per the design-fit review; no further cleanup recommended unless the source moves.
- Null-safety further refinement. JSpecify + NullAway are enforced at compile time in strict JSpecify mode with the extra options CheckOptionalEmptiness, AcknowledgeRestrictiveAnnotations, AcknowledgeAndroidRecent, AssertsEnabled (see pom.xml). Every package carries an explicit @NullMarked via package-info.java so the convention is visible to non-NullAway tools (IDEs, Kotlin, Checker Framework). The 50 @Nullable sites currently in the codebase are all legitimate. OpenCLContext.getOpenClTask() returns Optional<OpenClTask> rather than @Nullable OpenClTask to surface the lifecycle state in the type. Open follow-up: review any future-added public API surfaces for places where @Nullable would be more precise than the implicit non-null default; consider whether further @Nullable T returns should migrate to Optional<T> on a case-by-case basis (the project's established convention is @Nullable; Optional is used selectively for lifecycle-shaped APIs).
- SpotBugs effort=Max + threshold=Low: ✅ enforced at the gate (76fd1a7). pom.xml <effort>Max</effort> + <threshold>Low</threshold>; spotbugs:check is part of mvn verify and fails on any unsuppressed finding. The full clearing chain (191 → 0) is recorded in ../workspace/crossrepostatus.md under "SpotBugs Max+Low". spotbugs-exclude.xml carries narrow <Match> blocks with rationale for every structural false positive (Lombok-USBR, project-wide CRLF mitigation, generic-erasure CHECKCAST in keyproducer, @FireAndForget Future-DLS, Producer interface heterogeneous throws, drain-pattern PRMC, CWE-338 demo RNG, secp256k1 curve params, JOCL-spec nulls, preserved-for-revival private helpers, plus the two opt-in lifecycle items below).
- Mutation-testing threshold expansion: the gate now covers a verified-100% 15-class list (util.BitHelper + util.PrivateKeyTooLargeException + model.PublicKeyBytes/AddressToCoin/AddressType + the 8 custom exceptions + statistics.Statistics/ReadStatistic + configuration.CKeyProducerJavaIncremental; 65 mutations, pitest-maven 1.25.4). The earlier net.ladenthin.bitcoinaddressfinder.BitHelper target was stale (BitHelper moved to util/ in the restructure → gate matched nothing) and has been fixed. model.Hash160 is deliberately excluded (its fast/slow hash paths are identical, so the if(useFast) negate mutant is equivalent). Still open (optional): config getters covered only by producer/keyproducer integration tests, and the larger orchestration classes (producer / consumer / engine / opencl) which need heavier fixtures.
- Additional ArchUnit rules to consider: public-fields-final, noTestFrameworksInProduction, loggersArePrivateStaticFinal, noPackageCycles, the full layeredArchitecture() rule, per-module banned-imports (joclConfinedToOpencl, networkInputLibsConfinedToKeyproducer, lmdbConfinedToPersistenceAndIo), and the no-public-mutable-static-state rule (noPublicMutableStaticFields: public static fields must be final; 0 violations, pure drift-guard) are all DONE. No further ArchUnit rules open.
- Cross-repo code-quality TODOs: see ../workspace/policies/code-quality-todos.md for the canonical @VisibleForTesting design-fit review (BAF site-by-site audit captured in ../workspace/crossrepostatus.md), package hierarchy review, and class/method naming review.
- Drop the catch-rethrow THROWS_METHOD_THROWS_RUNTIMEEXCEPTION suppression in spotbugs-exclude.xml once a SpotBugs release ships with PR #4087 merged. That PR fixes the detector to match the exception-handler ASTORE register against the local variable being thrown, so the catch-then-rethrow-of-the-same-RuntimeException pattern on AbstractProducer.produceKeys and LMDBPersistence.addresses will stop firing the warning. Tracking issue: spotbugs/spotbugs#3918. Verification when removing the suppression: run mvn spotbugs:check and confirm zero THROWS_METHOD_THROWS_RUNTIMEEXCEPTION findings on those two methods.
- (Unblocked, optional) Drop the project-wide OPM_OVERLY_PERMISSIVE_METHOD suppression in spotbugs-exclude.xml. The package-architecture refactor it was waiting on has now landed (the single root package was split into the layered packages; see "Done" history below), so cross-layer call sites are now stable and OPM findings would be actionable signals rather than correct-but-unstable noise. Re-enabling is optional: visibility minimisation is not a project goal (the original tightening pressure was fb-contrib noise, not a requirement). If re-enabled, delete the project-wide <Match> and triage the resulting findings (snapshot at suppression time: ~33 sites: Main CLI internal helpers, test-only public surface, abstract/concrete constructors, internal helpers, one enum.valueOf false positive).
A bytecode-level (jdeps) audit of the compiled package graph found one latent
upward coupling that the layered rule did not catch: util.Bech32Helper
statically imported io.AddressTxtLine.BITCOIN_CASH_PREFIX — a Foundation→io
edge (latent util↔io cycle) hidden from ArchUnit only because the
static final String constant is inlined at compile time. The constant moved
to the constants leaf (constants.AddressConstants.BITCOIN_CASH_PREFIX); both
io and util now depend strictly downward on it.
With that edge gone, the layeredArchitecture() access lists were tightened to
the exact set of layers that reach each layer today (verified by jdeps):
Pipeline only by Orchestration; InputOutput only by
Orchestration/Pipeline/Capabilities (not Entry); Config not by
Foundation; Foundation not by Config. Any new unintended cross-layer edge
now fails the build. (jllama and plugin were audited the same way and were
already exact — no slack found.)
The 48 classes that previously sat flat in the root
net.ladenthin.bitcoinaddressfinder package were split (via git mv,
history preserved) into dedicated layered packages so the package
boundaries align with the architectural layers:
- Foundation: model (Hash160, PublicKeyBytes, AddressToCoin, AddressType), util (KeyUtility, PrivateKeyValidator, Bech32Helper, Base36Decoder, BitHelper, ByteBufferUtility, NetworkParameterFactory, ByteConversion, EndiannessConverter, PrivateKeyTooLargeException), core (Interruptable, Startable, FireAndForget, InterruptedRuntimeException), secret (SecretSupplier, RandomSecretSupplier, NoMoreSecretsAvailableException, BIP39Wordlist), statistics (Statistics, ReadStatistic).
- InputOutput: io (AbstractPlaintextFile, AddressFile, AddressTxtLine, SecretsFile, SeparatorFormat, FileHelper, AddressFormatNotAcceptedException).
- Pipeline: producer (Producer, AbstractProducer, ProducerJava, ProducerOpenCL, ProducerJavaSecretsFiles, ProducerState, ProducerStateProvider), consumer (Consumer, ConsumerJava).
- Orchestration: engine (Finder, Shutdown), command (AddressFilesToLMDB, LMDBToAddressFile).
- Absorbed into existing capability packages: OpenCL runtime (OpenCLContext, OpenClTask, OpenCLGridResult, ReleaseCLObject) → opencl; BIP39KeyProducer → keyproducer.
Test classes were mirrored into the same packages as their subjects
(standard Maven layout). The only in-class change was moving the pure
helper calculateSecretKey(BigInteger, int) from AbstractProducer to
KeyUtility (foundation) to break the single opencl → producer
back-edge; everything else was package moves + import updates +
cross-layer public promotions. The secret foundation package was
introduced to host the secret/mnemonic primitives that KeyUtility
depends on, breaking the util ↔ keyproducer and producer ↔ keyproducer
cycles.
Enforced by the new layeredArchitecture() ArchUnit rule in
BitcoinAddressFinderArchitectureTest (strict top-to-bottom: Entry →
Orchestration → Pipeline → Capabilities → InputOutput → Foundation →
Config → Constants), alongside the retained noPackageCycles and the
targeted leaf rules. All 13 architecture rules green; module-info.java
exports updated for the new packages.
Three structural refactors landed alongside the Max+Low gate flip so the
remaining MDM_THREAD_YIELD sites were resolved at source rather than
suppressed. The Javadoc on each touched class records the rationale; the
ArchUnit comment in BitcoinAddressFinderArchitectureTest documents the
two Thread.sleep sites that remain (and why they are correct).
- AbstractProducer.waitTillProducerNotRunning: replaced the spin on state==RUNNING + 10 ms sleep with a CountDownLatch awaited via cProducer.shutdownTimeoutSeconds. The new signalNotRunning() helper counts down at both NOT_RUNNING transitions in run() and serves as the test seam. Eliminates up-to-10 ms shutdown-wake-up latency per producer; deletes the 17-line apology Javadoc on WAIT_TILL_NOT_RUNNING_RESTORES_INTERRUPT_FLAG (commit 892b76a).
- ConsumerJava.consumeKeysRunner: extracted the per-batch processing into a private processBatch helper, leaving consumeKeys(ByteBuffer) as a drain-only utility for tests. The runner now waits on keysQueue.poll(queuePollTimeoutMillis, MILLISECONDS) between drain cycles instead of Thread.sleep(queuePollTimeoutMillis), so the worker wakes the instant a producer enqueues. Idle-to-active latency drops from up-to-100 ms (default) to ~0; steady-state throughput unchanged (commit 99f390f).
- ProducerOpenCL.processSecretBase: replaced the spin on ThreadPoolExecutor.getActiveCount() with the JCIP §8.3.3 BoundedExecutor pattern: a Semaphore(maxResultReaderThreads) acquired before execute() and released in the runnable's outer finally. The release-on-rejection path is wrapped via a submitted flag so a shutdown-race RejectedExecutionException does not leak a permit. Important: the spin-wait was the only backpressure on the result-reader pool's unbounded inner LinkedBlockingQueue; without the semaphore the GPU would have submitted faster than the readers could drain, holding result buffers in memory indefinitely. The semaphore is therefore the only correct backpressure primitive here, not just a polish. getFreeThreads() now returns submitSlot.availablePermits() (same semantics). Removes up-to-100 ms GPU-pacing latency (commit 09c5d52).

  Config breaking change (acknowledged): delayBlockedReader was the polling-delay knob feeding the deleted spin. The Semaphore wakes immediately, so the field is removed from CProducerOpenCL and from the three example JSON configs (examples/config_Find_1OpenCLDevice.json, config_Find_1OpenCLDeviceAnd2CPUProducer.json, src/test/resources/testRoundtrip/config_Find_1OpenCLDevice.json). External configs referencing it will deserialize-fail; users should delete the line.
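The semaphore-based backpressure described for ProducerOpenCL.processSecretBase follows the general BoundedExecutor shape sketched below. This is a generic illustration of the pattern, not the code in ProducerOpenCL; the class and method names are assumptions, and only `submitSlot` and the `submitted` flag echo the description above.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Semaphore;

/** Generic BoundedExecutor sketch: at most maxInFlight tasks are queued or running at any time. */
final class BoundedExecutor {
    private final Executor delegate;
    private final Semaphore submitSlot;

    BoundedExecutor(Executor delegate, int maxInFlight) {
        this.delegate = delegate;
        this.submitSlot = new Semaphore(maxInFlight);
    }

    /** Blocks the caller while all slots are busy; wakes as soon as a running task finishes. */
    void submit(Runnable task) throws InterruptedException {
        submitSlot.acquire();
        boolean submitted = false;
        try {
            delegate.execute(() -> {
                try {
                    task.run();
                } finally {
                    submitSlot.release(); // normal path: the task returns its own permit
                }
            });
            submitted = true;
        } finally {
            if (!submitted) {
                submitSlot.release(); // execute() threw, e.g. a rejection during shutdown: do not leak the permit
            }
        }
    }

    int freeSlots() {
        return submitSlot.availablePermits();
    }
}
```

Because the submitter parks inside `acquire()` instead of polling, there is no delay knob to tune, which is consistent with the removal of delayBlockedReader.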
All blockers cleared and <arg>-Werror</arg> is on in pom.xml:
- Original 6-item pre-flip warnings: cleared per the 4274c25 / 5e3f6a8 / 523fc79 / 62603d3 / da4cab7 / 84b35cb tranche (Thread.getId; deprecated jocl CL_DEVICE_QUEUE_PROPERTIES; two this-escapes in KeyProducerJavaSocket/Zmq; Closeable.close InterruptedException; explicit close on auto-closeable in OpenClTask).
- 14 Checker Framework [type.anno.before.modifier] warnings: f37f162 (moved @NonNull/@Nullable after modifiers in 8 files).
- 5 final sun.misc.Unsafe proprietary-API warnings: 2881c96 (deleted ByteBufferUtility#freeByteBuffer entirely; the eager Unsafe.invokeCleaner path was already a no-op on OpenJ9 / GraalVM Native Image / Android per its own Javadoc, and HotSpot now joins them via the JVM's built-in Cleaner).
- Error Prone bug-pattern promotions to ERROR: 12 high-confidence patterns at pom.xml:344.
- javac -parameters arg: pom.xml:315.
- javac --release N: main compile <release>21</release> (pom.xml:313); module-info-compile execution stays at --release 9; default-testCompile overrides back to <source>/<target> because tests legitimately import jdk.internal.ref.Cleaner and sun.nio.ch.DirectBuffer.
- Mutation-testing threshold enforcement (PIT): runs every CI build with <mutationThreshold>100</mutationThreshold>; <targetClasses> now an explicit 15-class verified-100% list (was the stale BitHelper target; see the open "Mutation-testing threshold expansion" item above for the current list and exclusions).
- Checker Framework as a second static-nullness pass: Nullness Checker (4.1.0) alongside NullAway. src/etc/checker/objects.astub overrides the CF 4.1.0 Objects.requireNonNull stub. JOCL-wrapping classes (OpenCLContext, OpenClTask, opencl/OpenCLBuilder) carry class-level @SuppressWarnings({"nullness:argument", "nullness:dereference.of.nullable"}). KeyProducerJavaWebSocket carries the documented this-escape suppression (Socket / Zmq were refactored to the Startable lifecycle so no suppression is needed there). PublicKeyBytes.equals(Object) takes @Nullable Object; BIP39Wordlist.getWordListStream() returns @Nullable InputStream.
- JPMS module-info.java: lives in src/main/java9/ (a separate source root) so javac at source/target 21 does not auto-activate module mode on the test sources. The module-info-compile execution is bound to prepare-package rather than compile so module-info.class is not present in target/classes/ while the test sources compile or run. The module opens net.ladenthin.bitcoinaddressfinder.configuration to com.fasterxml.jackson.databind so Jackson can populate the configuration POJOs reflectively on any non-public members added later. Module-level @NullMarked was intentionally NOT added: the per-package annotation covers the same scope and avoids pulling JSpecify into the module's requires graph. Local-dev caveat: mvn test after mvn package without an intervening mvn clean fails with IllegalAccessError; CI is unaffected because the Build and Test jobs run in separate runners with fresh checkouts.
- Banned-API enforcement: Maven Enforcer bannedDependencies + dependencyConvergence (pom.xml:268-283); ArchUnit noSystemExit / noNewRandom / noThreadSleep rules (BitcoinAddressFinderArchitectureTest:137,164,178); sun.* / com.sun.* / jdk.internal.* import ban (BitcoinAddressFinderArchitectureTest:90-97).
- ArchUnit additions: public-fields-final (BitcoinAddressFinderArchitectureTest:120-130).
- HashSet snapshot: persistence/inmemory/HashSetAddressPresence.java (presence-only).
- TRUNCATED_LONG_64 backend: persistence/inmemory/TruncatedLong64SortedArrayPresence.java (256-bucket sorted long[]).
- BloomFilter extraction: persistence/bloom/BloomFilterAccelerator.java (standalone wrapper; LMDBPersistence no longer carries the Bloom fields directly).
- Backend config selector: configuration/AddressLookupBackend.java enum + CLMDBConfigurationReadOnly.addressLookupBackend field + ConsumerJava.java:183-189 dispatch. Default remains BLOOM.
- Layered/chained backend contract: persistence/AddressPresence.java (minimal "is this address present?") + persistence/AddressLookup.java (extends with getAmount). Decorators fall through on positive answers; self-contained snapshots return requiresBackend()==false after populateFrom(lmdb) and the LMDB env is closed.
- Lookup benchmark (JMH): src/test/java/net/ladenthin/bitcoinaddressfinder/benchmark/AddressLookupBenchmark.java (@Param({"LMDB_ONLY","BLOOM","HASHSET","TRUNCATED_LONG_64"}), Mode.AverageTime, OutputTimeUnit.NANOSECONDS, 0xC0FFEE seed + 2 048 LMDB entries + Bloom FPP 0.01).
- Removed dead loadToMemoryCacheOnInit from all 4 stale example JSONs (config_AddressFilesToLMDB.json, config_Find_1OpenCLDevice.json, config_Find_1OpenCLDeviceAnd2CPUProducer.json, config_Find_SecretsFile.json) and from README.md.
Historical context (kept for the pre-Bloom design rationale). A HashSet-based in-memory persistence DID exist pre-Git but was removed in pre-Git commit f153a1bdb363c16bbe86134d360f4c2e4423d3e7 ("Replace in-memory HashSet with Bloom filter for address lookup optimization", 2025-07-10), ~10 months before this repo's boot commit (2c8e9f1, 2026-05-08). That commit is not in this repository — only its post-state was imported. The current HashSetAddressPresence is the resurrection of that earlier design with a cleaner contract (no LMDB coupling once populated). The same removal commit also contained a commented-out sortedAddressCache variant using Arrays.binarySearch(...) — that is what TruncatedLong64SortedArrayPresence ships, with the additional optimization of truncating each hash160 to its first 8 bytes (256-bucket sharding plus the truncation gives a ~7.5×10⁻¹¹ false-positive rate at Full DB scale — negligible in practice). Memory cost reference: HashSet shape was ~50 B/entry (ByteBuffer wrapper + 20-byte payload + HashMap.Node), so ~6.6 GB for the README's 132M-entry light db and ~70 GB for the 1.377B-entry full db; TRUNCATED_LONG_64 cuts that roughly 10× (~660 MB / ~7 GB respectively). Future history lookups for the pre-Bloom design need access to that external repository.
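The truncated sorted-array idea and its false-positive estimate can be illustrated with the sketch below. It is not the code of TruncatedLong64SortedArrayPresence: the class name, the constructor taking all hash160s at once, and the choice of the leading byte as bucket index are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch: keep only the first 8 bytes of each 20-byte hash160, in 256 sorted long[] buckets. */
final class TruncatedLongPresenceSketch {
    private final long[][] buckets = new long[256][];

    TruncatedLongPresenceSketch(byte[][] hash160s) {
        int[] counts = new int[256];
        for (byte[] hash : hash160s) {
            counts[hash[0] & 0xFF]++;
        }
        for (int i = 0; i < 256; i++) {
            buckets[i] = new long[counts[i]];
        }
        int[] next = new int[256];
        for (byte[] hash : hash160s) {
            int bucket = hash[0] & 0xFF;
            buckets[bucket][next[bucket]++] = truncate(hash);
        }
        for (long[] bucket : buckets) {
            Arrays.sort(bucket); // binarySearch below requires sorted input
        }
    }

    /** First 8 bytes of the hash, read big-endian. */
    static long truncate(byte[] hash160) {
        return ByteBuffer.wrap(hash160, 0, Long.BYTES).getLong();
    }

    boolean contains(byte[] hash160) {
        return Arrays.binarySearch(buckets[hash160[0] & 0xFF], truncate(hash160)) >= 0;
    }

    /** Chance that a random absent hash160 matches one of n stored 64-bit prefixes: about n / 2^64. */
    static double falsePositiveRate(long entries) {
        return entries / Math.pow(2, 64);
    }
}
```

With the 1.377B-entry full DB, `falsePositiveRate(1_377_000_000L)` evaluates to roughly 7.5×10⁻¹¹, which matches the figure quoted above. Two hash160s that share their first 8 bytes are indistinguishable to this structure, which is exactly where that residual false-positive rate comes from.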
-
GPU grid-size sweep DONE.
src/test/java/net/ladenthin/bitcoinaddressfinder/benchmark/GridSizeSweepBenchmark.java sweeps CProducer.batchSizeInBits × CProducerOpenCL.keysPerWorkItem. Kernel entry is OpenCLContext.createKeys(BigInteger privateKeyBase). Availability gate is new OpenCLPlatformAssume().assumeOpenClLibraryAvailableAndOneOpenCL2_0OrGreaterDeviceAvailable(). @Fork jvmArgsAppend carries the full project-canonical master JVM-flag list (the same 24-entry --add-opens/--add-exports set, in the same order, as pom.xml <argLine>, .mvn/jvm.config and examples/*.bat) so JMH's forked JVMs match the JVM Surefire uses. Throughput unit is kernel launches per second at each corner; candidates/sec = launches/sec × (1 << batchSizeInBits). JMH's @OperationsPerInvocation cannot normalize this automatically because it needs a compile-time constant and @Param is runtime; documented in the class Javadoc.
NOT imported from the cjherm/BAF23 fork (intentional scope cap): the BenchmarkFactory/BenchmarkSeries/BenchmarkLogger/LatexContentCreator harness, the command: "BenchmarkSeries" CCommand extension, and the SHA / RIPEMD-160 GPU-vs-CPU comparison rounds.
Context-reuse / init-cost-amortisation sweep: explicitly NOT imported. The fork ships CtxRoundsIteratorBenchmark (fix gridNumBits, vary kernel-launches-per-context, measure init-cost amortisation curve). It is operationally meaningless for this codebase: ProducerOpenCL creates the OpenCLContext once in initProducer(), runs createKeys(BigInteger) on every produced batch, and closes the context once in releaseProducer(). The smallest production scan is ≳ 10⁶ kernel launches against the one long-lived context; init cost is already amortised to noise. Re-importing this idea later requires evidence that BAF's lifecycle has changed to short-burst / one-shot scans.
- Abstract the Java and test writing guidelines to a workspace-level shared layer. Canonical guides at ../workspace/guides/src/CODE_WRITING_GUIDE-8.md (Java 8 baseline) + CODE_WRITING_GUIDE-21.md (Java 21 supplement, applies to this repo), and TEST_WRITING_GUIDE-8.md + TEST_WRITING_GUIDE-21.md; canonical TDD skill at ../workspace/.claude/skills/java-tdd-guide/SKILL.md. BAF's CODE_WRITING_GUIDE.md / TEST_WRITING_GUIDE.md now contain only BAF-specific supplements.
- Standardised CLAUDE.md template: ../workspace/templates/CLAUDE.md.template.
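For the grid-size sweep entry above, the conversion from the measured unit (kernel launches per second) to candidates per second, and the init-cost amortisation argument, come down to two lines of arithmetic. The sketch below only illustrates that arithmetic; the class SweepMathSketch, its method names, and the example numbers in main are hypothetical and not part of the benchmark.

```java
/** Hypothetical helper illustrating the sweep arithmetic; not part of the benchmark. */
public class SweepMathSketch {

    /**
     * candidates/sec = launches/sec × (1 << batchSizeInBits).
     * Valid for batchSizeInBits in 0..62 so the long shift cannot overflow.
     */
    public static double candidatesPerSecond(double launchesPerSecond, int batchSizeInBits) {
        return launchesPerSecond * (double) (1L << batchSizeInBits);
    }

    /** One-time context init cost spread evenly over every launch on that context. */
    public static double amortisedInitNanosPerLaunch(double initNanos, long launches) {
        return initNanos / launches;
    }

    public static void main(String[] args) {
        // Example inputs are made up for illustration, not measured values.
        System.out.println(candidatesPerSecond(10.0, 20));
        System.out.println(amortisedInitNanosPerLaunch(1e9, 1_000_000L));
    }
}
```

Because the multiplier depends on the runtime @Param value, a post-processing step of this kind is one way to report candidates per second from the raw JMH scores.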
- New standalone kernel for contiguous scanning: single host-supplied anchor, pure affine point-addition walk (no per-key k·G, no comb/wNAF), 160-bit MSB-zero range, compact/output-only. Full design, rationale, constraints, and validation: docs/performance.md §8 "Future work".
- Persistent / warp-synchronous "megakernel" variant (different execution model, large rewrite): launch once with occupancy-maximal resident threads, each walking its own keyspace stripe in registers and looping add → hash → Fuse8 → append-hit, running until a host stop flag with async output draining. GPU-only compute ⇒ a thin host (seed initial keys, drain hits, checkpoint the frontier) that could be reimplemented standalone (e.g. in Rust); reading back each thread's frontier yields a compact ddrescue-style resumable coverage/map file (provable gap-free searched-domain record). Details + gotchas (TDR/watchdog, async drain, divergence/safegcd): docs/performance.md §8.
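The "affine point-addition walk" in the first item can be illustrated on the host with BigInteger arithmetic: starting from an anchor point, each next public key is the previous one plus G, so no scalar multiplication is needed per key. This is a sketch under assumptions, not the planned kernel: the class AffineWalkSketch and its methods are hypothetical, it runs in plain Java rather than OpenCL, it pays one modular inversion per step, and it makes no attempt at performance. The curve constants are the standard secp256k1 parameters.

```java
import java.math.BigInteger;

/** Hypothetical host-side sketch of a contiguous walk P(k+1) = P(k) + G on secp256k1. */
public class AffineWalkSketch {
    // Field prime p = 2^256 - 2^32 - 977.
    public static final BigInteger P = BigInteger.ONE.shiftLeft(256)
            .subtract(BigInteger.ONE.shiftLeft(32)).subtract(BigInteger.valueOf(977));
    public static final BigInteger GX = new BigInteger(
            "79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798", 16);
    public static final BigInteger GY = new BigInteger(
            "483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8", 16);

    /**
     * Adds two affine points {x, y}. Handles doubling; does not handle the point
     * at infinity or a + (-a), where modInverse throws ArithmeticException.
     */
    public static BigInteger[] add(BigInteger[] a, BigInteger[] b) {
        BigInteger lambda;
        if (a[0].equals(b[0]) && a[1].equals(b[1])) {
            // Doubling: lambda = 3x^2 / (2y), since the curve coefficient a is 0.
            BigInteger num = a[0].multiply(a[0]).multiply(BigInteger.valueOf(3));
            lambda = num.multiply(a[1].shiftLeft(1).mod(P).modInverse(P));
        } else {
            // Addition: lambda = (y2 - y1) / (x2 - x1).
            BigInteger num = b[1].subtract(a[1]);
            lambda = num.multiply(b[0].subtract(a[0]).mod(P).modInverse(P));
        }
        lambda = lambda.mod(P);
        BigInteger x = lambda.multiply(lambda).subtract(a[0]).subtract(b[0]).mod(P);
        BigInteger y = lambda.multiply(a[0].subtract(x)).subtract(a[1]).mod(P);
        return new BigInteger[] {x, y};
    }

    /** Returns anchor, anchor + G, anchor + 2G, ... (steps + 1 points in total). */
    public static BigInteger[][] walk(BigInteger[] anchor, int steps) {
        BigInteger[] g = {GX, GY};
        BigInteger[][] points = new BigInteger[steps + 1][];
        points[0] = anchor;
        for (int i = 1; i <= steps; i++) {
            points[i] = add(points[i - 1], g);
        }
        return points;
    }

    /** Checks the curve equation y^2 = x^3 + 7 (mod p). */
    public static boolean isOnCurve(BigInteger[] pt) {
        BigInteger lhs = pt[1].multiply(pt[1]).mod(P);
        BigInteger rhs = pt[0].pow(3).add(BigInteger.valueOf(7)).mod(P);
        return lhs.equals(rhs);
    }

    public static void main(String[] args) {
        // Anchor = G (private key 1); the walk then yields the keys for 2, 3, 4.
        for (BigInteger[] pt : walk(new BigInteger[] {GX, GY}, 3)) {
            System.out.println(pt[0].toString(16) + " onCurve=" + isOnCurve(pt));
        }
    }
}
```

In the sketch the anchor is simply G; in the design described above it would be a single host-supplied point for the start of the range.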