Proposal: Bipartite Supervalidator Architecture #224

Open
elliottdehn wants to merge 8 commits into canton-foundation:main from elliottdehn:proposal/bipartite-supervalidator

@elliottdehn
Development Fund Proposal Submission

Proposal file:
/proposals/bipartite-supervalidator-architecture.md


Summary

This proposal separates CPU-bound cryptographic operations (ECIES view encryption/decryption, Speedy reinterpretation) from IO-bound state management (BFT consensus, sequencer DB writes, ACS updates) into two independently scalable machine classes. A proof-of-concept using Canton's actual crypto primitives demonstrates 4x throughput improvement and 7x latency reduction at 32 concurrent transactions. The architecture is backward-compatible and opt-in — existing Daml applications require no changes.


Checklist

  • Proposal file added under /proposals/
  • Milestones and funding amounts defined
  • Acceptance criteria included
  • Alignment with Canton priorities described

Notes for Reviewers

  • This work is complementary to the ISS-based BFT ordering grant (approved 2026-03-23). Once ordering scales to 2500+ TPS, per-node ECIES encryption becomes the next binding throughput constraint. This proposal addresses that.
  • The PoC benchmark source code is available in the CIPs repo under bipartite-poc/. It uses real JDK 21 crypto (ECDH P-256, AES-256-GCM, Ed25519) and real PostgreSQL writes — not simulated delays.
  • The key architectural insight: ECIES encryption for Canton's sub-transaction privacy model is stateless (needs only read access to packages, contracts, topology) but consumes 35-43% of per-transaction CPU. Separating it from stateful DB/consensus work allows each to scale along its natural axis.
  • Funding request: 3,500,000 CC across 5 milestones over ~24 weeks.
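One rough way to read the 35-43% figure: if the stateful node were purely CPU-bound and all ECIES work moved off it, its per-transaction CPU cost would shrink to the remaining 57-65%, so its throughput ceiling could rise by at most 1/(1 - f), where f is the crypto share. The sketch below is back-of-envelope arithmetic under that assumption, not a PoC measurement; the function name is illustrative.

```python
def stateful_ceiling_gain(crypto_fraction: float) -> float:
    """Upper bound on the throughput gain of a CPU-bound node when
    `crypto_fraction` of its per-transaction CPU work is moved elsewhere.

    Assumption (not from the proposal): the node is limited only by CPU,
    and the offloaded work adds no new cost on the stateful side.
    """
    if not 0.0 <= crypto_fraction < 1.0:
        raise ValueError("crypto_fraction must be in [0, 1)")
    return 1.0 / (1.0 - crypto_fraction)


if __name__ == "__main__":
    # The two ends of the 35-43% range quoted in the notes.
    for f in (0.35, 0.43):
        gain = stateful_ceiling_gain(f)
        print(f"crypto share {f:.0%}: ceiling gain {gain:.2f}x")
```

Under these assumptions the ceiling gain is about 1.54x to 1.75x. Any larger end-to-end improvement would have to come from adding stateless crypto capacity, not from the separation alone.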

Separates CPU-bound crypto from IO-bound state management into two
independently scalable machine classes. PoC benchmarks demonstrate
4x throughput and 7x latency improvement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
definenoob and others added 5 commits April 15, 2026 22:43
Measured CPU breakdown via JDK Flight Recorder: crypto is 38.2% of
CPU (Ed25519 signing 21.5%, ECDH 15.7%), while PostgreSQL is only
1.2% (IO-wait, not CPU). Validates the premise that crypto is the
dominant offloadable bottleneck.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Isolated the SV workload by running the submitting participant in a
separate JVM. Crypto rises to 42.7% of SV CPU (vs 38.2% all-in-one)
because Daml engine/protobuf work lives on the submitter, not the SV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The key insight: DB operations aren't slow (they're IO-wait), but
they can't start until a core is free. On a monolithic node with
cores saturated by ECIES, DB threads sit in the run queue. The
bipartite split eliminates this contention — B's cores are almost
idle, so DB threads schedule instantly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmarking showed that JVM thread pool separation does NOT improve
latency — even under CPU saturation, the OS scheduler doesn't
enforce isolation between pools. The bipartite speedup comes from
having more total CPU (8 cpus vs 2 cpus = 4x), not from eliminating
scheduling contention between crypto and DB threads.

Updated the motivation, "Why Not Just Bigger Machines", and benchmark
results sections to accurately describe:
- The throughput gain is proportional to added CPU capacity
- The latency improvement follows from reduced queueing with more cores
- The architectural value is cost-efficient horizontal scaling of
  stateless crypto capacity, not thread scheduling optimization
- Thread pool separation within a single JVM was tested and doesn't work
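The claim that latency improves through reduced queueing can be illustrated with a textbook M/M/c queue (Poisson arrivals, exponential service times, c identical cores). This is an idealized model chosen for illustration, not the PoC's methodology, and the arrival and service rates below are hypothetical.

```python
import math


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving job has to wait in an M/M/c queue.

    offered_load = arrival_rate / service_rate (in Erlangs).
    """
    if offered_load >= servers:
        raise ValueError("unstable queue: offered load must be below server count")
    # Terms for fewer than `servers` jobs in the system.
    head = sum(offered_load**k / math.factorial(k) for k in range(servers))
    # Term for all servers busy, scaled by the geometric queue tail.
    tail = (offered_load**servers / math.factorial(servers)) * (
        servers / (servers - offered_load)
    )
    return tail / (head + tail)


def mean_latency(servers: int, arrival_rate: float, service_rate: float) -> float:
    """Mean time in system (queueing wait plus service) for an M/M/c queue."""
    load = arrival_rate / service_rate
    wait = erlang_c(servers, load) / (servers * service_rate - arrival_rate)
    return wait + 1.0 / service_rate


if __name__ == "__main__":
    # Hypothetical workload: each core finishes 1 job per time unit,
    # jobs arrive at 1.8 per time unit (90% utilisation on 2 cores).
    for cores in (2, 8):
        print(cores, "cores ->", round(mean_latency(cores, 1.8, 1.0), 3))
```

In this toy setup the same arrival rate nearly saturates 2 cores and barely loads 8, so almost all of the latency difference is time spent waiting for a free core, which matches the reasoning in this commit.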

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The smallest useful bipartite configuration (2 A nodes + 1 B node,
3 cpus total) delivers 6.4x throughput over 1-cpu monolithic — better
than the CPU ratio alone. The split eliminates cache thrashing between
crypto and IO workloads on the same core.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@elliottdehn force-pushed the proposal/bipartite-supervalidator branch from 2aa3f94 to a620c64 on April 16, 2026 18:11