rsky-graph: incremental periodic save + drop eager bloom rebuild#187
Open
rudyfraser wants to merge 14 commits into
Open
rsky-graph: incremental periodic save + drop eager bloom rebuild#187rudyfraser wants to merge 14 commits into
rudyfraser wants to merge 14 commits into
Conversation
Adds an authenticated admin endpoint that triggers bulk_load_keyset against the running in-memory graph without restart or LMDB wipe. Required because main.rs only fires bulk-load on empty LMDB; with 1M+ live firehose users already accumulated, we cannot afford to discard that state to backfill historical follows from PostgreSQL. The keyset loader is concurrency-safe with the firehose writer: FollowGraph uses DashMap throughout and add_follow is idempotent (RoaringBitmap is a set), so the two writers can target the same graph for the entire load duration. Endpoint behavior: - 401 when GRAPH_ADMIN_TOKEN unset or Bearer header mismatches - 405 on non-POST - 503 when DATABASE_URL unset - 409 when a load is already in flight (compare_exchange gate) - 202 on success; spawned task drops a guard that resets the running flag even on panic 14 unit tests cover all gates plus a concurrent-writer idempotence proof for the load-bearing claim that bulk-load + firehose is safe.
Detects the alphabetically-highest creator with a non-empty `following` bitmap (and within it, the highest subject) at the start of a keyset bulk-load and starts the keyset query from that boundary instead of the beginning. Restarted loads now skip the work day-1 already did. The keyset architecture already supported resuming from any (creator, subjectDid) point; this just makes the loader detect a safe resume boundary from the live in-memory graph. Empty graph -> empty boundary -> behaves exactly like the current "start from scratch" path. Drops the now-redundant `initial_stmt` -- an empty pair lex-precedes every real DID, so `page_stmt` covers both first-call and resume cases. 4 unit tests cover empty graph, max-creator selection, max-subject within a creator, and the subject-only-DID ignore rule.
LMDB rejects zero-byte keys with MDB_BAD_VALSIZE, and persistence.rs propagates the put error via `?` -- so a single empty DID slipping into did_to_uid aborts the entire periodic save transaction. In production this caused saves to fail every minute for hours, leaving on-disk LMDB frozen far behind in-memory state. Two-layer fix: - graph.rs: add_follow / add_follow_with_rkey early-return on empty actor or subject DID. They never enter the in-memory graph. - persistence.rs: skip empty-key entries (defense for any state already in memory at upgrade time) and continue the txn.
Previously the periodic LMDB save task was spawned after bulk_load_keyset returned -- meaning during the multi-hour initial load, in-memory state (tens of millions of users + roaring bitmaps) was never persisted. A crash or restart at any point during the load discarded everything. Hoist the shutdown handler + periodic save task to immediately after graph construction. The save runs on its 60s tick concurrently with the bulk-load, so worst-case crash loss is one tick, and the auto-resume logic can then pick up from the last persisted (creator, subjectDid) boundary.
The previous fixed 60s interval choked the bulk-load: each save serializes the entire graph to LMDB, opens a new env, and writes every roaring bitmap. At 5M users the save was taking >30s, contending for DashMap shard locks with the bulk-load's add_follow path. Instantaneous bulk-load rate collapsed from 50k/s to 1.5k/s in under an hour. Switch to exponential schedule (60s, 120s, 240s, 480s, 960s, then every 30 min). Captures early progress quickly while keeping per-tick overhead amortized as the graph grows. Worst-case crash loss during late bulk-load: 30 min of progress, vs the prior tradeoff of 60s loss but loads that never complete.
The 60s statement_timeout in bulk_load_keyset trips under appview PG load (IO stalls, contention with the live getFollowsFollowing self-JOIN), and a single failed query propagated up via `?` and crashed the whole load. systemd restarted the process, but main.rs only auto-resumes via bulk_load_keyset when LMDB is empty -- the post-crash process sat idle. Two changes: 1. Bump per-statement cap to 300s. Almost no batch should exceed this. 2. Retry transient query errors up to 5 times with 10s, 20s, 30s, 40s backoff before giving up. Covers brief PG IO stalls.
Spawning the persistence task before bulk-load races the loader: each periodic save opens a second LMDB env at the same path and rewrites the data file while bulk_load_keyset is concurrently mutating the in- memory graph. Observed in production: 166 GB written through the LMDB path in 20 minutes during a startup load that hadn't yet bound the HTTP port -- saves were never going to converge with the still-loading graph. Revert to the original ordering: persistence task is spawned only after bulk_load_keyset returns and the post-load save completes. Crash mid-bulk-load loses everything (~7-8h), but with 300s retry/ timeout in bulk_load.rs the keyset itself is now resilient to PG hiccups, so a clean run is the most likely outcome. Steady-state interval bumped to 300s -- firehose ingestion alone doesn't need 60s checkpoints.
The previous design serialized the entire startup pipeline: load_from_lmdb -> bulk_load_keyset -> build_all_bloom_filters -> save_to_lmdb -> serve API This meant the API didn't bind for hours -- and a crash anywhere along the way lost all in-memory work. After 9 hours of bulk-load on production the bloom rebuild stalled and the 1.41B-follow graph was unrecoverable. This change inverts the model: 1. LoadState (types.rs): tracks `last_completed_creator` + `complete` flag. 2. bulk_load_keyset takes &LoadState; records a creator complete each time the keyset cursor transitions to the next creator, and marks complete on exit. Idempotent / monotonic. 3. API handlers consult LoadState. If a query references an actor whose DID is above the cursor, return 503. The appview already falls back to the SQL self-JOIN on non-2xx, so partial graph state is harmless. 4. main.rs starts the HTTP API immediately. Bulk-load runs in a tokio spawn; firehose runs alongside; periodic save runs alongside. None of them gate API availability. 5. Eager build_all_bloom_filters is removed -- bulk-load already populates blooms incrementally via add_follow. Skipping the upfront rebuild eliminates the 2+ hour single-threaded CPU hot loop that hung the prior deploy. Bloom FP rate is slightly higher in steady state but queries are still correct (bitmap intersection is the source of truth). 6. load_from_lmdb now restores follow_count from the bitmap state -- the smoke test caught this regression where the count was always 0 after reload. Tests: - 5 new unit tests in types.rs / api.rs covering LoadState semantics and the 503-when-unloaded path - 4 integration smoke tests in tests/smoke.rs that exercise the full bulk-load -> persistence -> reload flow against a real PostgreSQL. Gated on SMOKE_DATABASE_URL so they run in CI / dev / pre-deploy validation without requiring docker locally. Lib + bin split (lib.rs alongside main.rs) so integration tests can import the modules.
The save algorithm was a full O(N) DashMap snapshot every cycle: each periodic save iterated all did_to_uid, uid_to_did, following, and followers entries and wrote them all to LMDB. Time grew linearly with graph size, then superlinearly once the LMDB mmap exceeded the OS buffer cache (page-faulting through the file on every COW B-tree walk). At 50 GB LMDB / 64 GB box, save #11 took 92 minutes vs ~18 minutes for the equivalent save earlier in the same run. Replace with an incremental save: - FollowGraph gains a dirty_users: DashSet<u32>. add_follow, add_follow_with_rkey, remove_follow, and remove_follow_by_rkey now insert both the actor and subject UIDs into it. - save_to_lmdb drains the dirty set into a Vec, pre-removing each UID before reading the bitmap so a concurrent mutation that arrives after the drain re-marks the user dirty (next save catches up; never lost). - save_full() is retained for cold start (no data.mdb yet) so a fresh install gets a complete snapshot. Steady-state always uses save_dirty. - load_from_lmdb does not pollute dirty_users -- the post-load graph mirrors disk and the next save has nothing to write until a real mutation occurs. Tests: - 6 graph.rs unit tests cover dirty-tracking semantics (add/remove marks, drain clears, dedup, idempotency). - New smoke test incremental_save_only_writes_dirty_entries verifies cold-start full snapshot -> reload -> single-edge mutation -> dirty save -> reload, asserting the second save preserves untouched edges and persists the new one. No PG required for this case. cargo test -p rsky-graph: 31 lib tests + 5 smoke tests pass.
A prior commit message claimed the eager build_all_bloom_filters call was removed, but persistence.rs still invoked it inside load_from_lmdb. That blocks startup for hours on a 25M-user graph: each Bloom::new calls getrandom twice and we then set() ~1.4B follower bits, all on one thread, before api::serve binds or bulk_load_keyset resumes. A fresh 14:15 deploy spun ~16 min and hadn't moved past it. Skip the rebuild entirely. follower_blooms is repopulated incrementally as add_follow runs (firehose + bulk-load already do this). Missing blooms only forfeit the fast-reject optimization on get_follows_following queries -- the bitmap intersection that follows is the source of truth, so correctness is unchanged.
2 tasks
Collaborator
|
@rudyfraser look at #189 to help with the rsky-pds issue |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rsky-graph maintains an in-memory follow graph (RoaringBitmap per user, persisted to LMDB) so mutual-followers queries are sub-millisecond instead of a Postgres self-JOIN. This change makes the periodic LMDB save O(dirty users) via a
DashSet<u32>populated inadd_follow/remove_followand drained per save, and drops the eager bloom-filter rebuild fromload_from_lmdbsince blooms can be repopulated incrementally asadd_followruns.Test plan
cargo fmt -p rsky-graphcleancargo clippy -p rsky-graph --all-targetsno new warningscargo test -p rsky-graph --libpasses (includes 6 new tests ingraph::testscovering dirty-tracking semantics)cargo test -p rsky-graph --test smokepasses (includes newincremental_save_only_writes_dirty_entriesround-trip)cargo build --release -p rsky-graphclean