Skip to content

rsky-graph: incremental periodic save + drop eager bloom rebuild#187

Open
rudyfraser wants to merge 14 commits into
mainfrom
feat/rsky-graph-incremental-save
Open

rsky-graph: incremental periodic save + drop eager bloom rebuild#187
rudyfraser wants to merge 14 commits into
mainfrom
feat/rsky-graph-incremental-save

Conversation

@rudyfraser
Copy link
Copy Markdown
Member

Summary

rsky-graph maintains an in-memory follow graph (RoaringBitmap per user, persisted to LMDB) so mutual-followers queries are sub-millisecond instead of a Postgres self-JOIN. This change makes the periodic LMDB save O(dirty users) via a DashSet<u32> populated in add_follow/remove_follow and drained per save, and drops the eager bloom-filter rebuild from load_from_lmdb since blooms can be repopulated incrementally as add_follow runs.

Test plan

  • cargo fmt -p rsky-graph clean
  • cargo clippy -p rsky-graph --all-targets no new warnings
  • cargo test -p rsky-graph --lib passes (includes 6 new tests in graph::tests covering dirty-tracking semantics)
  • cargo test -p rsky-graph --test smoke passes (includes new incremental_save_only_writes_dirty_entries round-trip)
  • cargo build --release -p rsky-graph clean

Adds an authenticated admin endpoint that triggers bulk_load_keyset
against the running in-memory graph without restart or LMDB wipe.
Required because main.rs only fires bulk-load on empty LMDB; with
1M+ live firehose users already accumulated, we cannot afford to
discard that state to backfill historical follows from PostgreSQL.

The keyset loader is concurrency-safe with the firehose writer:
FollowGraph uses DashMap throughout and add_follow is idempotent
(RoaringBitmap is a set), so the two writers can target the same
graph for the entire load duration.

Endpoint behavior:
- 401 when GRAPH_ADMIN_TOKEN unset or Bearer header mismatches
- 405 on non-POST
- 503 when DATABASE_URL unset
- 409 when a load is already in flight (compare_exchange gate)
- 202 on success; spawned task drops a guard that resets the
  running flag even on panic

14 unit tests cover all gates plus a concurrent-writer idempotence
proof for the load-bearing claim that bulk-load + firehose is safe.
Detects the alphabetically-highest creator with a non-empty `following`
bitmap (and within it, the highest subject) at the start of a keyset
bulk-load and starts the keyset query from that boundary instead of
the beginning. Restarted loads now skip the work day-1 already did.

The keyset architecture already supported resuming from any
(creator, subjectDid) point; this just makes the loader detect a safe
resume boundary from the live in-memory graph. Empty graph -> empty
boundary -> behaves exactly like the current "start from scratch" path.

Drops the now-redundant `initial_stmt` -- an empty pair lex-precedes
every real DID, so `page_stmt` covers both first-call and resume cases.

4 unit tests cover empty graph, max-creator selection, max-subject within
a creator, and the subject-only-DID ignore rule.
LMDB rejects zero-byte keys with MDB_BAD_VALSIZE, and persistence.rs
propagates the put error via `?` -- so a single empty DID slipping into
did_to_uid aborts the entire periodic save transaction. In production
this caused saves to fail every minute for hours, leaving on-disk LMDB
frozen far behind in-memory state.

Two-layer fix:
- graph.rs: add_follow / add_follow_with_rkey early-return on empty
  actor or subject DID. They never enter the in-memory graph.
- persistence.rs: skip empty-key entries (defense for any state already
  in memory at upgrade time) and continue the txn.
Previously the periodic LMDB save task was spawned after bulk_load_keyset
returned -- meaning during the multi-hour initial load, in-memory state
(tens of millions of users + roaring bitmaps) was never persisted. A crash
or restart at any point during the load discarded everything.

Hoist the shutdown handler + periodic save task to immediately after graph
construction. The save runs on its 60s tick concurrently with the bulk-load,
so worst-case crash loss is one tick, and the auto-resume logic can then
pick up from the last persisted (creator, subjectDid) boundary.
The previous fixed 60s interval choked the bulk-load: each save serializes
the entire graph to LMDB, opens a new env, and writes every roaring bitmap.
At 5M users the save was taking >30s, contending for DashMap shard locks
with the bulk-load's add_follow path. Instantaneous bulk-load rate
collapsed from 50k/s to 1.5k/s in under an hour.

Switch to exponential schedule (60s, 120s, 240s, 480s, 960s, then every
30 min). Captures early progress quickly while keeping per-tick overhead
amortized as the graph grows. Worst-case crash loss during late bulk-load:
30 min of progress, vs the prior tradeoff of 60s loss but loads that
never complete.
The 60s statement_timeout in bulk_load_keyset trips under appview PG load
(IO stalls, contention with the live getFollowsFollowing self-JOIN), and
a single failed query propagated up via `?` and crashed the whole load.
systemd restarted the process, but main.rs only auto-resumes via
bulk_load_keyset when LMDB is empty -- the post-crash process sat idle.

Two changes:
1. Bump per-statement cap to 300s. Almost no batch should exceed this.
2. Retry transient query errors up to 5 times with 10s, 20s, 30s, 40s
   backoff before giving up. Covers brief PG IO stalls.
Spawning the persistence task before bulk-load races the loader: each
periodic save opens a second LMDB env at the same path and rewrites
the data file while bulk_load_keyset is concurrently mutating the in-
memory graph. Observed in production: 166 GB written through the LMDB
path in 20 minutes during a startup load that hadn't yet bound the
HTTP port -- saves were never going to converge with the still-loading
graph.

Revert to the original ordering: persistence task is spawned only
after bulk_load_keyset returns and the post-load save completes.
Crash mid-bulk-load loses everything (~7-8h), but with 300s retry/
timeout in bulk_load.rs the keyset itself is now resilient to PG
hiccups, so a clean run is the most likely outcome.

Steady-state interval bumped to 300s -- firehose ingestion alone
doesn't need 60s checkpoints.
The previous design serialized the entire startup pipeline:
  load_from_lmdb -> bulk_load_keyset -> build_all_bloom_filters -> save_to_lmdb -> serve API

This meant the API didn't bind for hours -- and a crash anywhere along the
way lost all in-memory work. After 9 hours of bulk-load on production the
bloom rebuild stalled and the 1.41B-follow graph was unrecoverable.

This change inverts the model:

1. LoadState (types.rs): tracks `last_completed_creator` + `complete` flag.
2. bulk_load_keyset takes &LoadState; records a creator complete each time
   the keyset cursor transitions to the next creator, and marks complete on
   exit. Idempotent / monotonic.
3. API handlers consult LoadState. If a query references an actor whose DID
   is above the cursor, return 503. The appview already falls back to the
   SQL self-JOIN on non-2xx, so partial graph state is harmless.
4. main.rs starts the HTTP API immediately. Bulk-load runs in a tokio
   spawn; firehose runs alongside; periodic save runs alongside. None of
   them gate API availability.
5. Eager build_all_bloom_filters is removed -- bulk-load already populates
   blooms incrementally via add_follow. Skipping the upfront rebuild
   eliminates the 2+ hour single-threaded CPU hot loop that hung the prior
   deploy. Bloom FP rate is slightly higher in steady state but queries
   are still correct (bitmap intersection is the source of truth).
6. load_from_lmdb now restores follow_count from the bitmap state -- the
   smoke test caught this regression where the count was always 0 after
   reload.

Tests:
- 5 new unit tests in types.rs / api.rs covering LoadState semantics and
  the 503-when-unloaded path
- 4 integration smoke tests in tests/smoke.rs that exercise the full
  bulk-load -> persistence -> reload flow against a real PostgreSQL.
  Gated on SMOKE_DATABASE_URL so they run in CI / dev / pre-deploy
  validation without requiring docker locally.

Lib + bin split (lib.rs alongside main.rs) so integration tests can import
the modules.
The save algorithm was a full O(N) DashMap snapshot every cycle: each
periodic save iterated all did_to_uid, uid_to_did, following, and
followers entries and wrote them all to LMDB. Time grew linearly with
graph size, then superlinearly once the LMDB mmap exceeded the OS
buffer cache (page-faulting through the file on every COW B-tree walk).
At 50 GB LMDB / 64 GB box, save #11 took 92 minutes vs ~18 minutes for
the equivalent save earlier in the same run.

Replace with an incremental save:

- FollowGraph gains a dirty_users: DashSet<u32>. add_follow,
  add_follow_with_rkey, remove_follow, and remove_follow_by_rkey now
  insert both the actor and subject UIDs into it.
- save_to_lmdb drains the dirty set into a Vec, pre-removing each UID
  before reading the bitmap so a concurrent mutation that arrives after
  the drain re-marks the user dirty (next save catches up; never lost).
- save_full() is retained for cold start (no data.mdb yet) so a fresh
  install gets a complete snapshot. Steady-state always uses save_dirty.
- load_from_lmdb does not pollute dirty_users -- the post-load graph
  mirrors disk and the next save has nothing to write until a real
  mutation occurs.

Tests:
- 6 graph.rs unit tests cover dirty-tracking semantics (add/remove
  marks, drain clears, dedup, idempotency).
- New smoke test incremental_save_only_writes_dirty_entries verifies
  cold-start full snapshot -> reload -> single-edge mutation -> dirty
  save -> reload, asserting the second save preserves untouched edges
  and persists the new one. No PG required for this case.

cargo test -p rsky-graph: 31 lib tests + 5 smoke tests pass.
A prior commit message claimed the eager build_all_bloom_filters call
was removed, but persistence.rs still invoked it inside load_from_lmdb.
That blocks startup for hours on a 25M-user graph: each Bloom::new
calls getrandom twice and we then set() ~1.4B follower bits, all on
one thread, before api::serve binds or bulk_load_keyset resumes. A
fresh 14:15 deploy spun ~16 min and hadn't moved past it.

Skip the rebuild entirely. follower_blooms is repopulated incrementally
as add_follow runs (firehose + bulk-load already do this). Missing
blooms only forfeit the fast-reject optimization on get_follows_following
queries -- the bitmap intersection that follows is the source of truth,
so correctness is unchanged.
@afbase
Copy link
Copy Markdown
Collaborator

afbase commented May 23, 2026

@rudyfraser look at #189 to help with the rsky-pds issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants