
graph/db: async graph cache population#10065

Open
ellemouton wants to merge 9 commits into lightningnetwork:master from
ellemouton:asyncGraphCacheLoad

Conversation

@ellemouton
Collaborator

@ellemouton ellemouton commented Jul 10, 2025

This PR makes the in-memory graph cache load asynchronously during
startup, so that lnd can begin serving RPCs and handling peer
connections without blocking on a full graph scan. On large nodes the
graph cache population can take tens of seconds; with this change it
happens in the background while all graph reads gracefully fall back to
the database until the cache is ready.

Opt-out flag (commit 7):

  • --db.sync-graph-cache-load restores the old blocking behaviour for
    users who prefer it.

Observability (commit 8):

  • GraphCacheStatus enum (DISABLED, LOADING, LOADED) added to
    GetInfoResponse so operators and clients can monitor cache readiness.
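The new enum can be sketched in Go roughly as below. This is an illustrative model only; the actual identifiers are generated from the proto definition and may differ:

```go
package main

import "fmt"

// GraphCacheStatus mirrors the enum described above: DISABLED when the
// cache is turned off, LOADING while the background population runs, and
// LOADED once reads are served from memory. Names are illustrative.
type GraphCacheStatus int

const (
	GraphCacheStatusDisabled GraphCacheStatus = iota
	GraphCacheStatusLoading
	GraphCacheStatusLoaded
)

// String renders the status the way it would appear to an operator
// polling GetInfo.
func (s GraphCacheStatus) String() string {
	switch s {
	case GraphCacheStatusDisabled:
		return "DISABLED"
	case GraphCacheStatusLoading:
		return "LOADING"
	case GraphCacheStatusLoaded:
		return "LOADED"
	default:
		return "UNKNOWN"
	}
}

func main() {
	fmt.Println(GraphCacheStatusLoading) // LOADING
}
```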

How it works

  Start()
    ├── creates cancellable ctx
    ├── launches populateCache(ctx) in goroutine
    │     ├── beginPopulation()    ← sets loading=true, starts buffering mutations
    │     ├── iterates DB (respects ctx cancellation)
    │     └── finishPopulation()   ← replays buffered mutations, sets loaded=true
    └── launches handleTopologySubscriptions

During loading:
reads → fall back to DB (isLoaded() == false)
writes → DB first, then buffered via applyUpdate()

After loading:
reads → served from cache (lock-free atomic check)
writes → applied directly to cache (no buffering)
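The lifecycle above can be sketched as a small state wrapper. This is a simplified model under stated assumptions: a plain map guarded by a mutex stands in for the real GraphCache, whereas the PR describes a lock-free atomic check for the loaded flag; all names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// graphCacheState buffers mutations while the initial DB scan runs, then
// replays them and flips the cache to loaded.
type graphCacheState struct {
	mu             sync.Mutex
	loading        bool
	loaded         bool
	pendingUpdates []func(map[string]string)
	cache          map[string]string
}

// beginPopulation marks the cache as loading so that concurrent writes
// start being buffered.
func (s *graphCacheState) beginPopulation() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.loading = true
}

// applyUpdate buffers the mutation during population and applies it
// directly once population is done.
func (s *graphCacheState) applyUpdate(update func(map[string]string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.loading {
		s.pendingUpdates = append(s.pendingUpdates, update)
		return
	}
	update(s.cache)
}

// finishPopulation replays buffered mutations and marks the cache loaded.
func (s *graphCacheState) finishPopulation() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, update := range s.pendingUpdates {
		update(s.cache)
	}
	s.pendingUpdates = nil
	s.loading = false
	s.loaded = true
}

// isLoaded gates reads: false means callers fall back to the database.
func (s *graphCacheState) isLoaded() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.loaded
}

func main() {
	s := &graphCacheState{cache: make(map[string]string)}
	s.beginPopulation()
	s.applyUpdate(func(c map[string]string) { c["chan1"] = "policy" })
	fmt.Println(len(s.cache), s.isLoaded()) // 0 false (still buffered)
	s.finishPopulation()
	fmt.Println(len(s.cache), s.isLoaded()) // 1 true (replayed)
}
```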

Fixes #6187
Replaces #8919

@ellemouton ellemouton self-assigned this Jul 10, 2025
@ellemouton ellemouton changed the title Async graph cache load graph/db: async graph cache load Jul 10, 2025
@ellemouton ellemouton changed the title graph/db: async graph cache load graph/db: async graph cache population Jul 10, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @ellemouton, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances application startup performance by implementing asynchronous loading for the channel graph cache. This change ensures that the system remains available for queries by temporarily serving them from the database while the in-memory cache is being built. A new configuration flag provides flexibility for users to revert to synchronous loading if desired.

Highlights

  • Performance Improvement: The channel graph cache now populates asynchronously on startup, allowing the application to become responsive faster. During the asynchronous loading phase, all read queries will automatically fall back to the database, ensuring continuous availability.
  • Configuration Option: A new configuration option, --db.sync-graph-cache-load, has been introduced. This allows users to opt out of the new asynchronous loading behavior and force the graph cache to load synchronously, blocking startup until completion.
  • Robust Query Handling: Graph query methods have been updated to intelligently check if the in-memory cache is fully loaded. If not, they will transparently query the underlying database, preventing incomplete or stale data from being served from the cache.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The code changes introduce asynchronous graph cache population on startup, improving performance. The implementation includes a fallback to database queries during cache loading and a configuration option for opting out. The code appears well-structured and includes a new test case.

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from 3d83c73 to 9ac1c58 on July 10, 2025 13:01
@ellemouton
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous graph cache population, which is a great performance enhancement for node startup. The implementation is clean, with appropriate configuration options to control the new behavior. The tests are thorough, including a new test for the async loading and updates to existing tests to ensure they remain deterministic. The code adheres to the LND Style Guide.

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from a814e5a to 944c68c on July 10, 2025 16:04
@saubyk saubyk added this to lnd v0.20 Jul 10, 2025
@saubyk saubyk moved this to In progress in lnd v0.20 Jul 10, 2025
Contributor

@GustavoStingelin GustavoStingelin left a comment


🚀

@ellemouton ellemouton changed the base branch from master to elle-reset-callbacks July 11, 2025 10:22
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from 99833e9 to c5372c4 on July 11, 2025 10:24
@djkazic
Contributor

djkazic commented Jul 12, 2025

If there are writes to the underlying graph during cache population, are these queued to be written into the cache, or will those entries just experience a cache miss when a read happens later?

@ellemouton ellemouton marked this pull request as draft September 1, 2025 06:28
@saubyk saubyk removed this from lnd v0.20 Sep 9, 2025
@ellemouton ellemouton deleted the branch lightningnetwork:master November 24, 2025 11:21
@ellemouton ellemouton closed this Nov 24, 2025
@ellemouton ellemouton reopened this Nov 24, 2025
@ellemouton ellemouton changed the base branch from elle-reset-callbacks to master November 24, 2025 11:24
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from a9663ad to 281e080 on March 6, 2026 09:23
ellemouton added a commit to ellemouton/lnd that referenced this pull request Mar 9, 2026
Add testRemoteGraphPolicyUpdate which demonstrates that when a payment
fails due to a stale fee policy (FeeInsufficient), the updated
ChannelUpdate from the failure message is NOT applied to the sender's
graph cache when the channel only exists in the remote graph source.

The test asserts the current broken behavior: the payment fails with
FAILURE_REASON_NO_ROUTE because Zane cannot learn the updated fee from
the failure message, and the cached policy remains stale.

Also includes a small fix to DBGraphSource.SourceNode (use
ErrSourceNodeNotSet instead of ErrGraphNodeNotFound) and a TODO for
making remote cache population async once lightningnetwork#10065 lands.
@saubyk saubyk added this to v0.21 Mar 13, 2026
@saubyk saubyk added this to the v0.21.0 milestone Mar 13, 2026
@saubyk saubyk moved this to In review in v0.21 Mar 13, 2026
@lightninglabs-deploy
Collaborator

@ellemouton, remember to re-request review from reviewers when ready

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from 864c553 to ee57127 on March 17, 2026 10:02
@djkazic
Contributor

djkazic commented Mar 17, 2026

Will review this tomorrow.

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from 8aa116e to 112c95a on March 17, 2026 15:02
@saubyk saubyk requested a review from ziggie1984 March 17, 2026 15:42
@ziggie1984
Collaborator

@claude review

@claude

claude bot commented Mar 20, 2026

Claude Code is working…

I'll analyze this and get back to you.


Collaborator

@ziggie1984 ziggie1984 left a comment


Looking good, I am going to test it on my node.

I wonder if we should shut down the node if the async population fails?

defer log.Debug("ChannelGraph started")

ctx := context.TODO()
ctx, cancel := context.WithCancel(context.Background())
Collaborator


Why is the context an Option? It seems it's always set?

Collaborator Author


I assume you mean cancel? Well, technically it is only set after Start is called, not at construction time. And if you remember, we made the startup sequence such that it is possible for Stop to be called even if Start was never called.


// errContextDone returns the context error if the context is done (canceled or
// deadline exceeded).
func errContextDone(ctx context.Context) error {
Collaborator


errContextDone is a thin wrapper around ctx.Err() with a nil guard, but the guard doesn't add meaningful safety here. Go's own docs say "do not pass a nil Context": if a nil context were passed it would be a bug in the caller that should surface as a panic, not be silently swallowed by returning nil. The two call sites are straightforward enough to inline:

if err := ctx.Err(); err != nil {
     return err
}

This is more idiomatic and removes a small helper whose nil branch is unreachable in practice.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch :)

c.cache.finishPopulation(loaded)
}()

cache := c.cache.cache
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reads like a typo; can we maybe name them differently?

// pendingUpdatesWarnThreshold is the number of buffered cache mutations at
// which a warning is logged. A large buffer indicates that cache population is
// taking a long time relative to the incoming gossip rate.
const pendingUpdatesWarnThreshold = 10_000
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ever expecting to hit this value? Have you done some testing on mainnet?

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to have some safeguard, but perhaps this is too high?

if s.loading {
s.pendingUpdates = append(s.pendingUpdates, update)

if len(s.pendingUpdates) == pendingUpdatesWarnThreshold {
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:


  if len(s.pendingUpdates) % pendingUpdatesWarnThreshold == 0 {
      log.Warnf("Graph cache has %d pending updates "+
          "buffered during population",
          len(s.pendingUpdates))
  }


// asyncGraphCachePopulation indicates whether the graph cache
// should be populated asynchronously or if the Start method should
// block until the cache is fully populated.
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe add that this is the default


// GraphCacheStatusFailed indicates that the initial population of
// the graph cache failed. Reads fall back to the database.
GraphCacheStatusFailed
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, when the graph cache fails to be populated, we just silently continue, and users need to either call GetInfo or check the logs. What about shutting lnd down if the cache population fails? It would be similar to the sync behaviour, where we fail in case of an error.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it now uses 'log.Critical' if it errors - which will trigger shutdown 👍

Instead of letting tests set the graphCache to nil in order to simulate
it not being set, we instead make use of the WithUseGraphCache helper.
Clean up TestGraphCacheTraversal so that we are explicitly enabling the
graphCache. This removes the need to explicitly make calls to the cache.

Also remove a duplicate check from assertNodeNotInCache.
Use this to block reading from the cache unless cacheLoaded returns
true. This will start being useful once cache population is done
asynchronously.
Refactor so that we don't have two layers of indentation later on when
we want to spin populateCache off into a goroutine.
Create a cancellable context in Start() and store its cancel function
on the struct. Stop() invokes it so that long-running DB iterations
(e.g. cache population) can be interrupted promptly during shutdown.
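A minimal sketch of that Start/Stop pattern, with a goroutine standing in for populateCache. All names are illustrative; the real ChannelGraph in lnd has more moving parts:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ChannelGraph sketch: Start creates a cancellable context for the
// background population goroutine and stores its cancel func on the
// struct; Stop invokes it and waits for the goroutine to exit.
type ChannelGraph struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func (g *ChannelGraph) Start() {
	ctx, cancel := context.WithCancel(context.Background())

	g.mu.Lock()
	g.cancel = cancel
	g.mu.Unlock()

	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		// Stand-in for populateCache(ctx): block until cancelled.
		<-ctx.Done()
	}()
}

// Stop is safe to call even if Start was never called, matching the
// startup sequence discussed in the review thread.
func (g *ChannelGraph) Stop() {
	g.mu.Lock()
	cancel := g.cancel
	g.mu.Unlock()

	if cancel != nil {
		cancel()
	}
	g.wg.Wait()
}

func main() {
	g := &ChannelGraph{}
	g.Start()
	g.Stop()
	fmt.Println("stopped cleanly")
}
```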
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from c90ff15 to 3c7d9d7 Compare March 23, 2026 14:22
@ellemouton
Collaborator Author

Thanks for review @ziggie1984 - I've updated things :)

@djkazic - just posting a reminder here, as it seems I'm unable to add you as a reviewer 🤔

@djkazic
Contributor

djkazic commented Mar 23, 2026

Reviewing right now :)

Collaborator

@ziggie1984 ziggie1984 left a comment


LGTM

@djkazic
Contributor

djkazic commented Mar 23, 2026

LGTM overall, graphCacheState is a nice and clean abstraction. Test coverage is also good.

One nit is on finishPopulation. IIUC, if population fails, the cache is in a partial state, and replaying mutations on top of a partial cache seems pointless since reads won't use it (the isFailed() gate would prevent it). IMO the replay is harmless but unnecessary in that case.

Introduce graphCacheState, a wrapper around GraphCache that tracks its
population lifecycle (loading -> loaded) and buffers concurrent mutations
during the initial DB scan. Once population completes, buffered updates
are replayed and the cache begins serving reads.

Start() now launches populateCache in a background goroutine by default.
While the cache is loading, all graph reads fall back to the database.
The KV iterators (ForEachNodeCacheable, ForEachChannelCacheable) now
respect context cancellation so that Stop() can interrupt a long-running
population.

Tests cover: concurrent reads during population, concurrent write replay,
shutdown cancellation during load, population failure with DB fallback,
and KV iterator cancellation.
Add a new option to opt out of the new asynchronous graph cache loading
feature.
Add a GraphCacheStatus enum to GetInfoResponse so callers can tell
whether the graph cache is disabled, still loading, or fully loaded.

This makes the async graph cache startup state visible to operators and
clients without changing the existing DB fallback behaviour for reads.
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from 3c7d9d7 to f733eed on March 24, 2026 07:53
@ellemouton
Collaborator Author

thanks @djkazic - updated to not replay on fail


Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

channeldb: make channel graph population async

7 participants