graph/db: async graph cache population #10065
ellemouton wants to merge 9 commits into lightningnetwork:master
Conversation
Summary of Changes
Hello @ellemouton, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances application startup performance by implementing asynchronous loading for the channel graph cache. This change ensures that the system remains available for queries by temporarily serving them from the database while the in-memory cache is being built. A new configuration flag provides flexibility for users to revert to synchronous loading if desired.
Highlights
- Performance Improvement: The channel graph cache now populates asynchronously on startup, allowing the application to become responsive faster. During the asynchronous loading phase, all read queries will automatically fall back to the database, ensuring continuous availability.
- Configuration Option: A new configuration option, --db.sync-graph-cache-load, has been introduced. This allows users to opt out of the new asynchronous loading behavior and force the graph cache to load synchronously, blocking startup until completion.
- Robust Query Handling: Graph query methods have been updated to intelligently check if the in-memory cache is fully loaded. If not, they will transparently query the underlying database, preventing incomplete or stale data from being served from the cache.
Code Review
The code changes introduce asynchronous graph cache population on startup, improving performance. The implementation includes a fallback to database queries during cache loading and a configuration option for opting out. The code appears well-structured and includes a new test case.
force-pushed from 3d83c73 to 9ac1c58
/gemini review
Code Review
This pull request introduces asynchronous graph cache population, which is a great performance enhancement for node startup. The implementation is clean, with appropriate configuration options to control the new behavior. The tests are thorough, including a new test for the async loading and updates to existing tests to ensure they remain deterministic. The code adheres to the LND Style Guide.
force-pushed from a814e5a to 944c68c
force-pushed from 99833e9 to c5372c4
If there are writes to the underlying graph during cache population, are these queued to be written into the cache, or will these entries just experience a cache miss when a read happens later?
force-pushed from c5372c4 to a9663ad
force-pushed from a9663ad to 281e080
Add testRemoteGraphPolicyUpdate which demonstrates that when a payment fails due to a stale fee policy (FeeInsufficient), the updated ChannelUpdate from the failure message is NOT applied to the sender's graph cache when the channel only exists in the remote graph source. The test asserts the current broken behavior: the payment fails with FAILURE_REASON_NO_ROUTE because Zane cannot learn the updated fee from the failure message, and the cached policy remains stale. Also includes a small fix to DBGraphSource.SourceNode (use ErrSourceNodeNotSet instead of ErrGraphNodeNotFound) and a TODO for making remote cache population async once lightningnetwork#10065 lands.
@ellemouton, remember to re-request review from reviewers when ready
force-pushed from 864c553 to ee57127
Will review this tomorrow.
force-pushed from 8aa116e to 112c95a
@claude review
I'll analyze this and get back to you.
ziggie1984 left a comment:
Looking good, I am going to test it on my node.
I wonder if we should shut down the node if the async population fails?
	defer log.Debug("ChannelGraph started")

-	ctx := context.TODO()
+	ctx, cancel := context.WithCancel(context.Background())
Why is the context an Option? It seems to always be set?
I assume you mean cancel? Well, technically it is only set after Start is called, not at construction time. And if you remember: we made the startup sequence such that it is possible for Stop to be called even if Start was never called.
graph/db/kv_store.go (Outdated)

// errContextDone returns the context error if the context is done (canceled or
// deadline exceeded).
func errContextDone(ctx context.Context) error {
errContextDone is a thin wrapper around ctx.Err() with a nil guard, but the guard doesn't add meaningful safety here. Go's own docs say "do not pass a nil Context" — if a nil context
were passed it would be a bug in the caller that should surface as a panic, not be silently swallowed by returning nil. The two call sites are straightforward enough to inline:
if err := ctx.Err(); err != nil {
	return err
}
This is more idiomatic and removes a small helper whose nil branch is unreachable in practice.
graph/db/graph.go (Outdated)

	c.cache.finishPopulation(loaded)
}()

cache := c.cache.cache
That reads like a typo; can we name them differently, maybe?
// pendingUpdatesWarnThreshold is the number of buffered cache mutations at
// which a warning is logged. A large buffer indicates that cache population is
// taking a long time relative to the incoming gossip rate.
const pendingUpdatesWarnThreshold = 10_000
Are we ever expecting to hit this value? Have you done some testing on mainnet?
Just wanted to have some safeguard... but perhaps this is too high?
graph/db/graph_cache_state.go (Outdated)

if s.loading {
	s.pendingUpdates = append(s.pendingUpdates, update)

	if len(s.pendingUpdates) == pendingUpdatesWarnThreshold {
Nit:
if len(s.pendingUpdates)%pendingUpdatesWarnThreshold == 0 {
	log.Warnf("Graph cache has %d pending updates "+
		"buffered during population",
		len(s.pendingUpdates))
}
graph/db/options.go (Outdated)

// asyncGraphCachePopulation indicates whether the graph cache
// should be populated asynchronously or if the Start method should
// block until the cache is fully populated.
Nit: maybe add that this is the default
// GraphCacheStatusFailed indicates that the initial population of
// the graph cache failed. Reads fall back to the database.
GraphCacheStatusFailed
Currently, when the graph cache fails to be populated, we just silently continue, and users need to either call GetInfo or check the logs. What about shutting LND down if the graph cache fails? It would be similar to the sync behavior, where we fail in case of an error.
It now uses log.Critical if it errors, which will trigger shutdown 👍
Instead of letting tests set the graphCache to nil in order to simulate it not being set, we instead make use of the WithUseGraphCache helper.
Clean up TestGraphCacheTraversal so that we are explicitly enabling the graphCache. This removes the need to explicitly make calls to the cache. Also remove a duplicate check from assertNodeNotInCache.
Use this to block reading from the cache unless cacheLoaded returns true. This will start being useful once cache population is done asynchronously.
Refactor so that we don't have two layers of indentation later on when we want to spin populateCache off into a goroutine.
Create a cancellable context in Start() and store its cancel function on the struct. Stop() invokes it so that long-running DB iterations (e.g. cache population) can be interrupted promptly during shutdown.
force-pushed from c90ff15 to 3c7d9d7
Thanks for the review @ziggie1984, I've updated things :) @djkazic, just posting a reminder here as I'm unable to add you as a reviewer, it seems 🤔
Reviewing right now :)
LGTM overall, one nit is on
Introduce graphCacheState, a wrapper around GraphCache that tracks its population lifecycle (loading -> loaded) and buffers concurrent mutations during the initial DB scan. Once population completes, buffered updates are replayed and the cache begins serving reads.

Start() now launches populateCache in a background goroutine by default. While the cache is loading, all graph reads fall back to the database. The KV iterators (ForEachNodeCacheable, ForEachChannelCacheable) now respect context cancellation so that Stop() can interrupt a long-running population.

Tests cover: concurrent reads during population, concurrent write replay, shutdown cancellation during load, population failure with DB fallback, and KV iterator cancellation.
Add a new option to opt out of the new asynchronous graph cache loading feature.
Add a GraphCacheStatus enum to GetInfoResponse so callers can tell whether the graph cache is disabled, still loading, or fully loaded. This makes the async graph cache startup state visible to operators and clients without changing the existing DB fallback behaviour for reads.
force-pushed from 3c7d9d7 to f733eed
Thanks @djkazic, updated to not replay on fail

This PR makes the in-memory graph cache load asynchronously during
startup, so that lnd can begin serving RPCs and handling peer
connections without blocking on a full graph scan. On large nodes the
graph cache population can take tens of seconds; with this change it
happens in the background while all graph reads gracefully fall back to
the database until the cache is ready.
Opt-out flag (commit 7):
--db.sync-graph-cache-load restores the old blocking behaviour for users who prefer it.
Observability (commit 8):
GraphCacheStatus enum (DISABLED, LOADING, LOADED) added to GetInfoResponse so operators and clients can monitor cache readiness.
How it works
During loading:
reads → fall back to DB (isLoaded() == false)
writes → DB first, then buffered via applyUpdate()
After loading:
reads → served from cache (lock-free atomic check)
writes → applied directly to cache (no buffering)
Fixes #6187
Replaces #8919