
graph/db: async graph cache population#10065

Open
ellemouton wants to merge 9 commits into lightningnetwork:master from
ellemouton:asyncGraphCacheLoad

Conversation

@ellemouton
Collaborator

@ellemouton ellemouton commented Jul 10, 2025

This PR makes the in-memory graph cache load asynchronously during
startup, so that lnd can begin serving RPCs and handling peer
connections without blocking on a full graph scan. On large nodes the
graph cache population can take tens of seconds; with this change it
happens in the background while all graph reads gracefully fall back to
the database until the cache is ready.

Opt-out flag (commit 7):

  • --db.sync-graph-cache-load restores the old blocking behaviour for
    users who prefer it.

Observability (commit 8):

  • GraphCacheStatus enum (DISABLED, LOADING, LOADED) added to
    GetInfoResponse so operators and clients can monitor cache readiness.
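The new enum can be sketched in Go roughly as below. This is an illustrative model only; the actual identifiers are generated from the proto definition and may differ:

```go
package main

import "fmt"

// GraphCacheStatus mirrors the enum described above: DISABLED when the
// cache is turned off, LOADING while the background population runs, and
// LOADED once reads are served from memory. Names are illustrative.
type GraphCacheStatus int

const (
	GraphCacheStatusDisabled GraphCacheStatus = iota
	GraphCacheStatusLoading
	GraphCacheStatusLoaded
)

// String renders the status the way it would appear to an operator
// polling GetInfo.
func (s GraphCacheStatus) String() string {
	switch s {
	case GraphCacheStatusDisabled:
		return "DISABLED"
	case GraphCacheStatusLoading:
		return "LOADING"
	case GraphCacheStatusLoaded:
		return "LOADED"
	default:
		return "UNKNOWN"
	}
}

func main() {
	fmt.Println(GraphCacheStatusLoading) // LOADING
}
```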

How it works

  Start()
    ├── creates cancellable ctx
    ├── launches populateCache(ctx) in goroutine
    │     ├── beginPopulation()    ← sets loading=true, starts buffering mutations
    │     ├── iterates DB (respects ctx cancellation)
    │     └── finishPopulation()   ← replays buffered mutations, sets loaded=true
    └── launches handleTopologySubscriptions

During loading:
reads → fall back to DB (isLoaded() == false)
writes → DB first, then buffered via applyUpdate()

After loading:
reads → served from cache (lock-free atomic check)
writes → applied directly to cache (no buffering)
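The lifecycle above can be sketched as a small state wrapper. This is a simplified model under stated assumptions: a plain map guarded by a mutex stands in for the real GraphCache, whereas the PR describes a lock-free atomic check for the loaded flag; all names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// graphCacheState buffers mutations while the initial DB scan runs, then
// replays them and flips the cache to loaded.
type graphCacheState struct {
	mu             sync.Mutex
	loading        bool
	loaded         bool
	pendingUpdates []func(map[string]string)
	cache          map[string]string
}

// beginPopulation marks the cache as loading so that concurrent writes
// start being buffered.
func (s *graphCacheState) beginPopulation() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.loading = true
}

// applyUpdate buffers the mutation during population and applies it
// directly once population is done.
func (s *graphCacheState) applyUpdate(update func(map[string]string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.loading {
		s.pendingUpdates = append(s.pendingUpdates, update)
		return
	}
	update(s.cache)
}

// finishPopulation replays buffered mutations and marks the cache loaded.
func (s *graphCacheState) finishPopulation() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, update := range s.pendingUpdates {
		update(s.cache)
	}
	s.pendingUpdates = nil
	s.loading = false
	s.loaded = true
}

// isLoaded gates reads: false means callers fall back to the database.
func (s *graphCacheState) isLoaded() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.loaded
}

func main() {
	s := &graphCacheState{cache: make(map[string]string)}
	s.beginPopulation()
	s.applyUpdate(func(c map[string]string) { c["chan1"] = "policy" })
	fmt.Println(len(s.cache), s.isLoaded()) // 0 false (still buffered)
	s.finishPopulation()
	fmt.Println(len(s.cache), s.isLoaded()) // 1 true (replayed)
}
```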

Fixes #6187
Replaces #8919

@ellemouton ellemouton self-assigned this Jul 10, 2025
@ellemouton ellemouton changed the title Async graph cache load graph/db: async graph cache load Jul 10, 2025
@ellemouton ellemouton changed the title graph/db: async graph cache load graph/db: async graph cache population Jul 10, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @ellemouton, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances application startup performance by implementing asynchronous loading for the channel graph cache. This change ensures that the system remains available for queries by temporarily serving them from the database while the in-memory cache is being built. A new configuration flag provides flexibility for users to revert to synchronous loading if desired.

Highlights

  • Performance Improvement: The channel graph cache now populates asynchronously on startup, allowing the application to become responsive faster. During the asynchronous loading phase, all read queries will automatically fall back to the database, ensuring continuous availability.
  • Configuration Option: A new configuration option, --db.sync-graph-cache-load, has been introduced. This allows users to opt out of the new asynchronous loading behavior and force the graph cache to load synchronously, blocking startup until completion.
  • Robust Query Handling: Graph query methods have been updated to intelligently check if the in-memory cache is fully loaded. If not, they will transparently query the underlying database, preventing incomplete or stale data from being served from the cache.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The code changes introduce asynchronous graph cache population on startup, improving performance. The implementation includes a fallback to database queries during cache loading and a configuration option for opting out. The code appears well-structured and includes a new test case.

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from 3d83c73 to 9ac1c58 on July 10, 2025 13:01
@ellemouton
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous graph cache population, which is a great performance enhancement for node startup. The implementation is clean, with appropriate configuration options to control the new behavior. The tests are thorough, including a new test for the async loading and updates to existing tests to ensure they remain deterministic. The code adheres to the LND Style Guide.

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from a814e5a to 944c68c on July 10, 2025 16:04
@saubyk saubyk added this to lnd v0.20 Jul 10, 2025
@saubyk saubyk moved this to In progress in lnd v0.20 Jul 10, 2025
Contributor

@GustavoStingelin GustavoStingelin left a comment


🚀

@ellemouton ellemouton changed the base branch from master to elle-reset-callbacks July 11, 2025 10:22
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from 99833e9 to c5372c4 on July 11, 2025 10:24
@djkazic
Contributor

djkazic commented Jul 12, 2025

If there are writes to the underlying graph during cache population, are these queued to be written into the cache, or will those entries just experience a cache miss when a read happens later?

@ellemouton ellemouton marked this pull request as draft September 1, 2025 06:28
@saubyk saubyk removed this from lnd v0.20 Sep 9, 2025
@ellemouton ellemouton deleted the branch lightningnetwork:master November 24, 2025 11:21
@ellemouton ellemouton closed this Nov 24, 2025
@ellemouton ellemouton reopened this Nov 24, 2025
@ellemouton ellemouton changed the base branch from elle-reset-callbacks to master November 24, 2025 11:24
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from a9663ad to 281e080 on March 6, 2026 09:23
ellemouton added a commit to ellemouton/lnd that referenced this pull request Mar 9, 2026
Add testRemoteGraphPolicyUpdate which demonstrates that when a payment
fails due to a stale fee policy (FeeInsufficient), the updated
ChannelUpdate from the failure message is NOT applied to the sender's
graph cache when the channel only exists in the remote graph source.

The test asserts the current broken behavior: the payment fails with
FAILURE_REASON_NO_ROUTE because Zane cannot learn the updated fee from
the failure message, and the cached policy remains stale.

Also includes a small fix to DBGraphSource.SourceNode (use
ErrSourceNodeNotSet instead of ErrGraphNodeNotFound) and a TODO for
making remote cache population async once lightningnetwork#10065 lands.
@saubyk saubyk added this to v0.21 Mar 13, 2026
@saubyk saubyk added this to the v0.21.0 milestone Mar 13, 2026
@saubyk saubyk moved this to In review in v0.21 Mar 13, 2026
@lightninglabs-deploy
Collaborator

@ellemouton, remember to re-request review from reviewers when ready

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from 864c553 to ee57127 on March 17, 2026 10:02
@djkazic
Contributor

djkazic commented Mar 17, 2026

Will review this tomorrow.

@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from 8aa116e to 112c95a on March 17, 2026 15:02
@saubyk saubyk requested a review from ziggie1984 March 17, 2026 15:42
@ziggie1984
Collaborator

@claude review

@claude

claude bot commented Mar 20, 2026

Claude Code is working…

I'll analyze this and get back to you.


Collaborator

@ziggie1984 ziggie1984 left a comment


Looking good, I am going to test it on my node.

I wonder if we should shut down the node if the async population fails?

defer log.Debug("ChannelGraph started")

ctx := context.TODO()
ctx, cancel := context.WithCancel(context.Background())
Collaborator


Why is the context an Option? It seems it's always set?

Collaborator Author


I assume you mean cancel? Well, technically it is only set after Start is called, not at construction time. And if you remember, we made the startup sequence such that it is possible for Stop to be called even if Start was never called.


// errContextDone returns the context error if the context is done (canceled or
// deadline exceeded).
func errContextDone(ctx context.Context) error {
Collaborator


errContextDone is a thin wrapper around ctx.Err() with a nil guard, but the guard doesn't add meaningful safety here. Go's own docs say "do not pass a nil Context": if a nil context were passed it would be a bug in the caller that should surface as a panic, not be silently swallowed by returning nil. The two call sites are straightforward enough to inline:

if err := ctx.Err(); err != nil {
     return err
}

This is more idiomatic and removes a small helper whose nil branch is unreachable in practice.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch :)

c.cache.finishPopulation(loaded)
}()

cache := c.cache.cache
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reads like a typo; can we maybe name them differently?

// pendingUpdatesWarnThreshold is the number of buffered cache mutations at
// which a warning is logged. A large buffer indicates that cache population is
// taking a long time relative to the incoming gossip rate.
const pendingUpdatesWarnThreshold = 10_000
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ever expecting to hit this value? Have you done some testing on mainnet?

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to have some safeguard, but perhaps this is too high?

if s.loading {
s.pendingUpdates = append(s.pendingUpdates, update)

if len(s.pendingUpdates) == pendingUpdatesWarnThreshold {
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:


  if len(s.pendingUpdates) % pendingUpdatesWarnThreshold == 0 {
      log.Warnf("Graph cache has %d pending updates "+
          "buffered during population",
          len(s.pendingUpdates))
  }


// asyncGraphCachePopulation indicates whether the graph cache
// should be populated asynchronously or if the Start method should
// block until the cache is fully populated.
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe add that this is the default


// GraphCacheStatusFailed indicates that the initial population of
// the graph cache failed. Reads fall back to the database.
GraphCacheStatusFailed
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, when the graph cache fails to be populated, we just silently continue, and users need to either call GetInfo or check the logs. What about shutting lnd down if the cache population fails? It would be similar to the sync behaviour, where we fail in case of an error.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it now uses 'log.Critical' if it errors - which will trigger shutdown 👍

Instead of letting tests set the graphCache to nil in order to simulate
it not being set, we instead make use of the WithUseGraphCache helper.
Clean up TestGraphCacheTraversal so that we are explicitly enabling the
graphCache. This removes the need to explicitly make calls to the cache.

Also remove a duplicate check from assertNodeNotInCache.
Use this to block reading from the cache unless cacheLoaded returns
true. This will start being useful once cache population is done
asynchronously.
Refactor so that we don't have two layers of indentation later on when
we want to spin populateCache off into a goroutine.
Create a cancellable context in Start() and store its cancel function
on the struct. Stop() invokes it so that long-running DB iterations
(e.g. cache population) can be interrupted promptly during shutdown.
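A minimal sketch of that Start/Stop pattern, with a goroutine standing in for populateCache. All names are illustrative; the real ChannelGraph in lnd has more moving parts:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ChannelGraph sketch: Start creates a cancellable context for the
// background population goroutine and stores its cancel func on the
// struct; Stop invokes it and waits for the goroutine to exit.
type ChannelGraph struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func (g *ChannelGraph) Start() {
	ctx, cancel := context.WithCancel(context.Background())

	g.mu.Lock()
	g.cancel = cancel
	g.mu.Unlock()

	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		// Stand-in for populateCache(ctx): block until cancelled.
		<-ctx.Done()
	}()
}

// Stop is safe to call even if Start was never called, matching the
// startup sequence discussed in the review thread.
func (g *ChannelGraph) Stop() {
	g.mu.Lock()
	cancel := g.cancel
	g.mu.Unlock()

	if cancel != nil {
		cancel()
	}
	g.wg.Wait()
}

func main() {
	g := &ChannelGraph{}
	g.Start()
	g.Stop()
	fmt.Println("stopped cleanly")
}
```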
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch 2 times, most recently from c90ff15 to 3c7d9d7 Compare March 23, 2026 14:22
@ellemouton
Collaborator Author

Thanks for review @ziggie1984 - I've updated things :)

@djkazic - just posting a reminder here, as it seems I'm unable to add you as a reviewer 🤔

@djkazic
Contributor

djkazic commented Mar 23, 2026

Reviewing right now :)

Collaborator

@ziggie1984 ziggie1984 left a comment


LGTM

@djkazic
Contributor

djkazic commented Mar 23, 2026

LGTM overall, graphCacheState is a nice and clean abstraction. Test coverage is also good.

One nit is on finishPopulation. IIUC, if population fails, the cache is in a partial state, and replaying mutations on top of a partial cache seems pointless since reads won't use it (the isFailed() gate would prevent it). IMO the replay is harmless but unnecessary in that case.

Introduce graphCacheState, a wrapper around GraphCache that tracks its
population lifecycle (loading -> loaded) and buffers concurrent mutations
during the initial DB scan. Once population completes, buffered updates
are replayed and the cache begins serving reads.

Start() now launches populateCache in a background goroutine by default.
While the cache is loading, all graph reads fall back to the database.
The KV iterators (ForEachNodeCacheable, ForEachChannelCacheable) now
respect context cancellation so that Stop() can interrupt a long-running
population.

Tests cover: concurrent reads during population, concurrent write replay,
shutdown cancellation during load, population failure with DB fallback,
and KV iterator cancellation.
Add a new option to opt out of the new asynchronous graph cache loading
feature.
Add a GraphCacheStatus enum to GetInfoResponse so callers can tell
whether the graph cache is disabled, still loading, or fully loaded.

This makes the async graph cache startup state visible to operators and
clients without changing the existing DB fallback behaviour for reads.
@ellemouton ellemouton force-pushed the asyncGraphCacheLoad branch from 3c7d9d7 to f733eed on March 24, 2026 07:53
@ellemouton
Collaborator Author

thanks @djkazic - updated to not replay on fail


Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

channeldb: make channel graph population async

7 participants