feat: Lazy ETCD initialization #5690

jthomson04 · 2026-01-27T19:21:59Z

ETCD seems to be very sensitive to network and memory pressure. During the network and memory-intensive model loading process (especially when loading from ETCD), pings from the ETCD server can take several seconds to be delivered. To get around this, we lazily initialize ETCD, and only create our lease after the model loading has completed.

Summary by CodeRabbit

Refactor

Enhanced key-value store initialization with improved runtime integration and lazy loading capabilities
Streamlined storage backend configuration and initialization logic


Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

coderabbitai · 2026-01-27T19:26:40Z

Walkthrough

The changes introduce a lazy-initialization pattern for the key-value store with runtime-aware configuration. The Manager struct is refactored to defer KV store creation via Selector.build(), the oneshot dependency is added, and initialization logic across multiple modules is simplified to use the new Manager API.
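A minimal sketch of the deferred-creation pattern described in the walkthrough, using only the standard library. The `Selector`, `Store`, and `StoreError` types below are simplified stand-ins, the `Remote` variant and its `endpoint` field are hypothetical, and the real `Manager` may cache its store differently:

```rust
use std::sync::OnceLock;

/// Which backend to build. Simplified stand-in for the real selector.
#[derive(Clone, Debug)]
pub enum Selector {
    Memory,
    /// Hypothetical networked backend (e.g. etcd).
    Remote { endpoint: String },
}

#[derive(Debug)]
pub struct Store {
    pub backend: String,
}

#[derive(Debug, PartialEq)]
pub struct StoreError(pub String);

impl Selector {
    /// Build the concrete store. In a real backend this is where the
    /// connection (and lease) would be created.
    fn build(&self) -> Result<Store, StoreError> {
        match self {
            Selector::Memory => Ok(Store { backend: "memory".to_string() }),
            Selector::Remote { endpoint } => {
                if endpoint.is_empty() {
                    return Err(StoreError("empty endpoint".to_string()));
                }
                Ok(Store { backend: format!("remote:{}", endpoint) })
            }
        }
    }
}

pub struct Manager {
    selector: Selector,
    store: OnceLock<Store>,
}

impl Manager {
    /// Cheap: nothing is connected until the store is first requested.
    pub fn new(selector: Selector) -> Self {
        Manager { selector, store: OnceLock::new() }
    }

    pub fn is_initialized(&self) -> bool {
        self.store.get().is_some()
    }

    /// Build the store on first use and return the cached one afterwards.
    pub fn get_kv_store(&self) -> Result<&Store, StoreError> {
        if let Some(store) = self.store.get() {
            return Ok(store);
        }
        let built = self.selector.build()?;
        // If another thread won the race, its store is kept and ours is dropped.
        Ok(self.store.get_or_init(|| built))
    }
}
```

Because a failed build stores nothing, a later call can retry, which is one reason the accessor in this sketch returns a `Result`.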

Changes

Cohort / File(s)	Summary
Dependency Management `lib/runtime/Cargo.toml`	Added `oneshot` crate as workspace dependency in `[dependencies]` section.
KV Store Tests `lib/runtime/src/discovery/kv_store.rs`	Replaced `kv::Manager::memory()` with `kv::Manager::default()` across three test cases for consistent test setup.
Initialization & Core Refactoring `lib/runtime/src/distributed.rs`, `lib/runtime/src/storage/kv.rs`	Consolidated manual KV store backend initialization into single `kv::Manager::new(selected_kv_store, Some(runtime.clone()))` call; eliminates per-variant error handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hopping through code with delight,
Lazy init makes stores just right,
OnceCell caches, runtime aware,
Manager refactored with such care!
Oneshot added, errors caught—
Architecture battles, wisely fought! 🎯

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description provides the context and motivation but does not follow the required template structure with sections for Overview, Details, Where to start, and Related Issues.	Restructure the description to match the template: add an Overview section, expand Details with specific changes made, identify files for reviewer focus, and include Related Issues section with any GitHub issue references.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the main behavioral change: lazy initialization of ETCD. It directly reflects the primary objective stated in the PR description and commit messages.


coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@lib/runtime/src/storage/kv.rs`:
- Around line 338-340: The current connection_id() method calls
self.get_kv_store().unwrap() and can panic on misconfigured Manager or failed
lazy init; change connection_id() to return Result<u64, StoreError> (or the
crate's appropriate error type) and propagate the error from get_kv_store()
instead of unwrapping, updating its signature and all callers to handle the
Result; do the same for shutdown() (remove unwraps, return Result<(),
StoreError>) or alternatively replace unwrap() with expect() that includes clear
context only if you decide to keep panics — locate and update the
connection_id(), shutdown(), and get_kv_store() call sites to propagate/handle
the new Result types.

🧹 Nitpick comments (1)

lib/runtime/src/storage/kv.rs (1)

144-183: Consider using block_on or restructuring to avoid blocking an async thread.

The rx.recv() call on line 159 is a synchronous blocking operation that will block the entire OS thread. While the comment acknowledges this is a one-time operation, blocking an async runtime's worker thread can cause issues if:

- The spawned task takes longer than expected (network delays during ETCD connection)
- Other tasks on the same worker thread are starved

Since Selector::build is called from get_kv_store which is synchronous (returns Result<&KeyValueStoreEnum, StoreError>), you may need to reconsider the API design. Alternatives include:

- Making get_kv_store async and using .await instead of blocking
- Using tokio::task::block_in_place to signal this is intentional blocking

That said, given the PR objective is to defer ETCD initialization until after model loading (when the runtime is less busy), this trade-off may be acceptable.

coderabbitai · 2026-01-27T19:26:43Z

lib/runtime/src/storage/kv.rs

    pub fn connection_id(&self) -> u64 {
-        self.0.connection_id()
+        self.get_kv_store().unwrap().connection_id()
    }


⚠️ Potential issue | 🟡 Minor

Potential panic on misconfigured Manager.

connection_id() uses .unwrap() which will panic if get_kv_store() fails. This can occur if:

Manager was created with Selector::Etcd or Selector::File but runtime is None

The ETCD connection fails during lazy initialization

Consider returning Result<u64, StoreError> or documenting this as a precondition. The same applies to shutdown() on line 440.

💡 Suggested alternatives

Option 1: Return Result

-    pub fn connection_id(&self) -> u64 {
-        self.get_kv_store().unwrap().connection_id()
+    pub fn connection_id(&self) -> Result<u64, StoreError> {
+        Ok(self.get_kv_store()?.connection_id())
     }

Option 2: Add expect with context

-        self.get_kv_store().unwrap().connection_id()
+        self.get_kv_store()
+            .expect("KV store must be initialized before calling connection_id")
+            .connection_id()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    pub fn connection_id(&self) -> u64 {
-        self.get_kv_store().unwrap().connection_id()
-    }
+    pub fn connection_id(&self) -> u64 {
+        self.get_kv_store()
+            .expect("KV store must be initialized before calling connection_id")
+            .connection_id()
+    }


grahamking · 2026-01-27T20:54:23Z

lib/runtime/src/storage/kv.rs

+                    Ok(KeyValueStoreEnum::Etcd(EtcdStore::new(
+                        rx.recv()
+                            .map_err(|x| StoreError::from(anyhow::anyhow!(x)))??,
+                    )))


Why create the transport in a different task? Doesn't this resolve to creating it inline?

Oh, I see, it's the async. That's a pain.
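For readers unfamiliar with the `rx.recv()...??` shape in the diff above, here is a rough standard-library analogue. It assumes a plain thread and `std::sync::mpsc` in place of the PR's runtime task and `oneshot` channel, and `connect` is a hypothetical stand-in for the async transport constructor:

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
pub enum StoreError {
    /// The worker went away without sending a result.
    WorkerGone(String),
    /// The worker ran, but the connection attempt failed.
    Connect(String),
}

/// Hypothetical stand-in for the real (async) connect call.
fn connect(endpoint: &str) -> Result<String, StoreError> {
    if endpoint.is_empty() {
        Err(StoreError::Connect("no endpoint".to_string()))
    } else {
        Ok(format!("transport to {}", endpoint))
    }
}

/// Run `connect` elsewhere and block the caller until the result arrives.
pub fn build_blocking(endpoint: &str) -> Result<String, StoreError> {
    let (tx, rx) = mpsc::channel();
    let endpoint = endpoint.to_string();
    thread::spawn(move || {
        // A send error only means the receiver was dropped; nothing to do.
        let _ = tx.send(connect(&endpoint));
    });
    // `recv` yields Result<Result<String, StoreError>, RecvError>.
    // The first `?` handles a closed channel, the second a failed connect.
    let transport = rx
        .recv()
        .map_err(|e| StoreError::WorkerGone(e.to_string()))??;
    Ok(transport)
}
```

The caller stays synchronous, which is why the work has to be handed to something that can drive it to completion while the caller waits.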

grahamking · 2026-01-27T20:58:17Z

lib/runtime/src/storage/kv.rs

-    fn new(s: KeyValueStoreEnum) -> Manager {
-        Manager(Arc::new(s))
+    fn get_kv_store(&self) -> Result<&KeyValueStoreEnum, StoreError> {
+        let selector = self.selector.clone();


You are cloning the selector every time, but only using it on the very first call.

You could move the clone into the function.

If you put it in an Option you can do selector.take() to get it with no copying.
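One way the `Option` plus `take()` idea could look, sketched with simplified stand-in types. Since `get_kv_store` takes `&self`, this sketch assumes a `Mutex` around the `Option` for interior mutability; the `selector_consumed` helper is hypothetical and exists only to make the behavior observable:

```rust
use std::sync::{Mutex, OnceLock};

/// Deliberately not `Clone`: the selector is moved, never copied.
#[derive(Debug)]
pub struct Selector {
    pub name: String,
}

#[derive(Debug)]
pub struct Store {
    pub backend: String,
}

pub struct Manager {
    selector: Mutex<Option<Selector>>,
    store: OnceLock<Store>,
}

impl Manager {
    pub fn new(selector: Selector) -> Self {
        Manager {
            selector: Mutex::new(Some(selector)),
            store: OnceLock::new(),
        }
    }

    pub fn get_kv_store(&self) -> Option<&Store> {
        // Fast path: already built, no locking and no copying.
        if let Some(store) = self.store.get() {
            return Some(store);
        }
        // Hold the lock while building so a concurrent caller waits here
        // and then finds the store already set.
        let mut guard = self.selector.lock().ok()?;
        if let Some(selector) = guard.take() {
            let _ = self.store.set(Store { backend: selector.name });
        }
        drop(guard);
        self.store.get()
    }

    /// True once the selector has been moved out by the first call.
    pub fn selector_consumed(&self) -> bool {
        self.selector.lock().map(|g| g.is_none()).unwrap_or(false)
    }
}
```

The trade-off is that the selector is gone after the first attempt, so with a fallible build a failed initialization could not be retried without putting the selector back.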

grahamking · 2026-01-27T21:00:18Z

lib/runtime/src/storage/kv.rs


    pub fn connection_id(&self) -> u64 {
-        self.0.connection_id()
+        self.get_kv_store().unwrap().connection_id()


We can't have any unwrap() in production code, unless they really can never happen. In that case add a comment: // Safety: why it can't happen.

grahamking · 2026-01-27T21:00:40Z

lib/runtime/src/storage/kv.rs

    /// TODO: Should this be async? Take &mut self?
    pub fn shutdown(&self) {
-        self.0.shutdown()
+        self.get_kv_store().unwrap().shutdown()


Same for unwrap. A panic here would mask any shutdown errors.
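A small sketch of what removing both `unwrap()` calls might look like, with simplified stand-in types. The `ready` and `unconfigured` constructors are hypothetical and only exist so both paths can be exercised:

```rust
#[derive(Debug, PartialEq)]
pub struct StoreError(pub String);

pub struct Store {
    id: u64,
}

impl Store {
    fn connection_id(&self) -> u64 {
        self.id
    }
    fn shutdown(&self) {}
}

pub struct Manager {
    store: Option<Store>,
}

impl Manager {
    /// Hypothetical constructor: a manager whose store is available.
    pub fn ready(id: u64) -> Self {
        Manager { store: Some(Store { id }) }
    }

    /// Hypothetical constructor: a manager whose lazy init would fail.
    pub fn unconfigured() -> Self {
        Manager { store: None }
    }

    fn get_kv_store(&self) -> Result<&Store, StoreError> {
        self.store
            .as_ref()
            .ok_or_else(|| StoreError("kv store not initialized".to_string()))
    }

    /// Propagates the initialization error instead of panicking.
    pub fn connection_id(&self) -> Result<u64, StoreError> {
        Ok(self.get_kv_store()?.connection_id())
    }

    /// The caller sees why shutdown could not run, rather than a panic.
    pub fn shutdown(&self) -> Result<(), StoreError> {
        self.get_kv_store()?.shutdown();
        Ok(())
    }
}
```

Every caller then has to handle the `Result`, which is the cost the later comments in this review weigh against the panic.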

grahamking · 2026-01-27T21:01:19Z

lib/runtime/src/storage/kv.rs


-    fn new(s: KeyValueStoreEnum) -> Manager {
-        Manager(Arc::new(s))
+    fn get_kv_store(&self) -> Result<&KeyValueStoreEnum, StoreError> {


This Result is unfortunate, because it will always be Ok after it's initialized. I wonder if you can design it without?
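One possible shape for a design without the per-call `Result`, offered only as a sketch and not as what the PR does: make initialization an explicit, fallible step that yields a second type whose accessors cannot fail. All names here (`PendingManager`, `ReadyManager`, `connect`, the `Remote` variant) are hypothetical:

```rust
#[derive(Debug, PartialEq)]
pub struct StoreError(pub String);

pub enum Selector {
    Memory,
    /// Hypothetical networked backend.
    Remote { endpoint: String },
}

pub struct Store {
    connection_id: u64,
}

/// Holds only configuration; nothing is connected yet.
pub struct PendingManager {
    selector: Selector,
}

/// Can only be obtained from a successful `connect`.
pub struct ReadyManager {
    store: Store,
}

impl PendingManager {
    pub fn new(selector: Selector) -> Self {
        PendingManager { selector }
    }

    /// The single fallible step, called once model loading has finished.
    pub fn connect(self) -> Result<ReadyManager, StoreError> {
        let store = match self.selector {
            Selector::Memory => Store { connection_id: 0 },
            Selector::Remote { endpoint } => {
                if endpoint.is_empty() {
                    return Err(StoreError("empty endpoint".to_string()));
                }
                // Placeholder id; a real backend would supply its own.
                Store { connection_id: endpoint.len() as u64 }
            }
        };
        Ok(ReadyManager { store })
    }
}

impl ReadyManager {
    /// Infallible: a `ReadyManager` always has a store.
    pub fn connection_id(&self) -> u64 {
        self.store.connection_id
    }

    pub fn shutdown(&self) {}
}
```

This moves the error to one place, at the cost of callers needing to hold the right type at the right time.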

grahamking · 2026-01-27T21:03:21Z

lib/runtime/src/distributed.rs

-                let etcd_client = etcd::Client::new(*etcd_config, runtime_clone).await.inspect_err(|err|
-                    // The returned error doesn't show because of a dropped runtime error, so
-                    // log it first.
-                    tracing::error!(%err, "Could not connect to etcd. Pass `--store-kv ..` to use a different backend or start etcd."))?;


I think you lost this part. That was a hard earned logging statement and comment.
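The removed lines log the error before `?` propagates it. A standard-library sketch of that pattern follows, with a `Vec<String>` standing in for `tracing::error!` and a hypothetical `connect` function in place of `etcd::Client::new`:

```rust
/// Hypothetical stand-in for the real client constructor.
fn connect(endpoint: &str) -> Result<String, String> {
    if endpoint.is_empty() {
        Err("connection refused".to_string())
    } else {
        Ok(format!("client for {}", endpoint))
    }
}

/// Records the failure before returning it, so the message survives even
/// if the propagated error is later hidden by some other error.
pub fn connect_logged(endpoint: &str, log: &mut Vec<String>) -> Result<String, String> {
    let client = connect(endpoint).inspect_err(|err| {
        // In the original this is a tracing::error! call.
        log.push(format!("Could not connect to etcd: {}", err));
    })?;
    Ok(client)
}
```

`Result::inspect_err` leaves the error untouched, so the caller still receives it after the side effect has run.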

grahamking · 2026-01-27T21:05:00Z

It feels like a hack to get around an environment issue. Could you investigate more and find the root cause?

I'd prefer we figure it out upstream. This complicates the code.

jthomson04 added 6 commits January 27, 2026 11:17

Defer kv store creation

4e8d204

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

fmt and clippy

b691a92

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

More fmt

402883a

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

check in cargo lock

cb6c669

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

fix tests

c566054

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

more clippy fixes

d12262d

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

jthomson04 requested a review from a team as a code owner January 27, 2026 19:22

pull-request-size bot added the size/L label Jan 27, 2026

jthomson04 requested a review from grahamking January 27, 2026 19:22

github-actions bot added the feat label Jan 27, 2026

jthomson04 mentioned this pull request Jan 27, 2026

feat: Lazy ETCD initialization #5656

Merged

coderabbitai bot reviewed Jan 27, 2026


grahamking reviewed Jan 27, 2026


grahamking mentioned this pull request Jan 27, 2026

feat: Lazy ETCD initialization #5692

Open
