
Introduce a DatabaseLifecycleTracker, which tracks database lifecycles #2652


Draft
wants to merge 1 commit into base: master

Conversation

Contributor

@gefjon commented Apr 21, 2025

Description of Changes

Fixes #2630.

Perhaps it should be called DatabaseLifecycleManager?

This new object is responsible for tracking the lifecycle of a database, and for cleaning up after the database exits.
In particular, it:

  • Unregisters the Host from the containing HostController. This was previously handled by an ad-hoc on-panic callback closure.

  • Aborts the database memory usage metrics reporter task. This was previously handled by a Drop method on Host.

  • Disconnects all connected WebSocket clients. Previously, this didn't happen at all, as per issue 2630.

I've also added some commentary to the WebSocket actor loop.

Follow-up commits will add tests once I've consulted with the team about how best to test this change.
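For context while reviewing, here is a minimal sketch of the shape such a tracker could take, assuming a plain state enum, a tokio watch channel for telling WebSocket actors to disconnect, and an abortable metrics task. All names, fields, and signatures below are illustrative assumptions, not the actual implementation in this PR (for example, the real stop_database is async).

use tokio::sync::watch;
use tokio::task::JoinHandle;

// Illustrative sketch only; the real types in this PR may differ.
#[derive(Debug)]
pub enum DatabaseLifecycle {
    Running,
    Stopped { reason: anyhow::Error },
}

pub struct DatabaseLifecycleTracker {
    lifecycle: DatabaseLifecycle,
    // Receivers held by WebSocket actors observe this and disconnect.
    connected_clients_watcher: watch::Sender<bool>,
    // Memory-usage metrics reporter task, aborted on stop.
    metrics_task: Option<JoinHandle<()>>,
}

impl DatabaseLifecycleTracker {
    pub fn is_stopped(&self) -> bool {
        matches!(self.lifecycle, DatabaseLifecycle::Stopped { .. })
    }

    // Transition to Stopped, abort the metrics task, and signal all
    // connected WebSocket actors to close their connections.
    pub fn stop_database(&mut self, reason: anyhow::Error) {
        if self.is_stopped() {
            return;
        }
        if let Some(task) = self.metrics_task.take() {
            task.abort();
        }
        // Unregistering the Host from the HostController would also happen here.
        let _ = self.connected_clients_watcher.send(true);
        self.lifecycle = DatabaseLifecycle::Stopped { reason };
    }
}

The key design point is that all cleanup funnels through one place, rather than being split between an on-panic callback and a Drop impl.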

API and ABI breaking changes

N/a

Expected complexity level and risk

At least 3. Database lifecycle management is complicated, and it's possible I've either missed cleaning up some resource or cleaned one up too eagerly. There's also a lock around the DatabaseLifecycleTracker which is accessed from several locations and, at one point, held across an .await, which could potentially introduce deadlocks.

Testing

None yet. I will need to discuss with @kim and @jsdt what the best way to test this is.

Contributor Author

@gefjon commented Apr 22, 2025

I manually caused an error using the following diff:

modified   crates/durability/src/imp/local.rs
@@ -195,6 +195,14 @@ impl<T: Encode + Send + Sync + 'static> PersisterTask<T> {
             self.queue_depth.fetch_sub(1, Relaxed);
             trace!("received txdata");
 
+            let time = std::time::SystemTime::now()
+                .duration_since(std::time::SystemTime::UNIX_EPOCH)
+                .unwrap()
+                .as_micros();
+            if time & 0xff == 0xff {
+                panic!("Random panic!");
+            }
+
             // If we are writing one commit per tx, trying to buffer is
             // fairly pointless. Immediately flush instead.
             //

I then ran quickstart-chat with two clients: the normal one, and one that sends messages as fast as possible in a loop.

In both public and private, on both master and this PR, both clients were disconnected immediately (to human perception) when the panic triggered. I will attempt to put the same occasionally-panicking code in other places to see if I can reproduce @kim's issue on master, and from there determine whether this patch fixes it.

@@ -671,6 +701,155 @@ async fn update_module(
}
}

#[derive(Debug)]
pub enum DatabaseLifecycle {
Contributor

Comments explaining the state transitions would be nice here.
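Purely as an illustration of the kind of commentary being asked for (the variant set here is assumed, not taken from the PR), doc comments could spell out the legal transitions on each variant:

// Sketch only; variant names and transitions are hypothetical.
#[derive(Debug)]
pub enum DatabaseLifecycle {
    /// Initial state. Moves to `Running` once the module is loaded and the
    /// `Host` is registered with the `HostController`.
    Starting,
    /// Normal operation. Moves to `Stopped` via `stop_database`, e.g. when a
    /// reducer panics or durability becomes unavailable.
    Running,
    /// Terminal state: metrics task aborted, host unregistered, clients
    /// disconnected. No transitions out.
    Stopped { reason: anyhow::Error },
}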

matches!(self.get_lifecycle(), DatabaseLifecycle::Stopped { .. })
}

pub fn get_lifecycle(&self) -> &DatabaseLifecycle {
Contributor

Do you think this needs to be pub, or is is_stopped enough?

&self.lifecycle
}

pub async fn stop_database(&mut self, reason: anyhow::Error) {
Contributor

What are the guarantees after this function completes? Will this stop any pending or currently running operations, or can operations still be running after this function completes?

@@ -226,6 +226,8 @@ async fn ws_client_actor_inner(
let mut closed = false;
let mut rx_buf = Vec::new();

let mut connected_clients_watcher = client.module.lifecycle.lock().connected_clients_watcher.clone();
Contributor

Can we avoid having the lock be part of the interface? I don't think we want users to hold the lock for more than a function call.
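One sketch of how the lock could stay an implementation detail (assuming the tracker sits behind a non-async mutex and stores a tokio watch::Receiver, as the call site above suggests): expose an accessor that takes and releases the lock within a single call, so the WebSocket actor never sees the guard. The wrapper type and method names below are made up for illustration.

use parking_lot::Mutex; // assumed; a std::sync::Mutex would work the same way
use tokio::sync::watch;

// Hypothetical wrapper; field and method names are illustrative.
pub struct LifecycleHandle {
    inner: Mutex<LifecycleState>,
}

struct LifecycleState {
    connected_clients_watcher: watch::Receiver<bool>,
}

impl LifecycleHandle {
    // Clone the watcher under the lock and drop the guard immediately.
    pub fn subscribe_disconnect(&self) -> watch::Receiver<bool> {
        self.inner.lock().connected_clients_watcher.clone()
    }
}

// The call site in ws_client_actor_inner would then become roughly:
//   let mut connected_clients_watcher = client.module.lifecycle.subscribe_disconnect();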

Contributor

@kim left a comment

I'm seeing the following:

  • durability panics
  • client(s) get disconnected
  • call_identity_disconnected is invoked, which also panics (because durability is unavailable)
  • WARN-level messages appear because the metrics task is already None and the lifecycle tracker is dropped in a !stopped state

We may want to avoid trying to call the disconnect reducer, or else downgrade the log level as it will be normal to shut down in this state.
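A hedged sketch of the guard being suggested (the functions below are stand-ins with made-up signatures; only the control flow is the point): skip the disconnect reducer once the database has stopped, and keep the log quiet since shutting down in this state is expected.

use anyhow::Result;

// Stub standing in for the real lifecycle check.
fn database_is_stopped() -> bool {
    true
}

// Stub standing in for the real reducer call.
async fn call_identity_disconnected() -> Result<()> {
    Ok(())
}

async fn on_client_disconnect() {
    if database_is_stopped() {
        // Expected during shutdown: don't invoke the reducer, and log at
        // DEBUG rather than WARN.
        log::debug!("database already stopped; skipping disconnect reducer");
        return;
    }
    if let Err(e) = call_identity_disconnected().await {
        log::warn!("disconnect reducer failed: {e:#}");
    }
}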

&format!("while executing reducer {reducer}"),
&panic_payload,
);
self.lifecycle.lock().stop_database(err).await;
Contributor

This triggers clippy(await_holding_lock)
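One common way to resolve that lint (a sketch, assuming the tracker is behind a non-async mutex): do the synchronous state change under the lock, collect anything that needs awaiting, drop the guard, and only then await. The alternative is a tokio::sync::Mutex, which may be held across .await at the cost of making every lock site async. The Tracker shape below is hypothetical.

use std::sync::{Arc, Mutex};

// Hypothetical tracker shape, for illustration only.
struct Tracker {
    stopped: bool,
    // Work that must finish after the state change, awaited outside the lock.
    pending_disconnects: Vec<tokio::task::JoinHandle<()>>,
}

async fn stop_database(tracker: &Arc<Mutex<Tracker>>, err: anyhow::Error) {
    // Take what we need while holding the lock, then release it.
    let handles = {
        let mut guard = tracker.lock().unwrap();
        guard.stopped = true;
        log::info!("stopping database: {err:#}");
        std::mem::take(&mut guard.pending_disconnects)
    }; // guard dropped here, before any .await

    for handle in handles {
        let _ = handle.await; // no lock held across this await
    }
}

Either way, the guard must not be live at the .await, which also addresses the deadlock risk mentioned in the PR description.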

Contributor

@kim left a comment

I think this does improve the situation, but I'm observing (Rust SDK) clients not being disconnected promptly.

This could either be a problem with the SDK, or due to the ws_client_actor_inner loop continuing even after sending a close frame (and I'm not sure why we do that). I'll need to investigate further to tell whether that's an actual problem.

Contributor

@kim commented Apr 24, 2025

I'm observing (Rust SDK) clients not being disconnected promptly

Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.

In any case, it's a client issue.
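If the SDK does want to fail fast here, one sketch (not the SDK's actual API; all names below are made up) is to push outgoing reducer calls through a bounded channel, so enqueueing returns an error once the connection stops draining it:

use tokio::sync::mpsc;

#[derive(Debug)]
enum EnqueueError {
    QueueFull,
    Disconnected,
}

// Hypothetical client-side queue of serialized reducer calls.
struct ReducerCallQueue {
    tx: mpsc::Sender<Vec<u8>>,
}

impl ReducerCallQueue {
    fn new(capacity: usize) -> (Self, mpsc::Receiver<Vec<u8>>) {
        let (tx, rx) = mpsc::channel(capacity);
        (Self { tx }, rx)
    }

    // Fail immediately instead of buffering without bound.
    fn enqueue(&self, call: Vec<u8>) -> Result<(), EnqueueError> {
        self.tx.try_send(call).map_err(|e| match e {
            mpsc::error::TrySendError::Full(_) => EnqueueError::QueueFull,
            mpsc::error::TrySendError::Closed(_) => EnqueueError::Disconnected,
        })
    }
}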

@CLAassistant commented May 3, 2025

CLA assistant check
All committers have signed the CLA.

Contributor Author

@gefjon commented May 5, 2025

I'm observing (Rust SDK) clients not being disconnected promptly

Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.

That does not sound like it's by design.

Successfully merging this pull request may close these issues.

Module crash should disconnect all clients