
Introduce a DatabaseLifecycleTracker, which tracks database lifecycles #2652


Draft
wants to merge 1 commit into base: master

Conversation

Contributor

@gefjon commented Apr 21, 2025

Description of Changes

Fixes #2630.

Perhaps it should be called DatabaseLifecycleManager?

This new object is responsible for tracking the lifecycle of a database, and for cleaning up after the database exits.
In particular, it:

  • Unregisters the Host from the containing HostController. This was previously handled by an ad-hoc on-panic callback closure.

  • Aborts the database memory usage metrics reporter task. This was previously handled by a Drop method on Host.

  • Disconnects all connected WebSocket clients. Previously, this didn't happen at all, as per issue 2630.

I've also added some commentary to the WebSocket actor loop.

Follow-up commits will add tests once I've consulted with the team about how best to test this change.
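For context while reviewing, here is a minimal sketch of the shape such a tracker could take, assuming a plain state enum, a tokio watch channel for telling WebSocket actors to disconnect, and an abortable metrics task. All names, fields, and signatures below are illustrative assumptions, not the actual implementation in this PR (for example, the real stop_database is async).

use tokio::sync::watch;
use tokio::task::JoinHandle;

// Illustrative sketch only; the real types in this PR may differ.
#[derive(Debug)]
pub enum DatabaseLifecycle {
    Running,
    Stopped { reason: anyhow::Error },
}

pub struct DatabaseLifecycleTracker {
    lifecycle: DatabaseLifecycle,
    // Receivers held by WebSocket actors observe this and disconnect.
    connected_clients_watcher: watch::Sender<bool>,
    // Memory-usage metrics reporter task, aborted on stop.
    metrics_task: Option<JoinHandle<()>>,
}

impl DatabaseLifecycleTracker {
    pub fn is_stopped(&self) -> bool {
        matches!(self.lifecycle, DatabaseLifecycle::Stopped { .. })
    }

    // Transition to Stopped, abort the metrics task, and signal all
    // connected WebSocket actors to close their connections.
    pub fn stop_database(&mut self, reason: anyhow::Error) {
        if self.is_stopped() {
            return;
        }
        if let Some(task) = self.metrics_task.take() {
            task.abort();
        }
        // Unregistering the Host from the HostController would also happen here.
        let _ = self.connected_clients_watcher.send(true);
        self.lifecycle = DatabaseLifecycle::Stopped { reason };
    }
}

The key design point is that all cleanup funnels through one place, rather than being split between an on-panic callback and a Drop impl.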

API and ABI breaking changes

N/a

Expected complexity level and risk

At least 3. Database lifecycle management is complicated, and it's possible I've either missed cleaning up some resource or cleaned one up too eagerly. There's also a lock around the DatabaseLifecycleTracker which is accessed from several locations and, at one point, held across an .await, which could potentially introduce deadlocks.

Testing

None yet. I will need to discuss with @kim and @jsdt what the best way to test this is.

Contributor Author

@gefjon commented Apr 22, 2025

I manually caused an error using the following diff:

modified   crates/durability/src/imp/local.rs
@@ -195,6 +195,14 @@ impl<T: Encode + Send + Sync + 'static> PersisterTask<T> {
             self.queue_depth.fetch_sub(1, Relaxed);
             trace!("received txdata");
 
+            let time = std::time::SystemTime::now()
+                .duration_since(std::time::SystemTime::UNIX_EPOCH)
+                .unwrap()
+                .as_micros();
+            if time & 0xff == 0xff {
+                panic!("Random panic!");
+            }
+
             // If we are writing one commit per tx, trying to buffer is
             // fairly pointless. Immediately flush instead.
             //

I then ran quickstart-chat with two clients: the normal one, and one that sends messages as fast as possible in a loop.

In both public and private, on both master and this PR, both clients were disconnected immediately (to human perception) when the panic triggered. I will attempt to put the same occasionally-panicking code in other places to see if I can reproduce @kim's issue on master, and from there determine whether this patch fixes it.

@@ -671,6 +701,155 @@ async fn update_module(
}
}

#[derive(Debug)]
pub enum DatabaseLifecycle {
Contributor

Comments explaining the state transitions would be nice here.
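Purely as an illustration of the kind of commentary being asked for (the variant set here is assumed, not taken from the PR), doc comments could spell out the legal transitions on each variant:

// Sketch only; variant names and transitions are hypothetical.
#[derive(Debug)]
pub enum DatabaseLifecycle {
    /// Initial state. Moves to `Running` once the module is loaded and the
    /// `Host` is registered with the `HostController`.
    Starting,
    /// Normal operation. Moves to `Stopped` via `stop_database`, e.g. when a
    /// reducer panics or durability becomes unavailable.
    Running,
    /// Terminal state: metrics task aborted, host unregistered, clients
    /// disconnected. No transitions out.
    Stopped { reason: anyhow::Error },
}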

matches!(self.get_lifecycle(), DatabaseLifecycle::Stopped { .. })
}

pub fn get_lifecycle(&self) -> &DatabaseLifecycle {
Contributor

Do you think this needs to be pub, or is is_stopped enough?

&self.lifecycle
}

pub async fn stop_database(&mut self, reason: anyhow::Error) {
Contributor

What are the guarantees after this function completes? Will this stop any pending or currently running operations, or can operations still be running after this function completes?

@@ -226,6 +226,8 @@ async fn ws_client_actor_inner(
let mut closed = false;
let mut rx_buf = Vec::new();

let mut connected_clients_watcher = client.module.lifecycle.lock().connected_clients_watcher.clone();
Contributor

Can we avoid having the lock be part of the interface? I don't think we want users to hold the lock for more than a function call.
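One sketch of how the lock could stay an implementation detail (assuming the tracker sits behind a non-async mutex and stores a tokio watch::Receiver, as the call site above suggests): expose an accessor that takes and releases the lock within a single call, so the WebSocket actor never sees the guard. The wrapper type and method names below are made up for illustration.

use parking_lot::Mutex; // assumed; a std::sync::Mutex would work the same way
use tokio::sync::watch;

// Hypothetical wrapper; field and method names are illustrative.
pub struct LifecycleHandle {
    inner: Mutex<LifecycleState>,
}

struct LifecycleState {
    connected_clients_watcher: watch::Receiver<bool>,
}

impl LifecycleHandle {
    // Clone the watcher under the lock and drop the guard immediately.
    pub fn subscribe_disconnect(&self) -> watch::Receiver<bool> {
        self.inner.lock().connected_clients_watcher.clone()
    }
}

// The call site in ws_client_actor_inner would then become roughly:
//   let mut connected_clients_watcher = client.module.lifecycle.subscribe_disconnect();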

Contributor

@kim left a comment

I'm seeing the following:

  • durability panics
  • client(s) get disconnected
  • call_identity_disconnected is invoked, which also panics (because durability is unavailable)
  • WARN-level messages appear because the metrics task is already None and the lifecycle tracker is dropped in a !stopped state

We may want to avoid trying to call the disconnect reducer, or else downgrade the log level as it will be normal to shut down in this state.
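A hedged sketch of the guard being suggested (the functions below are stand-ins with made-up signatures; only the control flow is the point): skip the disconnect reducer once the database has stopped, and keep the log quiet since shutting down in this state is expected.

use anyhow::Result;

// Stub standing in for the real lifecycle check.
fn database_is_stopped() -> bool {
    true
}

// Stub standing in for the real reducer call.
async fn call_identity_disconnected() -> Result<()> {
    Ok(())
}

async fn on_client_disconnect() {
    if database_is_stopped() {
        // Expected during shutdown: don't invoke the reducer, and log at
        // DEBUG rather than WARN.
        log::debug!("database already stopped; skipping disconnect reducer");
        return;
    }
    if let Err(e) = call_identity_disconnected().await {
        log::warn!("disconnect reducer failed: {e:#}");
    }
}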

&format!("while executing reducer {reducer}"),
&panic_payload,
);
self.lifecycle.lock().stop_database(err).await;
Contributor

This triggers clippy(await_holding_lock)
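One common way to resolve that lint (a sketch, assuming the tracker is behind a non-async mutex): do the synchronous state change under the lock, collect anything that needs awaiting, drop the guard, and only then await. The alternative is a tokio::sync::Mutex, which may be held across .await at the cost of making every lock site async. The Tracker shape below is hypothetical.

use std::sync::{Arc, Mutex};

// Hypothetical tracker shape, for illustration only.
struct Tracker {
    stopped: bool,
    // Work that must finish after the state change, awaited outside the lock.
    pending_disconnects: Vec<tokio::task::JoinHandle<()>>,
}

async fn stop_database(tracker: &Arc<Mutex<Tracker>>, err: anyhow::Error) {
    // Take what we need while holding the lock, then release it.
    let handles = {
        let mut guard = tracker.lock().unwrap();
        guard.stopped = true;
        log::info!("stopping database: {err:#}");
        std::mem::take(&mut guard.pending_disconnects)
    }; // guard dropped here, before any .await

    for handle in handles {
        let _ = handle.await; // no lock held across this await
    }
}

Either way, the guard must not be live at the .await, which also addresses the deadlock risk mentioned in the PR description.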

Contributor

@kim left a comment

I think this does improve the situation, but I'm observing (Rust SDK) clients not being disconnected promptly.

This could either be a problem with the SDK, or due to the ws_client_actor_inner loop continuing even after sending a close frame (and I'm not sure why we do that). I'll need to investigate further to tell whether that's an actual problem.

Contributor

@kim commented Apr 24, 2025

I'm observing (Rust SDK) clients not being disconnected promptly

Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.

In any case, it's a client issue.
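If the SDK does want to fail fast here, one sketch (not the SDK's actual API; all names below are made up) is to push outgoing reducer calls through a bounded channel, so enqueueing returns an error once the connection stops draining it:

use tokio::sync::mpsc;

#[derive(Debug)]
enum EnqueueError {
    QueueFull,
    Disconnected,
}

// Hypothetical client-side queue of serialized reducer calls.
struct ReducerCallQueue {
    tx: mpsc::Sender<Vec<u8>>,
}

impl ReducerCallQueue {
    fn new(capacity: usize) -> (Self, mpsc::Receiver<Vec<u8>>) {
        let (tx, rx) = mpsc::channel(capacity);
        (Self { tx }, rx)
    }

    // Fail immediately instead of buffering without bound.
    fn enqueue(&self, call: Vec<u8>) -> Result<(), EnqueueError> {
        self.tx.try_send(call).map_err(|e| match e {
            mpsc::error::TrySendError::Full(_) => EnqueueError::QueueFull,
            mpsc::error::TrySendError::Closed(_) => EnqueueError::Disconnected,
        })
    }
}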

@CLAassistant commented May 3, 2025

CLA assistant check
All committers have signed the CLA.

Contributor Author

@gefjon commented May 5, 2025

I'm observing (Rust SDK) clients not being disconnected promptly

Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.

That does not sound like it's by design.

Successfully merging this pull request may close these issues.

Module crash should disconnect all clients