Introduce a DatabaseLifecycleTracker, which tracks database lifecycles #2652
base: master
Conversation
I manually caused an error using the following diff:
modified crates/durability/src/imp/local.rs
@@ -195,6 +195,14 @@ impl<T: Encode + Send + Sync + 'static> PersisterTask<T> {
self.queue_depth.fetch_sub(1, Relaxed);
trace!("received txdata");
+ let time = std::time::SystemTime::now()
+ .duration_since(std::time::SystemTime::UNIX_EPOCH)
+ .unwrap()
+ .as_micros();
+ if time & 0xff == 0xff {
+ panic!("Random panic!");
+ }
+
// If we are writing one commit per tx, trying to buffer is
// fairly pointless. Immediately flush instead.
I then ran quickstart-chat with two clients: the normal one, and one that sends messages as fast as possible in a loop. In both public and private, on both master and this PR, both clients were disconnected immediately (to human perception) when the panic triggered. I will attempt to put the same occasionally-panic code in other places to see if I can reproduce @kim's issue on master, and from there whether this patch fixes the issue.
@@ -671,6 +701,155 @@ async fn update_module(
}
}

#[derive(Debug)]
pub enum DatabaseLifecycle {
Comments explaining the state transitions would be nice here.
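For illustration, a minimal sketch of what such comments could look like; only `Stopped { .. }` is visible in this diff, so the `Running` variant name and the `reason` field are assumptions:

```rust
/// Hypothetical sketch, not the PR's actual definition.
#[derive(Debug)]
pub enum DatabaseLifecycle {
    /// The database is up and serving requests.
    /// Transitions to `Stopped` exactly once, via `stop_database`.
    Running,
    /// Terminal state: the `Host` has been unregistered, the metrics task
    /// aborted, and connected WebSocket clients told to disconnect.
    /// There are no transitions out of this state.
    Stopped { reason: anyhow::Error },
}
```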
matches!(self.get_lifecycle(), DatabaseLifecycle::Stopped { .. })
}

pub fn get_lifecycle(&self) -> &DatabaseLifecycle {
Do you think this needs to be `pub`, or is `is_stopped` enough?
&self.lifecycle
}

pub async fn stop_database(&mut self, reason: anyhow::Error) {
What are the guarantees after this function completes? Will this stop any pending or currently running operations, or can operations still be running after this function completes?
@@ -226,6 +226,8 @@ async fn ws_client_actor_inner(
let mut closed = false;
let mut rx_buf = Vec::new();

let mut connected_clients_watcher = client.module.lifecycle.lock().connected_clients_watcher.clone();
Can we avoid having the `lock` be part of the interface? I don't think we want users to hold the lock for more than a function call.
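One possible shape for that, as a sketch (the `ModuleHost`/`LifecycleState` names, the `parking_lot` mutex, and the `()` watch payload are all assumptions based on the snippet above): expose an accessor that takes and releases the lock internally and returns only a cloned receiver.

```rust
use std::sync::Arc;

use parking_lot::Mutex;
use tokio::sync::watch;

// Hypothetical stand-ins for the PR's types.
struct LifecycleState {
    connected_clients_watcher: watch::Receiver<()>,
}

struct ModuleHost {
    lifecycle: Arc<Mutex<LifecycleState>>,
}

impl ModuleHost {
    /// Hold the lock only for the duration of this call and hand back a
    /// cheap clone of the watcher, so callers never see the lock at all.
    fn connected_clients_watcher(&self) -> watch::Receiver<()> {
        self.lifecycle.lock().connected_clients_watcher.clone()
    }
}
```

The call site in `ws_client_actor_inner` would then be something like `client.module.connected_clients_watcher()`, with no guard escaping the accessor.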
I'm seeing the following:
- durability panics
- client(s) get disconnected
- `call_identity_disconnected` is invoked, which also panics (because durability is unavailable)
- WARN-level messages appear because the metrics task is already `None` and the lifecycle tracker is dropped in a `!stopped` state

We may want to avoid trying to call the disconnect reducer, or else downgrade the log level, as it will be normal to shut down in this state.
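A sketch of the first option; the function name, the future parameter, and the `log` macros are hypothetical, and the actual fix might just be a log-level change:

```rust
use std::sync::Arc;

use parking_lot::Mutex;

// Reuses the tracker introduced by this PR (`is_stopped` per the review above);
// everything else here is a hypothetical stand-in.
async fn run_disconnect_reducer(
    lifecycle: Arc<Mutex<DatabaseLifecycleTracker>>,
    call_identity_disconnected: impl std::future::Future<Output = anyhow::Result<()>>,
) {
    // If the database already stopped, the reducer call is doomed to fail and
    // the failure is expected, so skip it (or log at debug rather than warn).
    if lifecycle.lock().is_stopped() {
        log::debug!("database stopped; skipping __identity_disconnected__");
        return;
    }
    if let Err(e) = call_identity_disconnected.await {
        log::warn!("disconnect reducer failed: {e:#}");
    }
}
```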
&format!("while executing reducer {reducer}"),
&panic_payload,
);
self.lifecycle.lock().stop_database(err).await;
This triggers `clippy::await_holding_lock`.
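As a sketch of one way around that (assuming the tracker can live behind an async-aware lock): `tokio::sync::Mutex` guards may be held across `.await` without tripping the lint, which only fires on `std`/`parking_lot` guards. The alternative is to extract what `stop_database` needs under the lock and drop the guard before awaiting.

```rust
use std::sync::Arc;

use tokio::sync::Mutex;

struct DatabaseLifecycleTracker; // stand-in for the PR's tracker

impl DatabaseLifecycleTracker {
    async fn stop_database(&mut self, _reason: anyhow::Error) {
        // resource cleanup elided
    }
}

async fn on_reducer_panic(lifecycle: Arc<Mutex<DatabaseLifecycleTracker>>, err: anyhow::Error) {
    // An async mutex guard is fine to hold across `.await`, so clippy's
    // await_holding_lock lint does not fire here.
    lifecycle.lock().await.stop_database(err).await;
}
```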
I think this does improve the situation, but I'm observing (Rust SDK) clients not being disconnected promptly.
This could either be a problem with the SDK, or due to the `ws_client_actor_inner` loop continuing even after sending a close frame (I'm not sure why we do that). I'll need to investigate further to tell whether that's an actual problem.
Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full. In any case, it's a client issue.
That does not sound like it's by design.
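For reference, a sketch of the "error when the queue becomes too full" idea floated above; all names and the channel-based design are hypothetical, not how the host currently queues reducer calls.

```rust
use tokio::sync::mpsc;

// Hypothetical stand-in for a queued reducer call.
struct ReducerCall;

#[derive(Debug)]
enum EnqueueError {
    QueueFull,
    Closed,
}

// Back-pressure reducer submissions with a bounded channel and reject,
// rather than buffer indefinitely, once it fills up.
struct ReducerQueue {
    tx: mpsc::Sender<ReducerCall>,
}

impl ReducerQueue {
    fn new(capacity: usize) -> (Self, mpsc::Receiver<ReducerCall>) {
        let (tx, rx) = mpsc::channel(capacity);
        (Self { tx }, rx)
    }

    fn enqueue(&self, call: ReducerCall) -> Result<(), EnqueueError> {
        self.tx.try_send(call).map_err(|e| match e {
            mpsc::error::TrySendError::Full(_) => EnqueueError::QueueFull,
            mpsc::error::TrySendError::Closed(_) => EnqueueError::Closed,
        })
    }
}
```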
Description of Changes
Fixes #2630.
Perhaps it should be called `DatabaseLifecycleManager`?
This new object is responsible for tracking the lifecycle of a database, and for cleaning up after the database exits.
In particular, it:
- Unregisters the `Host` from the containing `HostController`. This was previously handled by an ad-hoc on-panic callback closure.
- Aborts the database memory usage metrics reporter task. This was previously handled by a `Drop` method on `Host`.
- Disconnects all connected WebSocket clients. Previously, this didn't happen at all, as per issue #2630.
I've also added some commentary to the WebSocket actor loop.
Follow-up commits will add tests once I've consulted with the team about how best to test this change.
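For orientation, a rough sketch of the shape such a tracker could take, tying the three responsibilities above together; the field names, the watch-channel signalling, and the callback type are assumptions drawn from the review discussion, not the PR's actual code:

```rust
use tokio::sync::watch;
use tokio::task::JoinHandle;

// Same assumed variants as the sketch earlier in the thread.
#[derive(Debug)]
pub enum DatabaseLifecycle {
    Running,
    Stopped { reason: anyhow::Error },
}

// Assumed shape, for illustration only.
pub struct DatabaseLifecycleTracker {
    lifecycle: DatabaseLifecycle,
    // Aborted on stop; previously handled by a `Drop` impl on `Host`.
    metrics_task: Option<JoinHandle<()>>,
    // Previously an ad-hoc on-panic closure registered with `HostController`.
    unregister_host: Option<Box<dyn FnOnce() + Send>>,
    // WebSocket actors watch the receiving end and disconnect when signalled.
    stop_tx: watch::Sender<bool>,
}

impl DatabaseLifecycleTracker {
    pub fn is_stopped(&self) -> bool {
        matches!(self.lifecycle, DatabaseLifecycle::Stopped { .. })
    }

    pub async fn stop_database(&mut self, reason: anyhow::Error) {
        if self.is_stopped() {
            return;
        }
        if let Some(unregister) = self.unregister_host.take() {
            unregister(); // 1. unregister the Host from the HostController
        }
        if let Some(task) = self.metrics_task.take() {
            task.abort(); // 2. stop the memory-usage metrics reporter
        }
        let _ = self.stop_tx.send(true); // 3. tell connected WebSocket clients to disconnect
        self.lifecycle = DatabaseLifecycle::Stopped { reason };
    }
}
```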
API and ABI breaking changes
N/a
Expected complexity level and risk
3 at least. Database lifecycle management is complicated, and it's possible I've either missed cleaning up some resource, or cleaned it up too eagerly. There's also a lock around the `DatabaseLifecycleTracker` which is accessed from a few different locations and at one point held across an `.await` block, which could potentially introduce deadlocks.
Testing
None yet. I will need to discuss with @kim and @jsdt what the best way to test this is.