Avoid I/O after deleting client flows #4179
Conversation
This change also removes a value-shadowing error that caused empty index files to be written.
houseKeeping only needs to do work when flows have been deleted from a client. Passing the flow IDs along with the client IDs allows us to filter the deleted flows out of the index using removeFlowsFromIndex instead of rebuilding the index from scratch.
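For illustration, a minimal sketch of that filtering step, assuming the index is a flat list of flow IDs (the actual index format and function signature in the PR may differ):

// Sketch: drop deleted flow IDs from an index instead of rebuilding it
// from the datastore.
func removeFlowsFromIndex(index []string, deleted []string) []string {
	gone := make(map[string]bool, len(deleted))
	for _, flow_id := range deleted {
		gone[flow_id] = true
	}
	// In-place filter: keep only flow IDs that were not deleted.
	filtered := index[:0]
	for _, flow_id := range index {
		if !gone[flow_id] {
			filtered = append(filtered, flow_id)
		}
	}
	return filtered
}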
self.mu.Unlock()

for _, client_id := range pending {
	err := self.buildFlowIndexFromDatastore(
I am worried about coherency here: if a new flow is created after a flow is deleted, then the safest option is to rebuild the index from scratch.
I could add another field, addedFlows map[string][]string, that is used to tell houseKeeping about flows that need to be present in the index. If they are not, only those flows would need to be re-read. In the worst case, any missing flow would be added in the next loop iteration.
What do you think?
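A rough sketch of that bookkeeping, with hypothetical fields and helper names just to make the idea concrete (not the PR's actual FlowStorageManager definition):

import "sync"

// Hypothetical fields: flows that must end up in the index, keyed by
// client, so houseKeeping can re-read only those flows.
type FlowStorageManager struct {
	mu         sync.Mutex
	addedFlows map[string][]string // client_id -> flow IDs to ensure
}

// noteAddedFlow records a flow that the next houseKeeping run must
// verify is present in the client's index.
func (self *FlowStorageManager) noteAddedFlow(client_id, flow_id string) {
	self.mu.Lock()
	defer self.mu.Unlock()
	self.addedFlows[client_id] = append(self.addedFlows[client_id], flow_id)
}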
The method buildFlowIndexFromDatastore() returns the ground truth about the index of flows. The index is just a quick way to access the list of flows within the directory, but the ground truth of which flows are present is still kept within the individual flow objects.
With this PR we never really rebuild the ground truth because we only look at changes. If some changes are lost (e.g. due to a server restart between housekeeping runs) we can get the index into an inconsistent state with no way to restore it (other than removing the index completely, which will force a rebuild).
I think a more robust solution involves writing a journal of changes to the index and then rebuilding the index based on this (i.e. write a journal with the contents of addedFlows and removedFlows immediately, then in the housekeeping thread rebuild the index based on that journal).
The current code tries to amortize the cost of rebuilding by delaying the rebuild to once per minute (previously we rebuilt for each change). Do you still see a large I/O demand due to this? Can we either increase the rebuild interval or sleep a bit during the rebuild so as not to spike it?
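For illustration, a minimal sketch of the journal idea described above; the entry type, names, and in-memory index shape are assumptions, not the code that was eventually committed:

// Assumed journal record: one entry per change, replayed in order.
type journalEntry struct {
	ClientId string
	Added    []string
	Removed  []string
}

// replayJournal applies the journaled changes in order, so the index
// converges to the state recorded in the journal rather than to
// whatever incremental state survived a restart.
func replayJournal(entries []journalEntry, index map[string]map[string]bool) {
	for _, entry := range entries {
		flows := index[entry.ClientId]
		if flows == nil {
			flows = make(map[string]bool)
			index[entry.ClientId] = flows
		}
		for _, flow_id := range entry.Added {
			flows[flow_id] = true
		}
		for _, flow_id := range entry.Removed {
			delete(flows, flow_id)
		}
	}
}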
Yes, we still saw large I/O demands before applying the patches that are in this PR so far. When we deleted hunts and all their associated flows, we would cause flow indexes for a lot of clients to be rebuilt from within houseKeeping – I can't see much of an amortization effect here.
I assume that flow indexes are only written by the master, so we should be able to avoid race conditions using mutexes. I'll add a patch to this PR, hopefully later today.
Consistency/durability across server restarts or crashes is a more involved issue – we seem to be wandering into "proper" database territory. ;-)
I added a journal-based implementation that I feel is more robust. Please take a look.
Looks good to me. Loving the very clear prose comment at the beginning of journal.go!
If I'm not mistaken, there's still an opportunity for the client flow index to be corrupted by concurrent write operations – from removeClientFlowsFromIndex, buildFlowIndexFromDatastore, and WriteFlowIndex. Protecting client flow index files using a mutex seems to be the sensible thing to do. (This problem seems to have existed before this PR.)
I won't be able to properly test the journal implementation before Tuesday.
Code draft for client-specific flow index locks:

func (self *FlowStorageManager) lockFlowIndex(client_id string) {
	for {
		self.mu.Lock()
		if _, locked := self.flowIndexLocks[client_id]; locked {
			// Another writer holds this client's index; back off and retry.
			self.mu.Unlock()
			time.Sleep(100 * time.Millisecond)
		} else {
			// Mark this client's index as held, then release the map lock.
			self.flowIndexLocks[client_id] = struct{}{}
			self.mu.Unlock()
			return
		}
	}
}

func (self *FlowStorageManager) unlockFlowIndex(client_id string) {
	self.mu.Lock()
	delete(self.flowIndexLocks, client_id)
	self.mu.Unlock()
}
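A hypothetical caller would bracket every index write with the pair, for example:

// Hypothetical usage: every index writer takes the per-client lock first.
func (self *FlowStorageManager) rewriteIndexForClient(client_id string) {
	self.lockFlowIndex(client_id)
	defer self.unlockFlowIndex(client_id)
	// ... rewrite the client's flow index file while no other writer runs ...
}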
That's a good point: I moved all the functions regarding building the index into the (per-client) flowIndexBuilder, which means they all share the same mutex and cannot run at the same time.
Move removeClientFlowsFromIndex to the flowIndexBuilder to mediate contention with buildFlowIndexFromDatastore
This shares the lock with the other index-building functions.
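For illustration, a rough shape of that refactor; the actual flowIndexBuilder in the PR may be organized differently:

import "sync"

// Per-client builder: all index mutations go through methods that hold
// the same mutex, so a removal can never interleave with a full rebuild
// for the same client.
type flowIndexBuilder struct {
	mu        sync.Mutex
	client_id string
}

func (self *flowIndexBuilder) removeClientFlowsFromIndex(flow_ids []string) {
	self.mu.Lock()
	defer self.mu.Unlock()
	// ... filter flow_ids out of the on-disk index ...
}

func (self *flowIndexBuilder) buildFlowIndexFromDatastore() {
	self.mu.Lock()
	defer self.mu.Unlock()
	// ... rebuild the index from the flow objects (the ground truth) ...
}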
I have successfully tested your patches in our EFS-backed master/minion setup. Seems to perform well and we haven't noticed any irregularities.
Ok thanks for testing it! I will merge this now.
Commit 786e656 introduced a background job for updating client flow indices after deleting flows.
This background job always constructed indices from scratch, leading to slow I/O in a master/minion setup backed by Amazon EFS.
This PR allows individual flow IDs to be passed to the background job so it can simply filter them out of the index files, similar to the path taken when delete_flow(…, sync=true) is called.