
Conversation

Contributor

@EnriqueL8 EnriqueL8 commented Feb 10, 2025

Proposed changes

Add a new function in the Operation Handler called BulkOperationUpdates that takes in a series of OperationUpdate objects and notifies the submitter when the transaction has been committed to the database.

This is important for blockchain plugins that want to preserve the ordering of receipts and update the operations before acknowledging to the blockchain connector that FireFly has accepted those receipts.

Reasons why I went this way:

  • The idempotency key is the Operation ID, so the blockchain plugin can easily correlate a receipt with its operation
  • The blockchain plugin framework already has the concept of callbacks for updating operations with transaction information
  • These callbacks go through the operation manager and finally end up in the DB. Today it's fire-and-forget, as the data gets offloaded to an arbitrary number of workers that write to the DB.
  • We need a way to tell the blockchain plugin when the operation updates have been committed, so it can acknowledge a batch in the durable event stream and carry on processing. This function is now synchronous, and it is the responsibility of the connector to wait.

Key parts of the code to look at

  • doBatchUpdateWithRetry runs in a retry.Do and in a DB RunAsGroup; it retries forever, see:

    func (ou *operationUpdater) doBatchUpdateWithRetry(ctx context.Context, updates []*core.OperationUpdate) error {
        return ou.retry.Do(ctx, "operation update", func(attempt int) (retry bool, err error) {
            err = ou.database.RunAsGroup(ctx, func(ctx context.Context) error {
                return ou.doBatchUpdate(ctx, updates)
            })
            if err != nil {
                return true, err
            }
            for _, update := range updates {
                if update.OnComplete != nil {
                    update.OnComplete()
                }
            }
            return false, nil
        })
    }

  • The update worker gets picked based on the operation ID here (see the sketch after this list):

    func (ou *operationUpdater) pickWorker(ctx context.Context, id *fftypes.UUID, update *core.OperationUpdate) chan *core.OperationUpdate {

  • Workers call doBatchUpdateWithRetry after a timeout, or once a batch size limit is reached, to write:

    if batch != nil && (timedOut || len(batch.updates) >= ou.conf.maxInserts) {
        batch.timeoutCancel()
        err := ou.doBatchUpdateWithRetry(ctx, batch.updates)
        if err != nil {
            log.L(ctx).Debugf("Operation update worker exiting: %s", err)
            return
        }
        batch = nil
    }
    - this batching is in memory and causes ordering problems!
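
For a flavour of the second bullet, here is a minimal sketch of how worker affinity can be derived from an operation ID. This is an illustration only, not the real pickWorker implementation, and pickWorkerIndex is a hypothetical helper:

    package sketch

    import (
        "hash/fnv"

        "github.com/hyperledger/firefly-common/pkg/fftypes"
    )

    // pickWorkerIndex (hypothetical) maps an operation UUID onto one of numWorkers
    // queues, so every update for the same operation lands on the same worker.
    // numWorkers is assumed to be > 0.
    func pickWorkerIndex(id *fftypes.UUID, numWorkers int) int {
        h := fnv.New32a()
        h.Write([]byte(id.String())) // hash the textual form of the operation UUID
        return int(h.Sum32() % uint32(numWorkers))
    }

Per-operation affinity keeps the updates for a single operation in order, but batches on different workers commit independently, which is the kind of cross-operation ordering gap the last bullet calls out.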

Contributes to #1622

Example blockchain plugin code that has a durable websocket connection:

func (bc *blockchainConnector) eventLoop() {
    // Get events from the blockchain connector through a websocket for
    // transactions that have completed: receipts

    // Synchronous: returns once the updates have been committed
    err := bc.callbacks.BulkOperationUpdates(ctx, namespace, updates, onCommit)
    if err != nil {
        // Retry based on what the error is
    }

    // Acknowledge receipts once they have been committed to the DB
    bc.ws.ack()
}

Types of changes

  • Bug fix
  • New feature added
  • Documentation Update

Please make sure to follow these points

  • I have read the contributing guidelines.
  • I have performed a self-review of my own code or work.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • My Pull Request title is in the format < issue name >, e.g. Added links in the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • My changes have sufficient code coverage (unit, integration, e2e tests).

Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
New callback that can be called to insert a number of operation updates reliably

@EnriqueL8
Contributor Author

Just thought of something different: instead of using a channel to communicate back to the blockchain connector that it's done, I think it's better if the blockchain connector gets a callback when it's done. So the connector passes in a callback function:


type OperationCallbacks interface {
    OperationUpdate(update *OperationUpdate)
    BulkOperationUpdates(ctx context.Context, updates []*OperationUpdate, onCommit chan<- error)
Contributor

@peterbroadhurst peterbroadhurst Feb 24, 2025


I think passing a channel might be a little sticky. We've gone back and forth on this a few times in other code interfaces across the FireFly ecosystem.

My feeling is a callback function is much easier to pass in, and you then leave the caller the responsibility to decide if they want that callback to use a channel or not.
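
For illustration, a rough sketch of that shape, with a caller layering a channel on top if it wants one. The names and signature here are illustrative only (and the discussion below ends up dropping the callback entirely in favour of a plain synchronous return):

    package sketch

    import (
        "context"

        "github.com/hyperledger/firefly/pkg/core"
    )

    // Illustrative only: a callback-style variant of the interface under discussion.
    type operationCallbacks interface {
        BulkOperationUpdates(ctx context.Context, updates []*core.OperationUpdate, onCommit func(err error))
    }

    // A caller that prefers a channel can simply wrap one in the callback.
    func waitForCommit(ctx context.Context, cb operationCallbacks, updates []*core.OperationUpdate) error {
        done := make(chan error, 1)
        cb.BulkOperationUpdates(ctx, updates, func(err error) { done <- err })
        select {
        case err := <-done:
            return err
        case <-ctx.Done():
            return ctx.Err()
        }
    }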

Contributor Author


Yes! #1637 (comment) - I got to the same conclusion after reading more code recently, and that is what I'm reworking 💪🏼

Contributor


Cool - I think maybe with the discussion below around making the interface synchronous, this might become a moot point

}
log.L(om.ctx).Debugf("%s updating operation %s status=%s%s", update.Plugin, update.NamespacedOpID, update.Status, errString)
}
om.updater.SubmitBulkOperationUpdates(ctx, updates, onCommit)
Contributor


What is it that absolutely guarantees that this list of receipts, in the order they were submitted, will be committed before any other receipts submitted to this function?

Contributor Author


Thanks, very insightful. I realize that today it is up to the consumer to call this function only once at a time, so this function should handle that case and not allow it to happen.

Contributor


👍 - with the other discussions to make it a synchronous interface, I think you could just handle this as a comment on the API doc:

An insertion ordering guarantee is only provided when this code is called on a single goroutine inside of the connector. It is the responsibility of the connector code to allocate that routine, and ensure that there is only one.
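
For example, something along these lines on the (by then synchronous) method - wording and signature illustrative only:

    type OperationCallbacks interface {
        // BulkOperationUpdates writes the supplied operation updates to the database
        // in a single transaction, in the order supplied, and only returns once that
        // transaction has been committed.
        //
        // An insertion ordering guarantee is only provided when this is called from a
        // single goroutine inside of the connector. It is the responsibility of the
        // connector code to allocate that routine, and to ensure that there is only one.
        BulkOperationUpdates(ctx context.Context, updates []*core.OperationUpdate) error
    }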

Comment on lines 114 to 117
// Notice how this is not using the workers
// The reason is that we want all updates to be stored at once, in this order
// If offloaded to the workers, the updates would be processed in parallel, in different DB transactions, and in a different order
go func() {
Contributor

@peterbroadhurst peterbroadhurst Feb 24, 2025


I do think this is the fundamental design question in this PR, but I'm currently unclear on how this solves it.
This goroutine just gets fired off - whose responsibility is it, and exactly how, to do the ordering against the API in a way that means future batches are strictly ordered against completion of this batch?

Fundamentally - isn't this over-complicated? Shouldn't we just make it synchronous and use the caller's goroutine?

Contributor Author


I thought about making it synchronous, and wanted the connector to be able to wait for a notification back, but you are right: if we are treating this API as "persist and wait until done" in the connector, this should absolutely be synchronous.

Contributor


I think that could be a big simplification overall. Push the responsibility to the connector to:

  1. Manage the goroutine that is the receipt writer
  2. Perform suitable performance batching into this code
  3. Handle all errors/retries
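
A rough sketch of what that could look like on the connector side. All names here are hypothetical, and the BulkOperationUpdates signature is assumed to be the synchronous form discussed above:

    package sketch

    import (
        "context"
        "time"

        "github.com/hyperledger/firefly/pkg/core"
    )

    // The callback the connector is given - assumed synchronous, per the discussion above.
    type bulkUpdater interface {
        BulkOperationUpdates(ctx context.Context, updates []*core.OperationUpdate) error
    }

    // Hypothetical connector-side receipt writer: one goroutine owns batching,
    // retrying, and acking, so ordering is preserved end to end.
    type receiptWriter struct {
        callbacks  bulkUpdater
        nextBatch  func(ctx context.Context) (updates []*core.OperationUpdate, ack func(), err error) // hypothetical source of batched receipts, plus their ack
        retryDelay time.Duration
    }

    func (rw *receiptWriter) loop(ctx context.Context) {
        for {
            updates, ack, err := rw.nextBatch(ctx)
            if err != nil {
                return // stream closed or context cancelled
            }
            // Retry the synchronous bulk write until it succeeds, or the context ends.
            for {
                if err := rw.callbacks.BulkOperationUpdates(ctx, updates); err == nil {
                    break
                }
                if ctx.Err() != nil {
                    return
                }
                time.Sleep(rw.retryDelay)
            }
            // Only ack to the durable event stream once the DB commit has happened.
            ack()
        }
    }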

Contributor Author


Yeah, that massively simplifies this code to a synchronous bulk write of receipts

Contributor Author


There is a caveat to the above, which is that doBatchUpdateWithRetry retries indefinitely. I think we need to break that if we are making it synchronous.

}
} else {
// Less than the max inserts, just do it in one go
err := ou.doBatchUpdateWithRetry(ctx, updates)
Contributor


If you make it synchronous you do not need retry here - the caller can be informed it is their responsibility to retry
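
i.e. roughly this, based on the doBatchUpdateWithRetry snippet above (doBulkUpdate is a hypothetical name for the retry-free variant):

    // Sketch only: run the whole batch in one DB transaction and hand any error
    // straight back to the connector, which owns the retry policy in the
    // synchronous design.
    func (ou *operationUpdater) doBulkUpdate(ctx context.Context, updates []*core.OperationUpdate) error {
        return ou.database.RunAsGroup(ctx, func(ctx context.Context) error {
            return ou.doBatchUpdate(ctx, updates)
        })
    }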


// This retries forever until there is no error
// but returns on cancelled context
if len(updates) > ou.conf.maxInserts {
Contributor

@peterbroadhurst peterbroadhurst Feb 24, 2025


You can choose to push this responsibility back to the connector. I don't think you need to try and do sub-pagination inside of this implementation.
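
If the connector does take that on, the chunking itself is simple enough - a sketch only, where chunkUpdates and maxInserts are hypothetical connector-side names:

    // Sketch: split a large set of updates into bounded batches before each
    // synchronous BulkOperationUpdates call.
    func chunkUpdates(updates []*core.OperationUpdate, maxInserts int) [][]*core.OperationUpdate {
        if maxInserts < 1 {
            return [][]*core.OperationUpdate{updates}
        }
        var batches [][]*core.OperationUpdate
        for len(updates) > maxInserts {
            batches = append(batches, updates[:maxInserts])
            updates = updates[maxInserts:]
        }
        if len(updates) > 0 {
            batches = append(batches, updates)
        }
        return batches
    }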

Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
@EnriqueL8 EnriqueL8 marked this pull request as ready for review February 25, 2025 11:53
@EnriqueL8 EnriqueL8 requested a review from a team as a code owner February 25, 2025 11:53
@codecov

codecov bot commented Feb 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.95%. Comparing base (d484497) to head (456279a).
Report is 22 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1637   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files         337      337           
  Lines       29524    29590   +66     
=======================================
+ Hits        29512    29578   +66     
  Misses          8        8           
  Partials        4        4           


Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
@EnriqueL8
Contributor Author

@peterbroadhurst This is good to re-review. Test coverage should be good as well.

There was a slight doubt about whether this function should restrict and validate the Plugin name inside each operation update. I left it up to the connector plugin to set the right one, but I might add validation there, or make it take a plugin.Named() and use that for all the operation updates.

    return err
}
for _, update := range updates {
    if update.OnComplete != nil {
Contributor


What's the purpose of this in practice now?

Contributor Author

@EnriqueL8 EnriqueL8 Feb 25, 2025


I struggled with this one. It's still available in the core.OperationUpdate struct, but it doesn't serve any purpose here, as we are returning synchronously, so the top-level connector goroutine can do what it needs once the updates have been committed.

So we could remove it, but we need to leave it for the existing flow, and just comment on this flow that it will not be used.

Contributor


Why @EnriqueL8 ?
Because you think there's other stuff on core.OperationUpdate that needs to be on this batch API?

Contributor Author


The rest of the fields in core.OperationUpdate are needed in this batch API, so yes, we could create a new type that has just that subset of fields.

Contributor Author


Thanks @peterbroadhurst for the nudge and suggestion - moved to OperationUpdateAsync and OperationUpdate
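
i.e. roughly this shape - a hedged sketch only, with an illustrative (not exhaustive) field list; see the core package for the final definitions:

    // OperationUpdate carries the data that gets persisted for an operation update.
    type OperationUpdate struct {
        Plugin         string
        NamespacedOpID string
        Status         OpStatus
        BlockchainTXID string
        ErrorMessage   string
        Output         fftypes.JSONObject
    }

    // OperationUpdateAsync wraps it with the in-memory-only OnComplete hook that
    // the existing async worker path uses.
    type OperationUpdateAsync struct {
        OperationUpdate
        OnComplete func()
    }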

Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
Contributor

@peterbroadhurst peterbroadhurst left a comment


Thanks @EnriqueL8 - looks great, and appreciate you going round the loop with me on this

@EnriqueL8 EnriqueL8 merged commit 8a0d5c4 into main Feb 25, 2025
19 checks passed
@EnriqueL8 EnriqueL8 deleted the reliable_receipts branch February 25, 2025 15:57