
Conversation

Contributor

@EnriqueL8 EnriqueL8 commented Feb 10, 2025

Proposed changes

Add a new function in the Operation Handler called BulkOperationUpdates that takes in a series of OperationUpdate objects and notifies the submitter when the transaction has been committed to the database.

This is important for blockchain plugins that want to preserve the ordering of receipts and update the operations before acknowledging to the blockchain connector that FireFly has accepted those receipts.

Reasons why I went this way:

  • The idempotency key is the Operation ID, so the blockchain plugin can easily correlate a receipt with its operation
  • The blockchain plugin framework already has the concept of callbacks for updating operations with transaction information
  • These callbacks go through the operation manager and finally end up in the DB. Today it's fire-and-forget, as the data gets offloaded to an arbitrary number of workers that write to the DB.
  • We need a way to tell the blockchain plugin when the operation updates have been committed, so it can acknowledge a batch in the durable event stream and carry on processing. This function is now synchronous, and it is the responsibility of the connector to wait.

Key parts of the code to look at

  • doBatchUpdateWithRetry runs in a retry.Do and in a DB RunAsGroup; it retries forever, see:

    func (ou *operationUpdater) doBatchUpdateWithRetry(ctx context.Context, updates []*core.OperationUpdate) error {
        return ou.retry.Do(ctx, "operation update", func(attempt int) (retry bool, err error) {
            err = ou.database.RunAsGroup(ctx, func(ctx context.Context) error {
                return ou.doBatchUpdate(ctx, updates)
            })
            if err != nil {
                return true, err
            }
            for _, update := range updates {
                if update.OnComplete != nil {
                    update.OnComplete()
                }
            }
            return false, nil
        })
    }

  • The update worker gets picked based on the operation ID here (see the sketch after this list):

    func (ou *operationUpdater) pickWorker(ctx context.Context, id *fftypes.UUID, update *core.OperationUpdate) chan *core.OperationUpdate {

  • Workers call doBatchUpdateWithRetry after a timeout, or once a batch size limit is reached, to write:

    if batch != nil && (timedOut || len(batch.updates) >= ou.conf.maxInserts) {
        batch.timeoutCancel()
        err := ou.doBatchUpdateWithRetry(ctx, batch.updates)
        if err != nil {
            log.L(ctx).Debugf("Operation update worker exiting: %s", err)
            return
        }
        batch = nil
    }
    - this batching is in memory and causes ordering problems!
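
For a flavour of the second bullet, here is a minimal sketch of how worker affinity can be derived from an operation ID. This is an illustration only, not the real pickWorker implementation, and pickWorkerIndex is a hypothetical helper:

    package sketch

    import (
        "hash/fnv"

        "github.com/hyperledger/firefly-common/pkg/fftypes"
    )

    // pickWorkerIndex (hypothetical) maps an operation UUID onto one of numWorkers
    // queues, so every update for the same operation lands on the same worker.
    // numWorkers is assumed to be > 0.
    func pickWorkerIndex(id *fftypes.UUID, numWorkers int) int {
        h := fnv.New32a()
        h.Write([]byte(id.String())) // hash the textual form of the operation UUID
        return int(h.Sum32() % uint32(numWorkers))
    }

Per-operation affinity keeps the updates for a single operation in order, but batches on different workers commit independently, which is the kind of cross-operation ordering gap the last bullet calls out.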

Contributes to #1622

Example blockchain plugin code that has a durable websocket connection:

func (bc *blockchainConnector) eventLoop() {
    // Get events from the blockchain connector through a websocket for
    // transactions that have completed: receipts

    // Synchronous: returns once the updates have been committed
    err := bc.callbacks.BulkOperationUpdates(ctx, namespace, updates, onCommit)
    if err != nil {
        // Retry based on what the error is
    }

    // Acknowledge receipts once they have been committed to the DB
    bc.ws.ack()
}

Types of changes

  • Bug fix
  • New feature added
  • Documentation Update

Please make sure to follow these points

  • I have read the contributing guidelines.
  • I have performed a self-review of my own code or work.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • My Pull Request title is in the format < issue name >, e.g. Added links in the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • My changes have sufficient code coverage (unit, integration, e2e tests).

Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
New callback that can be called to insert a number of operation updates reliably

@EnriqueL8
Contributor Author

Just thought of something different: instead of using a channel to communicate back to the blockchain connector that it's done, I think it's better if the blockchain connector gets a callback when it's done. So the connector passes in a callback function:


type OperationCallbacks interface {
    OperationUpdate(update *OperationUpdate)
    BulkOperationUpdates(ctx context.Context, updates []*OperationUpdate, onCommit chan<- error)
Contributor

@peterbroadhurst peterbroadhurst Feb 24, 2025


I think passing a channel might be a little sticky. We've gone back and forth on this a few times in other code interfaces across the FireFly ecosystem.

My feeling is a callback function is much easier to pass in, and you then leave the caller the responsibility to decide if they want that callback to use a channel or not.
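
For illustration, a rough sketch of that shape, with a caller layering a channel on top if it wants one. The names and signature here are illustrative only (and the discussion below ends up dropping the callback entirely in favour of a plain synchronous return):

    package sketch

    import (
        "context"

        "github.com/hyperledger/firefly/pkg/core"
    )

    // Illustrative only: a callback-style variant of the interface under discussion.
    type operationCallbacks interface {
        BulkOperationUpdates(ctx context.Context, updates []*core.OperationUpdate, onCommit func(err error))
    }

    // A caller that prefers a channel can simply wrap one in the callback.
    func waitForCommit(ctx context.Context, cb operationCallbacks, updates []*core.OperationUpdate) error {
        done := make(chan error, 1)
        cb.BulkOperationUpdates(ctx, updates, func(err error) { done <- err })
        select {
        case err := <-done:
            return err
        case <-ctx.Done():
            return ctx.Err()
        }
    }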

Contributor Author


Yes! #1637 (comment) - I got to the same conclusion after reading more code recently, and that is what I'm reworking 💪🏼

Contributor


Cool - I think maybe with the discussion below around making the interface synchronous, this might become a moot point

}
log.L(om.ctx).Debugf("%s updating operation %s status=%s%s", update.Plugin, update.NamespacedOpID, update.Status, errString)
}
om.updater.SubmitBulkOperationUpdates(ctx, updates, onCommit)
Contributor


What is it that absolutely guarantees that this list of receipts, in the order they were submitted, will be committed before any other receipts submitted to this function?

Contributor Author


Thanks, very insightful. I realize that today it is up to the consumer to call this function only once at a time, so this function should handle that case and not allow it to happen.

Contributor


👍 - with the other discussions to make it a synchronous interface, I think you could just handle this as a comment on the API doc:

An insertion ordering guarantee is only provided when this code is called on a single goroutine inside of the connector. It is the responsibility of the connector code to allocate that routine, and ensure that there is only one.
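
For example, something along these lines on the (by then synchronous) method - wording and signature illustrative only:

    type OperationCallbacks interface {
        // BulkOperationUpdates writes the supplied operation updates to the database
        // in a single transaction, in the order supplied, and only returns once that
        // transaction has been committed.
        //
        // An insertion ordering guarantee is only provided when this is called from a
        // single goroutine inside of the connector. It is the responsibility of the
        // connector code to allocate that routine, and to ensure that there is only one.
        BulkOperationUpdates(ctx context.Context, updates []*core.OperationUpdate) error
    }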

Comment on lines 114 to 117
// Notice how this is not using the workers
// The reason is that we want all updates to be stored at once, in this order
// If offloaded to the workers, the updates would be processed in parallel, in different DB transactions, and in a different order
go func() {
Contributor

@peterbroadhurst peterbroadhurst Feb 24, 2025


I do think this is the fundamental design question in this PR, but I'm currently unclear on how this solves it.
This goroutine just gets fired off - whose responsibility is it, and exactly how, to do the ordering against the API in a way that means future batches are strictly ordered against completion of this batch?

Fundamentally - isn't this over-complicated? Shouldn't we just make it synchronous and use the caller's goroutine?

Contributor Author


I thought about making it synchronous, and wanted the connector to be able to wait for a notification back, but you are right: if we are treating this API as "persist and wait until done" in the connector, this should absolutely be synchronous.

Contributor


I think that could be a big simplification overall. Push the responsibility to the connector to:

  1. Manage the goroutine that is the receipt writer
  2. Perform suitable performance batching into this code
  3. Handle all errors/retries
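
A rough sketch of what that could look like on the connector side. All names here are hypothetical, and the BulkOperationUpdates signature is assumed to be the synchronous form discussed above:

    package sketch

    import (
        "context"
        "time"

        "github.com/hyperledger/firefly/pkg/core"
    )

    // The callback the connector is given - assumed synchronous, per the discussion above.
    type bulkUpdater interface {
        BulkOperationUpdates(ctx context.Context, updates []*core.OperationUpdate) error
    }

    // Hypothetical connector-side receipt writer: one goroutine owns batching,
    // retrying, and acking, so ordering is preserved end to end.
    type receiptWriter struct {
        callbacks  bulkUpdater
        nextBatch  func(ctx context.Context) (updates []*core.OperationUpdate, ack func(), err error) // hypothetical source of batched receipts, plus their ack
        retryDelay time.Duration
    }

    func (rw *receiptWriter) loop(ctx context.Context) {
        for {
            updates, ack, err := rw.nextBatch(ctx)
            if err != nil {
                return // stream closed or context cancelled
            }
            // Retry the synchronous bulk write until it succeeds, or the context ends.
            for {
                if err := rw.callbacks.BulkOperationUpdates(ctx, updates); err == nil {
                    break
                }
                if ctx.Err() != nil {
                    return
                }
                time.Sleep(rw.retryDelay)
            }
            // Only ack to the durable event stream once the DB commit has happened.
            ack()
        }
    }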

Contributor Author


Yeah, that massively simplifies this code to a synchronous bulk write of receipts

Contributor Author


There is a caveat to the above, which is that doBatchUpdateWithRetry retries indefinitely. I think we need to break that if we are making it synchronous.

}
} else {
// Less than the max inserts, just do it in one go
err := ou.doBatchUpdateWithRetry(ctx, updates)
Contributor


If you make it synchronous you do not need retry here - the caller can be informed it is their responsibility to retry
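
i.e. roughly this, based on the doBatchUpdateWithRetry snippet above (doBulkUpdate is a hypothetical name for the retry-free variant):

    // Sketch only: run the whole batch in one DB transaction and hand any error
    // straight back to the connector, which owns the retry policy in the
    // synchronous design.
    func (ou *operationUpdater) doBulkUpdate(ctx context.Context, updates []*core.OperationUpdate) error {
        return ou.database.RunAsGroup(ctx, func(ctx context.Context) error {
            return ou.doBatchUpdate(ctx, updates)
        })
    }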


// This retries forever until there is no error
// but returns on cancelled context
if len(updates) > ou.conf.maxInserts {
Contributor

@peterbroadhurst peterbroadhurst Feb 24, 2025


You can choose to push this responsibility back to the connector. I don't think you need to try and do sub-pagination inside of this implementation.
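
If the connector does take that on, the chunking itself is simple enough - a sketch only, where chunkUpdates and maxInserts are hypothetical connector-side names:

    // Sketch: split a large set of updates into bounded batches before each
    // synchronous BulkOperationUpdates call.
    func chunkUpdates(updates []*core.OperationUpdate, maxInserts int) [][]*core.OperationUpdate {
        if maxInserts < 1 {
            return [][]*core.OperationUpdate{updates}
        }
        var batches [][]*core.OperationUpdate
        for len(updates) > maxInserts {
            batches = append(batches, updates[:maxInserts])
            updates = updates[maxInserts:]
        }
        if len(updates) > 0 {
            batches = append(batches, updates)
        }
        return batches
    }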

Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
@EnriqueL8 EnriqueL8 marked this pull request as ready for review February 25, 2025 11:53
@EnriqueL8 EnriqueL8 requested a review from a team as a code owner February 25, 2025 11:53
@codecov

codecov bot commented Feb 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.95%. Comparing base (d484497) to head (456279a).
Report is 22 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1637   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files         337      337           
  Lines       29524    29590   +66     
=======================================
+ Hits        29512    29578   +66     
  Misses          8        8           
  Partials        4        4           


Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
@EnriqueL8
Contributor Author

@peterbroadhurst This is good to re-review. Test coverage should be good as well.

There was a slight doubt about whether this function should restrict and validate the Plugin name inside each operation update. I left it up to the connector plugin to set the right one, but I might add validation there, or make it take a plugin.Named() and use that for all the operation updates.

    return err
}
for _, update := range updates {
    if update.OnComplete != nil {
Contributor


What's the purpose of this in practice now?

Contributor Author

@EnriqueL8 EnriqueL8 Feb 25, 2025


I struggled with this one. It's still available in the core.OperationUpdate struct, but it doesn't serve any purpose here, as we are returning synchronously, so the top-level connector goroutine can do what it needs once the updates have been committed.

So we could remove it, but we need to leave it for the existing flow, and just comment on this flow that it will not be used.

Contributor


Why @EnriqueL8 ?
Because you think there's other stuff on core.OperationUpdate that needs to be on this batch API?

Contributor Author


The rest of the fields in core.OperationUpdate are needed in this batch API, so yes, we could create a new type that has just that subset of fields.

Contributor Author


Thanks @peterbroadhurst for the nudge and suggestion - moved to OperationUpdateAsync and OperationUpdate
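
i.e. roughly this shape - a hedged sketch only, with an illustrative (not exhaustive) field list; see the core package for the final definitions:

    // OperationUpdate carries the data that gets persisted for an operation update.
    type OperationUpdate struct {
        Plugin         string
        NamespacedOpID string
        Status         OpStatus
        BlockchainTXID string
        ErrorMessage   string
        Output         fftypes.JSONObject
    }

    // OperationUpdateAsync wraps it with the in-memory-only OnComplete hook that
    // the existing async worker path uses.
    type OperationUpdateAsync struct {
        OperationUpdate
        OnComplete func()
    }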

Signed-off-by: Enrique Lacal <enrique.lacal@kaleido.io>
Contributor

@peterbroadhurst peterbroadhurst left a comment


Thanks @EnriqueL8 - looks great, and appreciate you going round the loop with me on this

@EnriqueL8 EnriqueL8 merged commit 8a0d5c4 into main Feb 25, 2025
19 checks passed
@EnriqueL8 EnriqueL8 deleted the reliable_receipts branch February 25, 2025 15:57