Skip to content

CRE Operational Events in Engine #17057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Apr 4, 2025
Merged

Conversation

patrickhuie19
Copy link
Contributor

@patrickhuie19 patrickhuie19 commented Apr 1, 2025

CRE-375

Now that our workflow events have matured, we want to migrate from using BaseMessage with runtime defined KV pairs, to pre-defined KVs in events.proto.

This uses Beholder as CHiP ingress will be included with Beholder when first shipped (Dual Source)

Future Considerations

  • I kept the domain platform instead of migrating to cre - we will migrate later on.

@patrickhuie19 patrickhuie19 changed the title WIP: first pass on proto definitions WIP: CRE Operational Events in Engine Apr 1, 2025
@patrickhuie19 patrickhuie19 force-pushed the CRE-375/operational-events branch from d69c50d to 0afe074 Compare April 2, 2025 17:03
@patrickhuie19 patrickhuie19 changed the title WIP: CRE Operational Events in Engine CRE Operational Events in Engine Apr 3, 2025
platform.KeyDonID, strconv.Itoa(int(nodeState.WorkflowDON.ID)),
platform.KeyDonF, strconv.Itoa(int(nodeState.WorkflowDON.F)),
platform.KeyDonN, strconv.Itoa(len(nodeState.WorkflowDON.Members)),
platform.KeyDonQ, strconv.Itoa(aggregation.ByzantineQuorum(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: today we use F + 1. Updating that to ByzantineQuroum here: #17109

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends what kinds of quorum you mean. We are using ByzQ for consensus/OCR and F+1 for aggregating remote capability responses.

@patrickhuie19 patrickhuie19 requested review from Atrax1 and bolekk April 4, 2025 03:05
Comment on lines 13 to 15
int32 donF = 6;
int32 donN = 7;
int32 donQ = 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just for my knowledge what is those fields used for ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F is the manually selected maximum number of faulty/dishonest nodes in the DON. N is the manually selected number of nodes in the DON (For prod, N >= 3F + 1). Q is the quorum size we've calculated in the engine, which is the number of identical requests/responses in the trigger and Don2Don layer we need before considering a trigger or capability request/response valid.

@pkcll pkcll self-requested a review April 4, 2025 15:20
@patrickhuie19 patrickhuie19 marked this pull request as ready for review April 4, 2025 17:44
@patrickhuie19 patrickhuie19 requested review from a team as code owners April 4, 2025 17:44
}

return beholder.GetEmitter().Emit(ctx, b,
"beholder_data_schema", schema, // required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these magical values? if there is a typo like schema -> scheam does all sort of stuff downstream break?

does the behold package define the values and import and use them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these magical values? if there is a typo like schema -> scheam does all sort of stuff downstream break?

Yes, they are

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, what's the reason for copying and pasting rather than referencing them in a beholder defined api?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are magical, in that consumers will break, but they aren't stable to the point where I think any of us thought about putting them behind an API. It's a good idea, and I can add a ticket to do that, but I'd propose we not block on that here.

schema = "/cre-events-workflow-started/v1"
entity = "WorkflowExecutionStarted"
case *pb.WorkflowExecutionFinished:
schema = "/cre-events-workflow-finished/v1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the string val here doesn't match the OperationalEventsSchema var val. why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I decomposed the events proto so that beholder could process it, I didn't update the events.go helper. I'll update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


return beholder.GetEmitter().Emit(ctx, b,
"beholder_data_schema", schema, // required
"beholder_domain", "platform", // required
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be cre domain ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep platform for now and then migrate everything later


option go_package = "github.com/smartcontractkit/chainlink/core/services/workflows/pb/";

package pb;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to follow this pattern for naming proto packages {domain}.{version} e.g cre.v1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its verbose at the go pkg level and in the naming schema in Atlas. I understand this means that all events in the platform (eventually cre) domain have to be unique, and I'm happy with that tradeoff.

switch msg.(type) {
case *pb.WorkflowExecutionStarted:
schema = "/cre-events-workflow-started/v1"
entity = "WorkflowExecutionStarted"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entity should be {pb_package_name}.{message_mane}: pb.WorkflowExecutionStarted
Example

"beholder_domain", "svr",
"beholder_entity", "svr.v1.TxMessage",
"beholder_data_schema", "/beholder-tx-message/versions/2",

Copy link
Collaborator

@pkcll pkcll Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to give a more meaningful name to the proto package e.g
platform or cre

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated so that entity is prefixed by the proto pkg name: 7d49206

pkcll
pkcll previously approved these changes Apr 4, 2025
vyzaldysanchez
vyzaldysanchez previously approved these changes Apr 4, 2025
@@ -676,6 +685,10 @@ func (e *Engine) finishExecution(ctx context.Context, cma custmsg.MessageEmitter
}
logCustMsg(ctx, cma, fmt.Sprintf("execution duration: %d (seconds)", executionDuration), l)
l.Infof("execution duration: %d (seconds)", executionDuration)
err = emitExecutionFinishedEvent(ctx, cma, status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n00b question, how is this emit different than logCustMsg in L677? they both go to beholder?

Per emitProtoMessage marshals a proto.Message and emits it via beholder. this emits to beholder, and cma.Emit(ctx, msg) in logCustMsg seems to beholder as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They both go through the OTEL pipeline. Eventually once CHiP client integration into Beholder is complete, they will both go through the OTEL and CHiP ingresses.

The difference comes from how the protos are handled. The BaseMessage field has a map[string]string of arbitrary KVs pairs supplied at runtime, so the logCustMsg worked with a struct that would collect those labels (similar to a logger), and then be able to set those on the proto. And, since each custom message emitted used one proto, we could have a typed MessageEmitter interface to handle it.

Here, the approach is from a different direction - we have multiple typed protos that we want to be able to emit a set of known labels for in a repeatable way.

If we wanted to expand that interface to take all protos, we would have to update it to take an any type, which seemed messy to me. @EasterTheBunny gave this a try here: smartcontractkit/chainlink-common#1075

@@ -997,13 +1010,32 @@ func (e *Engine) executeStep(ctx context.Context, lggr logger.Logger, msg stepRe
defer cancel()

e.metrics.with(platform.KeyCapabilityID, curStep.ID).incrementCapabilityInvocationCounter(ctx)
output, err := curStep.capability.Execute(stepCtx, tr)
err = emitCapabilityStartedEvent(ctx, e.cma, curStep.ID, msg.stepRef)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is helpful to emit the payload to the cap, inputsMap? Maybe some concern on this so we are not logging?

Copy link
Contributor Author

@patrickhuie19 patrickhuie19 Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do it with the values.Map type now, but I'm not sure that approach is future proof for the no-dag SDK. How would a front end parse that data? Each capability response will be typed, so to allow a front end to parse that, we would have to register a new proto to the data platform. The inputs/outputs within a capability are also sensitive to data governance issues that high level metadata like status are not.

@patrickhuie19 patrickhuie19 dismissed stale reviews from vyzaldysanchez and pkcll via 7d49206 April 4, 2025 20:08
vyzaldysanchez
vyzaldysanchez previously approved these changes Apr 4, 2025
Copy link
Contributor

github-actions bot commented Apr 4, 2025

Flakeguard Summary

Ran new or updated tests between develop and 7d49206 (CRE-375/operational-events).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

1 Results
Name Pass Ratio Panicked? Timed Out? Race? Runs Successes Failures Skips Package Package Panicked? Avg Duration Code Owners
TestEngine_ConcurrentExecutions 0% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/workflows false 20ms @smartcontractkit/keystone

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

@@ -1550,6 +1550,9 @@ func emitProtoMessage(ctx context.Context, msg proto.Message) error {
return fmt.Errorf("unknown message type: %T", msg)
}

// entity must be prefixed with the proto package name
entity = fmt.Sprintf("%s.%s", EventsProtoPkg, entity)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@cl-sonarqube-production
Copy link

@vyzaldysanchez vyzaldysanchez enabled auto-merge April 4, 2025 21:25
@vyzaldysanchez vyzaldysanchez disabled auto-merge April 4, 2025 21:25
@patrickhuie19 patrickhuie19 added this pull request to the merge queue Apr 4, 2025
Merged via the queue into develop with commit d489f42 Apr 4, 2025
191 of 192 checks passed
@patrickhuie19 patrickhuie19 deleted the CRE-375/operational-events branch April 4, 2025 21:49

if donIDStr, ok := kvs[platform.KeyDonID]; ok {
if id, err := strconv.ParseInt(donIDStr, 10, 32); err == nil {
m.DonF = int32(id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m.DonID = int32(id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1355,7 +1387,22 @@ func NewEngine(ctx context.Context, cfg Config) (engine *Engine, err error) {
return nil, fmt.Errorf("could not initialize monitoring resources: %w", err)
}

cma := custmsg.NewLabeler().With(platform.KeyWorkflowID, cfg.WorkflowID, platform.KeyWorkflowOwner, cfg.WorkflowOwner, platform.KeyWorkflowName, cfg.WorkflowName.String())
nodeState, err := cfg.Registry.LocalNode(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this line. In init() we retrieve this value with retries in case the Registry is not ready. I don't think we have a guarantee that it will be ready here... cc @cedric-cordenier

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants