CRE Operational Events in Engine #17057

patrickhuie19 · 2025-04-01T02:09:20Z

CRE-375

Now that our workflow events have matured, we want to migrate from BaseMessage with runtime-defined KV pairs to pre-defined KVs in events.proto.

This uses Beholder, as CHiP ingress will be included with Beholder when first shipped (Dual Source).

Future Considerations

I kept the domain platform instead of migrating to cre - we will migrate later on.


patrickhuie19 · 2025-04-04T02:58:55Z

core/services/workflows/engine.go

+		platform.KeyDonID, strconv.Itoa(int(nodeState.WorkflowDON.ID)),
+		platform.KeyDonF, strconv.Itoa(int(nodeState.WorkflowDON.F)),
+		platform.KeyDonN, strconv.Itoa(len(nodeState.WorkflowDON.Members)),
+		platform.KeyDonQ, strconv.Itoa(aggregation.ByzantineQuorum(


Note: today we use F + 1. Updating that to ByzantineQuorum here: #17109

Depends on what kind of quorum you mean. We are using ByzQ for consensus/OCR and F+1 for aggregating remote capability responses.


Atrax1 · 2025-04-04T14:22:08Z

core/services/workflows/pb/events.proto

+  int32 donF = 6;
+  int32 donN = 7;
+  int32 donQ = 8;


Sorry, just for my knowledge: what are those fields used for?

F is the manually selected maximum number of faulty/dishonest nodes in the DON. N is the manually selected number of nodes in the DON (For prod, N >= 3F + 1). Q is the quorum size we've calculated in the engine, which is the number of identical requests/responses in the trigger and Don2Don layer we need before considering a trigger or capability request/response valid.
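To make the relationship between F, N, and Q concrete, here is a minimal sketch. It assumes the commonly used Byzantine quorum formula floor((N+F)/2)+1; the helper names are hypothetical, and this is not taken from aggregation.ByzantineQuorum, whose signature and implementation are not shown in this thread.

```go
package main

import "fmt"

// byzantineQuorum returns floor((n+f)/2) + 1, a common definition of a
// Byzantine quorum for n nodes tolerating f faults.
// ASSUMPTION: the engine's aggregation.ByzantineQuorum may differ.
func byzantineQuorum(n, f int) int {
	return (n+f)/2 + 1
}

// fPlusOne returns f + 1: the smallest number of identical responses that
// guarantees at least one came from an honest node.
func fPlusOne(f int) int {
	return f + 1
}

func main() {
	// Example DON with N = 3F + 1.
	n, f := 4, 1
	fmt.Println("ByzQ:", byzantineQuorum(n, f), "F+1:", fPlusOne(f))
}
```

With N = 4 and F = 1 the two thresholds differ (3 versus 2), which is why it matters which kind of quorum the donQ field reports.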

krehermann · 2025-04-04T18:14:29Z

core/services/workflows/engine.go

+	}
+
+	return beholder.GetEmitter().Emit(ctx, b,
+		"beholder_data_schema", schema, // required


Are these magical values? If there is a typo like schema -> scheam, does all sorts of stuff break downstream?

Does the beholder package define these values so that we can import and use them?

> are these magical values? if there is a typo like schema -> scheam does all sort of stuff downstream break?

Yes, they are

ok, what's the reason for copying and pasting rather than referencing them in a beholder defined api?

The values are magical, in that consumers will break, but they aren't stable to the point where I think any of us thought about putting them behind an API. It's a good idea, and I can add a ticket to do that, but I'd propose we not block on that here.
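One way the follow-up ticket could look is sketched below: the attribute keys become named constants, so a typo fails at compile time instead of breaking consumers. The key strings are the ones that appear in this PR; the constant and function names are hypothetical and are not part of any existing beholder API.

```go
package main

import "fmt"

// Hypothetical constants for the attribute keys seen in this PR.
// ASSUMPTION: the beholder package does not necessarily export these.
const (
	AttrDataSchema = "beholder_data_schema"
	AttrDomain     = "beholder_domain"
	AttrEntity     = "beholder_entity"
)

// requiredAttrs builds the alternating key/value list passed to an emitter,
// so callers never spell the keys by hand.
func requiredAttrs(schema, domain, entity string) []any {
	return []any{
		AttrDataSchema, schema,
		AttrDomain, domain,
		AttrEntity, entity,
	}
}

func main() {
	attrs := requiredAttrs("/cre-events-workflow-started/v1", "platform", "pb.WorkflowExecutionStarted")
	fmt.Println(attrs...)
}
```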

krehermann · 2025-04-04T18:15:32Z

core/services/workflows/engine.go

+		schema = "/cre-events-workflow-started/v1"
+		entity = "WorkflowExecutionStarted"
+	case *pb.WorkflowExecutionFinished:
+		schema = "/cre-events-workflow-finished/v1"


the string val here doesn't match the OperationalEventsSchema var val. why?

After I decomposed the events proto so that beholder could process it, I didn't update the events.go helper. I'll update

pkcll · 2025-04-04T18:38:18Z

core/services/workflows/engine.go

+
+	return beholder.GetEmitter().Emit(ctx, b,
+		"beholder_data_schema", schema, // required
+		"beholder_domain", "platform", // required


Should it be the cre domain?

Let's keep platform for now and then migrate everything later

pkcll · 2025-04-04T18:41:53Z

core/services/workflows/pb/events-capability-finished.proto

+
+option go_package = "github.com/smartcontractkit/chainlink/core/services/workflows/pb/";
+
+package pb;


Maybe it makes sense to follow this pattern for naming proto packages: {domain}.{version}, e.g. cre.v1

I think it's verbose at the go pkg level and in the naming schema in Atlas. I understand this means that all events in the platform (eventually cre) domain have to be unique, and I'm happy with that tradeoff.

pkcll · 2025-04-04T18:44:31Z

core/services/workflows/engine.go

+	switch msg.(type) {
+	case *pb.WorkflowExecutionStarted:
+		schema = "/cre-events-workflow-started/v1"
+		entity = "WorkflowExecutionStarted"


Entity should be {pb_package_name}.{message_name}: pb.WorkflowExecutionStarted
Example:

chainlink/core/chains/evm/txm/metrics.go

Lines 130 to 132 in 83dc5cd

"beholder_domain", "svr",
"beholder_entity", "svr.v1.TxMessage",
"beholder_data_schema", "/beholder-tx-message/versions/2",

Maybe it makes sense to give a more meaningful name to the proto package, e.g. platform or cre

@patrickhuie19 https://github.com/smartcontractkit/atlas/blob/30c75a453575a17e7cc54b7fbc57e3987c801bea/beholder/README.md?plain=1#L107-L109

Updated so that entity is prefixed by the proto pkg name: 7d49206
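The resulting lookup can be sketched roughly as below. The two struct types are local stand-ins for the generated pb messages, the schema strings are the ones quoted in the diff earlier in this thread (they were later swapped for the events.go vars), and the value "pb" for the package prefix is an assumption based on the `package pb;` declaration.

```go
package main

import "fmt"

// Stand-ins for the generated proto message types (hypothetical, so the
// sketch compiles without the real pb package).
type WorkflowExecutionStarted struct{}
type WorkflowExecutionFinished struct{}

// ASSUMPTION: mirrors EventsProtoPkg from the diff; the real value is not
// shown in this thread.
const eventsProtoPkg = "pb"

// schemaAndEntity picks the schema path and entity name from the message
// type, then prefixes the entity with the proto package name.
func schemaAndEntity(msg any) (schema, entity string, err error) {
	switch msg.(type) {
	case *WorkflowExecutionStarted:
		schema, entity = "/cre-events-workflow-started/v1", "WorkflowExecutionStarted"
	case *WorkflowExecutionFinished:
		schema, entity = "/cre-events-workflow-finished/v1", "WorkflowExecutionFinished"
	default:
		return "", "", fmt.Errorf("unknown message type: %T", msg)
	}
	return schema, fmt.Sprintf("%s.%s", eventsProtoPkg, entity), nil
}

func main() {
	schema, entity, err := schemaAndEntity(&WorkflowExecutionStarted{})
	fmt.Println(schema, entity, err)
}
```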

shileiwill · 2025-04-04T19:25:50Z

core/services/workflows/engine.go

@@ -676,6 +685,10 @@ func (e *Engine) finishExecution(ctx context.Context, cma custmsg.MessageEmitter
 	}
 	logCustMsg(ctx, cma, fmt.Sprintf("execution duration: %d (seconds)", executionDuration), l)
 	l.Infof("execution duration: %d (seconds)", executionDuration)
+	err = emitExecutionFinishedEvent(ctx, cma, status)


n00b question: how is this emit different from logCustMsg in L677? Do they both go to beholder?

Per the comment "emitProtoMessage marshals a proto.Message and emits it via beholder", this emits to beholder, and cma.Emit(ctx, msg) in logCustMsg seems to go to beholder as well.

They both go through the OTEL pipeline. Eventually, once the CHiP client integration into Beholder is complete, they will both go through the OTEL and CHiP ingresses.

The difference comes from how the protos are handled. The BaseMessage field has a map[string]string of arbitrary KV pairs supplied at runtime, so logCustMsg worked with a struct that would collect those labels (similar to a logger) and then set them on the proto. And, since each custom message emitted used one proto, we could have a typed MessageEmitter interface to handle it.

Here, the approach is from a different direction - we have multiple typed protos that we want to be able to emit a set of known labels for in a repeatable way.

If we wanted to expand that interface to take all protos, we would have to update it to take an any type, which seemed messy to me. @EasterTheBunny gave this a try here: smartcontractkit/chainlink-common#1075
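For readers unfamiliar with the label-collecting style that logCustMsg relies on, here is a minimal sketch of the idea. The `labeler` type below is a hypothetical stand-in, not the real custmsg implementation; only the `With(key, value, ...)` call shape is taken from the diff in this PR.

```go
package main

import "fmt"

// labeler is a hypothetical stand-in for a label-collecting emitter:
// it accumulates arbitrary string KV pairs at runtime, like a logger.
type labeler struct {
	labels map[string]string
}

func newLabeler() labeler {
	return labeler{labels: map[string]string{}}
}

// With returns a copy carrying the extra key/value pairs, leaving the
// receiver unchanged. A trailing key without a value is ignored.
func (l labeler) With(kvs ...string) labeler {
	out := make(map[string]string, len(l.labels)+len(kvs)/2)
	for k, v := range l.labels {
		out[k] = v
	}
	for i := 0; i+1 < len(kvs); i += 2 {
		out[kvs[i]] = kvs[i+1]
	}
	return labeler{labels: out}
}

func main() {
	base := newLabeler().With("workflowID", "abc")
	step := base.With("stepRef", "step-1")
	fmt.Println(len(base.labels), len(step.labels))
}
```

The typed events in this PR take the opposite route: the set of labels is fixed in the proto definition, so nothing is collected at runtime beyond the field values.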

shileiwill · 2025-04-04T19:50:15Z

core/services/workflows/engine.go

@@ -997,13 +1010,32 @@ func (e *Engine) executeStep(ctx context.Context, lggr logger.Logger, msg stepRe
 	defer cancel()

 	e.metrics.with(platform.KeyCapabilityID, curStep.ID).incrementCapabilityInvocationCounter(ctx)
-	output, err := curStep.capability.Execute(stepCtx, tr)
+	err = emitCapabilityStartedEvent(ctx, e.cma, curStep.ID, msg.stepRef)


I think it would be helpful to emit the payload sent to the capability (inputsMap)? Or is there some concern that keeps us from logging it?

We could do it with the values.Map type now, but I'm not sure that approach is future proof for the no-dag SDK. How would a front end parse that data? Each capability response will be typed, so to allow a front end to parse that, we would have to register a new proto to the data platform. The inputs/outputs within a capability are also sensitive to data governance issues that high level metadata like status are not.


github-actions · 2025-04-04T20:22:36Z

Flakeguard Summary

Ran new or updated tests between develop and 7d49206 (CRE-375/operational-events).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

1 Results

| Name | Pass Ratio | Panicked? | Timed Out? | Race? | Runs | Successes | Failures | Skips | Package | Package Panicked? | Avg Duration | Code Owners |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TestEngine_ConcurrentExecutions | 0% | false | false | false | 3 | 0 | 3 | 0 | github.com/smartcontractkit/chainlink/v2/core/services/workflows | false | 20ms | @smartcontractkit/keystone |

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

pkcll · 2025-04-04T20:22:50Z

core/services/workflows/engine.go

@@ -1550,6 +1550,9 @@ func emitProtoMessage(ctx context.Context, msg proto.Message) error {
 		return fmt.Errorf("unknown message type: %T", msg)
 	}

+	// entity must be prefixed with the proto package name
+	entity = fmt.Sprintf("%s.%s", EventsProtoPkg, entity)


cl-sonarqube-production · 2025-04-04T20:36:05Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

shileiwill · 2025-04-04T22:41:11Z

core/services/workflows/engine.go

+
+	if donIDStr, ok := kvs[platform.KeyDonID]; ok {
+		if id, err := strconv.ParseInt(donIDStr, 10, 32); err == nil {
+			m.DonF = int32(id)


m.DonID = int32(id)
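The suggestion above fixes a copy-paste slip: the value parsed from the DON ID label was being written to DonF. A minimal sketch of the corrected parsing follows; the struct and the key strings are hypothetical stand-ins for the generated message metadata and the platform.Key* constants.

```go
package main

import (
	"fmt"
	"strconv"
)

// donFields is a hypothetical stand-in for the DON fields on the event metadata.
type donFields struct {
	DonID int32
	DonF  int32
}

// ASSUMPTION: placeholder key strings; the real values of
// platform.KeyDonID and platform.KeyDonF are not shown in this thread.
const (
	keyDonID = "donID"
	keyDonF  = "donF"
)

// parseDonFields copies numeric labels into their matching typed fields.
// Missing or unparsable labels leave the field at zero.
func parseDonFields(kvs map[string]string) donFields {
	var m donFields
	if s, ok := kvs[keyDonID]; ok {
		if id, err := strconv.ParseInt(s, 10, 32); err == nil {
			m.DonID = int32(id) // DonID, not DonF
		}
	}
	if s, ok := kvs[keyDonF]; ok {
		if f, err := strconv.ParseInt(s, 10, 32); err == nil {
			m.DonF = int32(f)
		}
	}
	return m
}

func main() {
	fmt.Printf("%+v\n", parseDonFields(map[string]string{keyDonID: "5", keyDonF: "1"}))
}
```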

bolekk · 2025-04-22T20:20:50Z

core/services/workflows/engine.go

@@ -1355,7 +1387,22 @@ func NewEngine(ctx context.Context, cfg Config) (engine *Engine, err error) {
 		return nil, fmt.Errorf("could not initialize monitoring resources: %w", err)
 	}

-	cma := custmsg.NewLabeler().With(platform.KeyWorkflowID, cfg.WorkflowID, platform.KeyWorkflowOwner, cfg.WorkflowOwner, platform.KeyWorkflowName, cfg.WorkflowName.String())
+	nodeState, err := cfg.Registry.LocalNode(ctx)


Just noticed this line. In init() we retrieve this value with retries in case the Registry is not ready. I don't think we have a guarantee that it will be ready here... cc @cedric-cordenier

patrickhuie19 added 2 commits March 31, 2025 22:08

WIP: first pass on proto definitions

5f8cc62

2nd pass of proto definitions

0087299

patrickhuie19 changed the title ~~WIP: first pass on proto definitions~~ WIP: CRE Operational Events in Engine Apr 1, 2025

patrickhuie19 added 2 commits April 2, 2025 12:35

adding stepRef to capability events

28748c7

adding generate command, and events2.proto for comparison with oneof …

0afe074

…approach

patrickhuie19 force-pushed the CRE-375/operational-events branch from d69c50d to 0afe074 Compare April 2, 2025 17:03

adding operational event coverage to engine

f80e72a

patrickhuie19 changed the title ~~WIP: CRE Operational Events in Engine~~ CRE Operational Events in Engine Apr 3, 2025

patrickhuie19 added 3 commits April 3, 2025 19:54

adding labels to cma

77651fc

Merge branch 'develop' into CRE-375/operational-events

c177f8e

cleaning up comments

79786d9

patrickhuie19 commented Apr 4, 2025

oops; adding generated proto file

2ad5421

patrickhuie19 requested review from Atrax1 and bolekk April 4, 2025 03:05

patrickhuie19 and others added 3 commits April 3, 2025 23:14

lint

da1022a

lint; fixing err

224f090

Run make generate.

d70fef9

krehermann reviewed Apr 4, 2025

krehermann reviewed Apr 4, 2025

krehermann reviewed Apr 4, 2025

Atrax1 reviewed Apr 4, 2025

addressing comments

dc573d5

pkcll self-requested a review April 4, 2025 15:20

patrickhuie19 added 2 commits April 4, 2025 12:57

separated out protos

1ff5ca0

determine entity + schema based on proto message type

9ef95b0

patrickhuie19 marked this pull request as ready for review April 4, 2025 17:44

patrickhuie19 requested review from a team as code owners April 4, 2025 17:44

krehermann reviewed Apr 4, 2025

using events.go vars

8c85b70

pkcll reviewed Apr 4, 2025

pkcll previously approved these changes Apr 4, 2025

vyzaldysanchez previously approved these changes Apr 4, 2025

shileiwill reviewed Apr 4, 2025

updating entity name to include proto pkg prefix

7d49206

patrickhuie19 dismissed stale reviews from vyzaldysanchez and pkcll via 7d49206 April 4, 2025 20:08

vyzaldysanchez previously approved these changes Apr 4, 2025

pkcll reviewed Apr 4, 2025

fixing test

d0318c4

patrickhuie19 dismissed vyzaldysanchez’s stale review via d0318c4 April 4, 2025 20:23

pkcll approved these changes Apr 4, 2025

vyzaldysanchez enabled auto-merge April 4, 2025 21:25

vyzaldysanchez disabled auto-merge April 4, 2025 21:25

vyzaldysanchez approved these changes Apr 4, 2025

patrickhuie19 added this pull request to the merge queue Apr 4, 2025

Merged via the queue into develop with commit d489f42 Apr 4, 2025
191 of 192 checks passed

patrickhuie19 deleted the CRE-375/operational-events branch April 4, 2025 21:49

shileiwill reviewed Apr 4, 2025

nanchano mentioned this pull request Apr 8, 2025

Platform Schemas deployment #17192

Draft

bolekk reviewed Apr 22, 2025

