Skip to content

[CRE-406] (fix): WF Registry Syncer update #17608

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 8, 2025
Merged

[CRE-406] (fix): WF Registry Syncer update #17608

merged 6 commits into from
May 8, 2025

Conversation

justinkaseman
Copy link
Contributor

@justinkaseman justinkaseman commented May 7, 2025

Description

Fixes an issue with WF Registry Syncer updates: during updating a workflow, the old engine could never be shut down.

This happens because the logic was changed to no longer first clean up the previous engine before upserting the new DB record. During an engine shutting down, the config for the engine is retrieved. Because the DB entry is modified to the new workflow before shutting down the old one, the old engine's config couldn't be found anymore, leaving it stuck.

This fix restores removing the old engine before upserting the new DB record.

IMPACT:
This means that there will be a small downtime toward updating workflows (which is the previous behavior before the refactor). Zero downtime will be handled in another feature.

Also covers:

CAPPL-804
This issue didn't show in unit testing. Adds integration testing to protect against further regressions. The integration test now includes updating the workflow.

CAPPL-811
Moves validations to after the DB entry is saved. Currently, failure of these validations would lead to us repeatedly refetching the secrets, even though these errors would normally be unrecoverable.

@justinkaseman justinkaseman requested review from a team as code owners May 7, 2025 02:58
@@ -444,12 +445,6 @@ func (h *eventHandler) createWorkflowSpec(ctx context.Context, payload WorkflowR
return nil, err
}

// Calculate the hash of the binary and config files
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to tryEngineCreate (line 702)

@@ -459,11 +454,6 @@ func (h *eventHandler) createWorkflowSpec(ctx context.Context, payload WorkflowR
}
}

// Pre-check: verify that the workflowID matches; if it doesn't abort and log an error via Beholder.
if !types.WorkflowID(hash).Equal(payload.WorkflowID) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to tryEngineCreate (line 710)

@@ -234,13 +234,26 @@ func (h *Store) GetSecrets(ctx context.Context, secretsURL string, workflowID [3
return nil, fmt.Errorf("failed to fetch secrets from %s : %w", secretsURL, fetchErr)
}

// sanity check by decoding the secrets
Copy link
Contributor Author

@justinkaseman justinkaseman May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved check to separate method below ValidateSecrets. This is called in handler.go func tryEngineCreate (line 715)

Copy link
Contributor

github-actions bot commented May 7, 2025

Flakeguard Summary

Ran new or updated tests between develop and bea6d2544bcdffd25c267bf06d653765fd56d1c9 (CRE-406).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

1 Results
Name Pass Ratio Panicked? Timed Out? Race? Runs Successes Failures Skips Package Package Panicked? Avg Duration Code Owners
Test_workflowRegisteredHandler/same_wf_ID,_different_status 0% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/workflows/syncer false 1.223333333s @smartcontractkit/keystone

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

Copy link
Contributor

@shileiwill shileiwill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handler code LGTM!

if err := h.tryEngineCleanup(payload.WorkflowOwner, payload.WorkflowName); err != nil {
return err
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this is the fix to the bug. we were not cleaning up first, but updating the workflow store, hence cleaning up the old engine instance later is impossible.

shileiwill
shileiwill previously approved these changes May 7, 2025
Copy link
Contributor

@shileiwill shileiwill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@justinkaseman justinkaseman requested review from MStreet3 and ettec May 7, 2025 19:03
@cl-sonarqube-production
Copy link

@@ -147,12 +169,16 @@ func (n ComputeFetcherFactory) NewFetcher(log commonlogger.Logger, emitter custm
}
}

func setupDons(ctx context.Context, t *testing.T, lggr logger.SugaredLogger, workflowURL string, cronSchedule string,
triggerFactory framework.TriggerFactory) *data_feeds_cache.DataFeedsCache {
func setupDons(ctx context.Context, t *testing.T, lggr logger.SugaredLogger, workflowURL string, cronSchedule string, cronSchedule2 string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit this signature is a bit opaque. consider returning a named interface of hooks that happen at during execution points (or name the return values)

Copy link
Contributor Author

@justinkaseman justinkaseman May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I'm going to punt on a refactor like that for now, as this is a hot fix that needs to go out.

@justinkaseman justinkaseman requested a review from MStreet3 May 7, 2025 20:06
@justinkaseman justinkaseman enabled auto-merge May 7, 2025 20:06
@justinkaseman justinkaseman added this pull request to the merge queue May 8, 2025
@@ -68,7 +68,7 @@ func Test_FetchTrueUSDWorkflow(t *testing.T) {
framework.CreateWasmBinary(t, mainFile, wasmFile)

triggerSink := framework.NewTriggerSink(t, "cron-trigger", "1.0.0")
dataFeedsCacheContract := setupDons(ctx, t, lggr, wasmFile, "*/5 * * * * *", triggerSink)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if I understand the existing POR test has been 'hijacked' to test workflow update, fine for the hot fix, but could this be followed up with a test to explicity test workflow update that is seperate from the POR test? (reusing where necessary)

Merged via the queue into develop with commit 0538438 May 8, 2025
135 of 136 checks passed
@justinkaseman justinkaseman deleted the CRE-406 branch May 8, 2025 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants