Refactor deploying image logic #346


Open
wants to merge 7 commits into
base: main

Contributor

@Haivilo Haivilo commented Apr 12, 2025

TODO

Fix integration tests

Summary of Changes

  1. Image Mapping Changed from Per-Function to Per-Workflow

    • The deployment logic was simplified to store a single image URI per workflow_instance_id instead of per function.
  2. Introduced Generic Lambda Handler

    • Added generic_handler.py, a dynamic entrypoint that routes invocations based on the target field in the event payload.
  3. Payload Format Enhancement

    • Payloads now include a top-level "target" field indicating the function name to invoke, to support the generic Lambda handler.
  4. Minor Enhancements

    • Docker image names now use workflow_instance_id instead of full function name.
    • _copy_image_to_region was renamed to _copy_image_if_not_exists and now avoids redundant ECR operations when the image exists in the same region.
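
As a rough illustration of items 2 and 3, a single entrypoint that routes on the top-level "target" field might look like the sketch below. The `FUNCTIONS` registry, the `register` decorator, the `ENTRY_POINT` constant, and the `resize`/`classify` functions are all hypothetical stand-ins; this is not the actual content of generic_handler.py, only one way the routing idea could be expressed.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Any], Any]

# Hypothetical registry standing in for the workflow's registered functions.
FUNCTIONS: Dict[str, Handler] = {}
# Hypothetical name of the workflow's entry point function.
ENTRY_POINT = "resize"


def register(name: str) -> Callable[[Handler], Handler]:
    """Register a function under the name used in the "target" field."""
    def decorator(func: Handler) -> Handler:
        FUNCTIONS[name] = func
        return func
    return decorator


@register("resize")
def resize(event: Any) -> Dict[str, Any]:
    return {"step": "resize", "input": event}


@register("classify")
def classify(event: Any) -> Dict[str, Any]:
    return {"step": "classify", "input": event}


def generic_handler(event: Any, context: Any = None) -> Any:
    """Single Lambda entrypoint: dispatch on the top-level "target" field."""
    target = event.get("target") if isinstance(event, dict) else None
    if target is None:
        # Assumed behavior: fall back to the entry point when no target is given.
        target = ENTRY_POINT
    if target not in FUNCTIONS:
        raise KeyError(f"No function registered for target {target!r}")
    return FUNCTIONS[target](event)
```

With this shape, every function in the workflow can share one image and one handler, and the payload `{"target": "classify", ...}` is enough to select which function runs.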

@engshahrad engshahrad requested a review from Danidite April 15, 2025 19:11
@engshahrad engshahrad added the enhancement New feature or request label Apr 15, 2025
@engshahrad
Contributor

@Haivilo for the sake of documentation, can you provide details on performance gains as a result of this change in a comment?

@Danidite and @vGsteiger can you engage with Steve to fix rough edges and bring the new capability to the project? Thanks

@Haivilo
Contributor Author

Haivilo commented Apr 15, 2025

Hi @engshahrad, the build-time improvement grows with the number of functions, reaching a ~50% reduction for a workflow with 5 functions. This was tested locally. Here is the build-time improvement for an example workflow:
[image: output]

As for migration speed, I will discuss it with @Danidite and get back to you. I was unable to retrieve those results efficiently.

Haivilo added 4 commits May 8, 2025 14:38
…ing and function registration

- Changed `_copy_image_to_region` to `_copy_image_if_not_exists` and modified logic
- Added logic to skip image copying if it already exists in the current region.
- Updated Docker image naming to use `workflow_instance_id`, so every lambda function uses the same docker image.
- Introduced `set_wrapped_function` method in `CaribouFunction` to register wrapped functions.
- Enhanced `CaribouWorkflow` to pass `successor_function_name` in various methods for better tracking.
- Added a new `generic_handler.py` to handle dynamic function routing in Lambda.
- Updated deployment packager to include the new generic handler.
- Adjusted tests to reflect changes in image copying logic.
@Haivilo Haivilo force-pushed the feat-common-image-fix-geo branch from b1eb883 to f0b2c81 May 9, 2025 04:22
Contributor

@Danidite Danidite left a comment

Overall it looks very good, great work! However, I do have a few minor comments and concerns.

workflow = app.workflow

# Get payload and target function
payload = _get_payload(event)
Contributor

@Danidite Danidite May 20, 2025

It seems the result of this function is never used: "result = target_function(event)" reads the original event. Is the target function supposed to take the original event or the parsed payload?

# Get payload and target function
payload = _get_payload(event)
target_function_name = event.get("target") if isinstance(event, dict) else None
target_function, func_name = _find_target_function(workflow, target_function_name)
Contributor

The function name is also never used here. Is this intentional? If so, maybe _find_target_function(...) should return only the target_function, since the name is not used anywhere else.

target_function_name = event.get("target") if isinstance(event, dict) else None
target_function, func_name = _find_target_function(workflow, target_function_name)

_, _ = payload, func_name # Unused variables, for now disabled to run tests
Contributor

@Danidite Danidite May 20, 2025

Make sure to remove this after the above fixes/changes; I temporarily added it to pass the compliance tests.

parts = deployed_image_uri.split("/")
original_region = parts[0].split(".")[3]
original_image_name = parts[1]

ecr_client = self._client("ecr")
new_region = ecr_client.meta.region_name
if new_region == original_region:
    logger.info("Image already exists in the %s region, skipping copy", new_region)
    return deployed_image_uri
Contributor

If I understand correctly, all this checks is whether the new region is the same as the original region (the home region, I assume). So in scenarios where this is not the first re-deployment, would it still perform redundant ECR operations? If so, would a better approach be to check whether the image already exists in ECR via a boto3 call?
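
One possible shape for such a check is sketched below, assuming the image URI has the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>` form and that `ecr_client` is a boto3 ECR client for the target region. The helper names `parse_ecr_uri` and `image_exists` are hypothetical and not part of this PR; `describe_images` and the two exception classes are the standard boto3 ECR client members. Digest references (`@sha256:...`) are not handled here.

```python
from typing import Any, Tuple


def parse_ecr_uri(image_uri: str) -> Tuple[str, str, str]:
    """Split an ECR image URI into (region, repository, tag)."""
    registry, _, remainder = image_uri.partition("/")
    # Registry host: <account>.dkr.ecr.<region>.amazonaws.com
    region = registry.split(".")[3]
    repo, _, tag = remainder.rpartition(":")
    if not repo:
        # No ":" in the remainder, so there is no explicit tag.
        repo, tag = remainder, "latest"
    return region, repo, tag


def image_exists(ecr_client: Any, repo: str, tag: str) -> bool:
    """Return True if repo:tag is present in the client's region."""
    try:
        ecr_client.describe_images(
            repositoryName=repo,
            imageIds=[{"imageTag": tag}],
        )
        return True
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # Either the tag or the whole repository is missing in this region.
        return False
```

A copy routine could then call `image_exists` for the target region first and skip the pull/tag/push sequence whenever it returns True, regardless of which region the image originally came from.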

for func_name, caribou_func in workflow.functions.items():
if caribou_func.name == target_name:
return caribou_func.wrapped_function, func_name
else:
Contributor

@Danidite Danidite May 20, 2025

Honestly, this else branch, which defaults to calling the entry point when no target_name is provided, seems potentially risky for future maintenance. If the target is somehow lost in a future change, it may result in infinite loops (if they also bypass our other failsafes). A better approach might be to modify the invoker (client CLI) so that even for the first function, it must explicitly specify the target_name of the entry_point function. Then this code can simply raise an error and terminate if target_name is missing. This doesn't need to be implemented in this PR; just create a new issue for the change.
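
A minimal sketch of that stricter lookup follows. `FakeFunction` and `FakeWorkflow` are hypothetical stand-ins that model only the `name`, `wrapped_function`, and `functions` attributes visible in the snippet above, and `find_target_function_strict` is an illustrative name rather than code from this PR. It also returns only the function, in line with the earlier comment about the unused name.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class FakeFunction:
    """Hypothetical stand-in for CaribouFunction (only the fields used here)."""
    name: str
    wrapped_function: Callable[[Any], Any]


@dataclass
class FakeWorkflow:
    """Hypothetical stand-in for CaribouWorkflow."""
    functions: Dict[str, FakeFunction]


def find_target_function_strict(
    workflow: FakeWorkflow, target_name: Optional[str]
) -> Callable[[Any], Any]:
    """Return the wrapped function for target_name, or fail loudly."""
    if not target_name:
        # No silent fallback to the entry point: a missing target is an error.
        raise ValueError("Event is missing the required 'target' field")
    for caribou_func in workflow.functions.values():
        if caribou_func.name == target_name:
            return caribou_func.wrapped_function
    raise ValueError(f"No function named {target_name!r} in this workflow")
```

Failing fast here means a dropped "target" field surfaces as an immediate error instead of a repeated invocation of the entry point.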

Labels
enhancement New feature or request
3 participants