feat(wanda): implement artifact extraction from built images #408

Closed
andrew-anyscale wants to merge 2 commits into andrew/revup/main/wanda-artifact-spec from andrew/revup/main/wanda-artifact-extract

@andrew-anyscale
Contributor

Implement artifact extraction using Docker with optimized batching. All artifacts are extracted in a single docker run command for efficiency, using a generated shell script and a volume mount.

  • Uses Docker (not Crane) because Docker can extract specific files, while Crane must export the entire filesystem as a tarball
  • A single container is spawned regardless of artifact count
  • Shell expansion handles glob patterns natively
  • Optional artifacts use an || echo fallback so that a failed copy does not fail the build
  • Extraction runs for the root spec only, and still runs on a cache hit

Adds ArtifactsDir to ForgeConfig (defaults to /tmp/artifacts in RayCI mode).
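
For readers who want to picture the batching described above, here is a minimal sketch of how one script and one docker run invocation could cover every artifact. It is not the code from this PR: the Artifact field names, buildScript, dockerArgs, and the /artifacts mount path are assumptions for illustration, and this version quotes every path, so it does not cover glob sources.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Artifact holds the fields named in this PR; the Go field names are assumed.
type Artifact struct {
	Src      string
	Dst      string
	Optional bool
}

// shellQuote wraps s in single quotes so the shell treats it literally.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildScript emits the commands for every artifact so that one
// container run can copy them all. mount is a hypothetical path
// where the host artifacts directory is mounted in the container.
func buildScript(artifacts []Artifact, mount string) string {
	lines := []string{"set -e"}
	for _, a := range artifacts {
		dst := path.Join(mount, a.Dst)
		lines = append(lines, "mkdir -p "+shellQuote(path.Dir(dst)))
		cp := fmt.Sprintf("cp -r %s %s", shellQuote(a.Src), shellQuote(dst))
		if a.Optional {
			// A failed copy prints a warning instead of aborting the script.
			cp += " || echo " + shellQuote("warning: optional artifact missing: "+a.Src)
		}
		lines = append(lines, cp)
	}
	return strings.Join(lines, "\n") + "\n"
}

// dockerArgs assembles the arguments for a single `docker run`.
func dockerArgs(image, hostDir, mount, script string) []string {
	return []string{
		"run", "--rm",
		"-v", hostDir + ":" + mount,
		"--entrypoint", "/bin/sh",
		image, "-c", script,
	}
}

func main() {
	script := buildScript([]Artifact{
		{Src: "/etc/os-release", Dst: "os-release"},
		{Src: "/opt/extra.txt", Dst: "extra/extra.txt", Optional: true},
	}, "/artifacts")
	fmt.Print(script)
	args := dockerArgs("example:latest", "/tmp/artifacts", "/artifacts", script)
	fmt.Println("docker", strings.Join(args[:7], " "), "-c <script>")
}
```

Because the script starts with set -e, a failed copy of a required artifact stops the run with a non-zero exit status, while the || echo suffix keeps an optional artifact from doing so.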

Topic: wanda-artifact-extract
Relative: wanda-artifact-spec
Labels: draft

Signed-off-by: andrew <andrew@anyscale.com>

Add support for defining artifacts to extract from built container images. The Artifact struct specifies:
- src: path inside container (supports globs and variable expansion)
- dst: destination path on host (relative to artifacts dir)
- optional: if true, extraction failure warns instead of failing build

Topic: wanda-artifact-spec
Labels: draft

Signed-off-by: andrew <andrew@anyscale.com>
@andrew-anyscale
Contributor Author

Reviews in this chain:
#407 feat(wanda): add Artifact struct to spec for extraction
 └#408 feat(wanda): implement artifact extraction from built images
  └#409 feat(wanda): add artifacts_dir CLI flag
   └#410 feat(raycicmd): add artifacts_dir to wanda steps

@andrew-anyscale
Contributor Author

# head base diff date summary
0 ae920eeb 7699d792 diff Feb 3 13:24 PM 14 files changed, 852 insertions(+), 2 deletions(-)

@gemini-code-assist
Contributor

Summary of Changes

Hello @andrew-anyscale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and efficient mechanism for extracting artifacts from Docker images after they have been built. By utilizing a single Docker container execution with a dynamically generated shell script, it streamlines the process of retrieving specific files or patterns from the image's filesystem to a designated host directory. This enhancement provides greater control and flexibility for managing build outputs, ensuring that necessary components are readily available for subsequent steps in the development pipeline.

Highlights

  • Artifact Extraction: Implemented a new feature to extract specified artifacts from built Docker images using an optimized Docker run command.
  • Docker-based Extraction: The extraction process leverages Docker directly, generating a shell script to copy files from the image to a host-mounted volume, which is more efficient for specific file extraction compared to exporting entire filesystems.
  • Flexible Artifact Specification: The system supports glob patterns for source paths and allows marking artifacts as optional, preventing build failures if an optional file is not found.
  • Root Spec Only Extraction: Artifact extraction is performed exclusively for the root specification, even when dealing with cached builds or dependent images.
  • Configuration Update: A new ArtifactsDir field has been added to ForgeConfig to specify the destination directory for extracted artifacts.

Changelog
  • wanda/docker_cmd.go
    • Added runExtract function to execute a shell script within a Docker container, facilitating artifact extraction.
  • wanda/docker_cmd_test.go
    • Added TestDockerCmdRunExtract and TestDockerCmdRunExtract_bestEffort to verify the functionality of the new runExtract method, including glob pattern handling and optional file behavior.
  • wanda/extract.go
    • New file introduced to house the core logic for building the shell script used in artifact extraction, including functions for path resolution (resolveDst), glob character detection (hasGlobChars), and shell quoting (shellQuote).
  • wanda/extract_test.go
    • New file containing comprehensive unit tests for the artifact extraction script generation and path resolution logic, covering various scenarios like absolute/relative paths, globbing, optional artifacts, and path escape prevention.
  • wanda/forge.go
    • Integrated artifact extraction into the main Build flow, ensuring that artifacts are extracted from the root spec if ArtifactsDir is configured.
    • Added ExtractArtifacts method to the Forge struct, encapsulating the artifact extraction process.
  • wanda/forge_config.go
    • Added ArtifactsDir field to the ForgeConfig struct to allow users to specify the output directory for extracted artifacts.
  • wanda/forge_test.go
    • Added several integration tests (TestBuild_WithArtifacts_exact, TestBuild_WithArtifacts_glob, TestBuild_WithArtifacts_optional, TestBuild_WithArtifacts_rootOnly, TestBuild_WithArtifacts_cacheHit) to validate the end-to-end artifact extraction feature under different conditions.
  • wanda/testdata/Dockerfile.artifact
    • New test Dockerfile to create a sample image with various files for artifact extraction testing.
  • wanda/testdata/Dockerfile.artifact-top
    • New test Dockerfile used in multi-stage build scenarios for artifact extraction testing.
  • wanda/testdata/artifact-dep-base.wanda.yaml
    • New test Wanda spec defining a base image with artifacts for dependency testing.
  • wanda/testdata/artifact-dep-top.wanda.yaml
    • New test Wanda spec defining a top-level image that depends on artifact-dep-base, used to verify root-only artifact extraction.
  • wanda/testdata/artifact-exact.wanda.yaml
    • New test Wanda spec demonstrating exact file artifact extraction.
  • wanda/testdata/artifact-glob.wanda.yaml
    • New test Wanda spec demonstrating artifact extraction using glob patterns.
  • wanda/testdata/artifact-optional.wanda.yaml
    • New test Wanda spec demonstrating the extraction of optional artifacts.
Activity
  • This pull request introduces a new feature for artifact extraction, complete with a dedicated extract.go file for core logic and extensive unit and integration tests to ensure correctness and robustness.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a feature to extract artifacts from built Docker images. The implementation uses a single docker run command with a generated shell script to handle batch extraction efficiently. The logic for resolving destination paths and handling glob patterns is well-contained in the new extract.go file, and it is accompanied by a comprehensive set of tests.

My review has identified a critical command injection vulnerability in the script generation logic. I've also included a few medium-severity suggestions to improve code portability in tests and documentation clarity. Overall, this is a solid implementation of a useful feature, pending the resolution of the security issue.

Comment on lines +41 to +45
cmd := fmt.Sprintf("mkdir -p %s && cp -r %s %s",
    shellQuote(mkdirPath),
    srcParam,
    shellQuote(dst),
)

security-critical critical

There is a command injection vulnerability here. If a.Src contains glob characters, it is passed unquoted and unsanitized to the shell script. A malicious a.Src like *; rm -rf / could lead to arbitrary command execution inside the container.

Since a.Src comes from a YAML file, it should be treated as untrusted user input.

To fix this, you should validate a.Src to ensure it doesn't contain any shell metacharacters other than the intended glob patterns when hasGlobChars is true.

For example, you could add a validation function:

func isValidGlob(path string) bool {
    // This is a basic check. A more robust validation might be needed.
    if strings.ContainsAny(path, ";|&()`\n$") {
        return false
    }
    return true
}

And then use it in buildExtractionScript before constructing the command.
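
One way the suggested check could be wired in is sketched below. validateSrc is a hypothetical helper, and hasGlobChars is a stand-in for the PR's function of the same name; only isValidGlob follows the snippet in this comment.

```go
package main

import (
	"fmt"
	"strings"
)

// hasGlobChars is a stand-in for the PR's helper of the same name.
func hasGlobChars(s string) bool {
	return strings.ContainsAny(s, "*?[")
}

// isValidGlob is the basic deny-list check from the review comment.
func isValidGlob(path string) bool {
	return !strings.ContainsAny(path, ";|&()`\n$")
}

// validateSrc is a hypothetical pre-check to call before a source is
// written into the script. Only glob sources are checked, since those
// are the ones passed to the shell unquoted.
func validateSrc(src string) error {
	if hasGlobChars(src) && !isValidGlob(src) {
		return fmt.Errorf("artifact src %q contains shell metacharacters", src)
	}
	return nil
}

func main() {
	for _, src := range []string{"/opt/dist/*.whl", "*; rm -rf /"} {
		fmt.Printf("%q -> %v\n", src, validateSrc(src))
	}
}
```

As the comment in the reviewer's snippet says, this deny list is basic. It does not reject whitespace, quotes, or the redirection characters < and >, so an allow-list of permitted characters may be a stricter alternative.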

t.Fatalf("runExtract: %v", err)
}

if _, err := os.Stat(tmpDir + "/etc/alpine-release"); os.IsNotExist(err) {

medium

Using + to concatenate path components is not portable and can lead to issues on different operating systems. It's better to use filepath.Join to construct file paths.

Suggested change
- if _, err := os.Stat(tmpDir + "/etc/alpine-release"); os.IsNotExist(err) {
+ if _, err := os.Stat(filepath.Join(tmpDir, "etc", "alpine-release")); os.IsNotExist(err) {

t.Error("alpine-release was not copied")
}

entries, err := os.ReadDir(tmpDir + "/etc")

medium

For portability, please use filepath.Join to construct file paths instead of string concatenation.

Suggested change
- entries, err := os.ReadDir(tmpDir + "/etc")
+ entries, err := os.ReadDir(filepath.Join(tmpDir, "etc"))

t.Fatalf("runExtract should not fail (best-effort): %v", err)
}

if _, err := os.Stat(tmpDir + "/etc/alpine-release"); os.IsNotExist(err) {

medium

For portability, please use filepath.Join to construct file paths instead of string concatenation.

Suggested change
- if _, err := os.Stat(tmpDir + "/etc/alpine-release"); os.IsNotExist(err) {
+ if _, err := os.Stat(filepath.Join(tmpDir, "etc", "alpine-release")); os.IsNotExist(err) {

}

// Build builds a container image from the given specification.
// If the spec has artifacts defined, they will be extracted after build.

medium

This comment is misleading. The (f *Forge) Build(spec *Spec) method only builds a single image and does not handle artifact extraction. The extraction logic is handled in the top-level Build function after all dependencies are built. I suggest removing this line to avoid confusion.

Suggested change
- // If the spec has artifacts defined, they will be extracted after build.
+ // Build builds a container image from the given specification.

@andrew-anyscale andrew-anyscale force-pushed the andrew/revup/main/wanda-artifact-spec branch from 7699d79 to 70e0db3 Compare February 3, 2026 22:18