OpenCLIP bootstrap downloader and hardening#70
Conversation
|
@claude review |
|
Claude finished @tazarov's task in 2m 37s —— View job Code Review: OpenCLIP Bootstrap Downloader and HardeningTasks
Overall this is a well-structured, security-conscious implementation. The checksum-by-default, HTTPS enforcement with tokens, atomic temp-file-then-rename, and redirect allowlisting are all solid choices. Below are findings organized by severity.
|
| Severity | Count | Topics |
|---|---|---|
| Medium | 3 | Test env var in prod config, redundant security setup, no context propagation |
| Minor/Low | 7 | URL path sanitization, string-based transient detection, lock spin interval, image resize quality, pixel-by-pixel preprocessing, closed-check accuracy, pad token assumption |
| Test gaps | 3 | LRU eviction, redirect integration, duplicated helpers |
The overall implementation quality is high. The most impactful fix before merge would be removing ONNXRUNTIME_TEST_MODEL_CACHE_DIR from the production default config, as it could silently cross test/production environment boundaries.
There was a problem hiding this comment.
Pull request overview
Adds an embeddings/openclip package that can (a) bootstrap OpenCLIP ONNX assets from Hugging Face into a local cache with integrity/redirect hardening, and (b) run text+image embedding inference with session caching and preprocessing aligned to the exported artifacts.
Changes:
- Introduce OpenCLIP asset bootstrapper with checksum/size verification, safe(ish) redirects, retry logic, and lock-based concurrency control.
- Add OpenCLIP embedder implementation (text + vision) including tokenizer integration, image preprocessing, LRU session caching, and similarity helpers.
- Add unit + integration tests and README documentation for OpenCLIP usage and default bootstrap behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| embeddings/openclip/bootstrap.go | Implements default asset download/cache with integrity checks, retries, redirects policy, and file locks. |
| embeddings/openclip/bootstrap_test.go | Adds coverage for caching, corrupt-file replacement, oversize protection, retry classification, lock behavior, and redirect allowlisting. |
| embeddings/openclip/embedder.go | Implements the OpenCLIP embedder (tokenization, preprocessing, ONNX sessions, normalization, similarity helpers). |
| embeddings/openclip/embedder_test.go | Adds unit tests for options/validation, preprocessing helpers, normalization, and similarity helpers. |
| embeddings/openclip/embedder_integration_test.go | End-to-end embedding test using bootstrapped artifacts (or env-provided paths) against ONNX Runtime. |
| README.md | Documents OpenCLIP artifacts, defaults, bootstrap usage, and similarity helpers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| revisionSlug := strings.ReplaceAll(cfg.revision, "/", "--") | ||
| baseDir := filepath.Join(cfg.cacheDir, repoSlug, revisionSlug) | ||
| if err := os.MkdirAll(baseDir, 0o700); err != nil { | ||
| return ModelAssets{}, fmt.Errorf("failed to create bootstrap cache directory %q: %w", baseDir, err) | ||
| } | ||
|
|
There was a problem hiding this comment.
repoID/revision are used to form baseDir via filepath.Join(cfg.cacheDir, repoSlug, revisionSlug), but repoSlug/revisionSlug are only sanitized by replacing "/" with "--". Values like ".." (or platform-specific separators) will cause filepath.Join to clean the path and can escape cacheDir, allowing writes outside the intended cache root. Consider rejecting unsafe values and/or verifying the resulting baseDir stays within cacheDir (e.g., compare filepath.Abs + filepath.Rel), and sanitize both repo and revision for all OS path separators.
| revisionSlug := strings.ReplaceAll(cfg.revision, "/", "--") | |
| baseDir := filepath.Join(cfg.cacheDir, repoSlug, revisionSlug) | |
| if err := os.MkdirAll(baseDir, 0o700); err != nil { | |
| return ModelAssets{}, fmt.Errorf("failed to create bootstrap cache directory %q: %w", baseDir, err) | |
| } | |
| repoSlug = strings.ReplaceAll(repoSlug, "\\", "--") | |
| if repoSlug == "." || repoSlug == ".." { | |
| return ModelAssets{}, fmt.Errorf("invalid repo ID %q: resolved slug %q is not allowed", cfg.repoID, repoSlug) | |
| } | |
| revisionSlug := strings.ReplaceAll(cfg.revision, "/", "--") | |
| revisionSlug = strings.ReplaceAll(revisionSlug, "\\", "--") | |
| if revisionSlug == "." || revisionSlug == ".." { | |
| return ModelAssets{}, fmt.Errorf("invalid revision %q: resolved slug %q is not allowed", cfg.revision, revisionSlug) | |
| } | |
| baseDir := filepath.Join(cfg.cacheDir, repoSlug, revisionSlug) | |
| absBaseDir, err := filepath.Abs(baseDir) | |
| if err != nil { | |
| return ModelAssets{}, fmt.Errorf("failed to resolve absolute path for bootstrap cache directory %q: %w", baseDir, err) | |
| } | |
| absCacheDir, err := filepath.Abs(cfg.cacheDir) | |
| if err != nil { | |
| return ModelAssets{}, fmt.Errorf("failed to resolve absolute path for cache root %q: %w", cfg.cacheDir, err) | |
| } | |
| rel, err := filepath.Rel(absCacheDir, absBaseDir) | |
| if err != nil { | |
| return ModelAssets{}, fmt.Errorf("failed to compute relative path from cache root %q to %q: %w", absCacheDir, absBaseDir, err) | |
| } | |
| if rel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) { | |
| return ModelAssets{}, fmt.Errorf("bootstrap cache directory %q escapes cache root %q", absBaseDir, absCacheDir) | |
| } | |
| if err := os.MkdirAll(absBaseDir, 0o700); err != nil { | |
| return ModelAssets{}, fmt.Errorf("failed to create bootstrap cache directory %q: %w", absBaseDir, err) | |
| } |
| func getRGB(c color.Color, convertRGB bool) (float32, float32, float32) { | ||
| if !convertRGB { | ||
| gray := color.GrayModel.Convert(c).(color.Gray) | ||
| v := float32(gray.Y) | ||
| return v, v, v | ||
| } | ||
| rgba := color.NRGBAModel.Convert(c).(color.NRGBA) | ||
| return float32(rgba.R), float32(rgba.G), float32(rgba.B) | ||
| } |
There was a problem hiding this comment.
getRGB treats do_convert_rgb=false as “convert to grayscale” by converting to color.Gray and returning (v,v,v). That changes the semantics of the preprocessor flag: disabling RGB conversion should usually mean “don’t perform a colorspace conversion step”, not “force grayscale”. This will silently distort color inputs when do_convert_rgb is false. Consider always extracting R/G/B channels (via color.NRGBAModel.Convert) and only using a special-case conversion when the model truly expects grayscale.
| maxBytes := cfg.maxDownloadBytes | ||
| if expectedSize > 0 && expectedSize < maxBytes { | ||
| maxBytes = expectedSize | ||
| } | ||
| if maxBytes <= 0 { | ||
| return fmt.Errorf("invalid max download limit for %s: %d", fileName, maxBytes) | ||
| } |
There was a problem hiding this comment.
If a caller sets expectedSize larger than cfg.maxDownloadBytes, maxBytes remains capped while downloadFileOnce still enforces written == expectedSize, making the download impossible to succeed (it will always hit size-mismatch after being limited). Consider validating expectedSize <= maxDownloadBytes up-front (or auto-bumping maxBytes to expectedSize when verification is enabled) so misconfiguration fails with a clear error.
|
@claude review |
|
Claude finished @tazarov's task in 2m 43s —— View job Tasks
Code Review: OpenCLIP Bootstrap Downloader and HardeningThis is a well-constructed, security-conscious implementation overall. The checksum-by-default, HTTPS enforcement with tokens, atomic temp-file-then-rename pattern, and redirect allowlisting are solid choices. Below are findings organized by severity, including items that supersede or add to the previous automated review.
|
| Severity | # | Topics |
|---|---|---|
| Medium | 3 | Test env var in prod config, URL path injection risk, no context propagation |
| Minor/Low | 10 | Redundant HTTP setup, string-based error classification, lock spin interval, retry jitter, silent stale reason, cross-modal nil checks, doConvertRGB semantics, resize quality, pixel-by-pixel throughput, pad token assumption |
| Test gaps | 3 | LRU eviction, redirect integration, duplicated helpers |
The most impactful fixes before merge:
- Remove
ONNXRUNTIME_TEST_MODEL_CACHE_DIRfromdefaultBootstrapConfig(env bleed) - Fix
getRGBfordoConvertRGB=false(correctness bug — grayscale conversion is wrong) - Apply
url.PathEscapeto URL path segments (injection hardening)
Summary
Closes #65.
This change introduces OpenCLIP model bootstrap and hardening in
embeddings/openclip, plus follow-up fixes from iterative review.Change Log
Added
README/documentation updates describing bootstrap defaults and offline artifact flow.EnsureDefaultAssets.Changed
Security and Reliability
Files
embeddings/openclip/bootstrap.goembeddings/openclip/bootstrap_test.goembeddings/openclip/embedder.goembeddings/openclip/embedder_test.goembeddings/openclip/embedder_integration_test.goREADME.mdVerification
go test ./embeddings/openclip/... -shortgo test ./...go vet ./embeddings/openclip/...Reviewer Checklist
Notes