Honor registry mirrors when fetching SOCI artifacts#1844
Conversation
sondavidb
left a comment
There was a problem hiding this comment.
Hey, thanks for the contribution and sorry for leaving this hanging for so long. I left a comment to make sure the feature is working. Would also love an integration test as well if possible. After changes are made happy to run them through CI.
SOCI artifact discovery/fetch (index + zTOCs) was constructing an ORAS repository from the original image locator (e.g. docker.io/...), causing the snapshotter to ignore configured registry mirrors and talk to the origin registry for SOCI artifacts. This changes the SOCI index fetch path to use the resolved docker.RegistryHost (mirror host + client) when constructing the ORAS repository locator. This makes SOCI artifact requests follow the same mirror configuration used for layer pulls. Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
|
Hey! Sorry, I was away on hols, but will be picking this up today. Thanks for your patience |
Follow-up to c751c6d to fully address mirror behavior for SOCI artifact discovery/fetch (index + zTOCs). The original fix moved SOCI artifact lookups away from the image’s origin locator, but it still effectively depended on the first resolved host and didn’t fully preserve per-host details. In practice that meant we could still fail in mirror-heavy setups where the first endpoint is unavailable or where the mirror is configured with HTTP/path differences. This change finishes that work so SOCI index and zTOC fetches behave like the rest of our mirror-aware pull flow. We now carry the full RegistryHost context through the fetch path, rewrite locators with host path awareness, honor mirror scheme when constructing ORAS repositories, and retry across all configured hosts instead of stopping at the first one. I also tightened error handling around missing host clients and added tests that cover the new behavior directly: locator rewriting edge cases, host scheme handling, and fallback from one mirror endpoint to another (including the all-hosts-fail path). The goal is to make artifact discovery/fetch resilient and predictable in the same environments where layer pulls already are. Also adds a simple integration test. Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
76ea63e to
72a8115
Compare
Added some integration tests - not sure if that's what you are after but figured it could be something to at least start a conversation |
|
Ah, I was looking more at testing in the |
|
Oh I see! I misunderstood 😬 yeah I can look into that; it seems sensible yeah |
Make TestMirror explicitly exercise fallback by configuring two mirrors for the same origin host: a dead mirror first, then the working local mirror. This verifies we recover from a bad mirror endpoint and still pull via SOCI. Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
|
@sondavidb I smacked in an integration test case in 76fff9b My thinking was adding a case where we test a fallback to good mirror when trying a dead mirror (bogus) first. Basically that should verify that we recover from a bad mirror endpoint and still pull via SOCI. LMK if this is along the lines you're thinking too. |
|
Yeah that should be fine by me. I'll look closer at it tomorrow but will run our CI for now. Thanks! |
Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
|
I've added some additional build instructions to docs and some improvements for [the speed of] CI. |
Co-authored-by: David Son <55555210+sondavidb@users.noreply.github.com>
|
Just FYI for whenever this gets merged, we do rebase and merge so you will have to squash the commits accordingly and have each commit DCO signed. Doesn't have to be now of course, just a heads up 🙂 |
Retry both MountParallel and MountLocal across the full configured registry host list instead of only using the first host entry. We build per-host refspecs up front so each attempt uses the correct mirror host and path prefix. For MountLocal, we keep using the real content store during retries so content fetched from a fallback host can still be reused locally during apply, and do not delete cached content on local apply failures. Clear partial contents from the prepared mountpoint between local retry attempts while preserving the mountpoint itself and its metadata. This allows later host attempts and containerd fallback to reuse the same upperdir safely. We added regression tests for local host fallback, cached blob preservation, and mountpoint cleanup. Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
|
CI fail is on the accepted commit suggestion 😬 |
|
CI fail just looks like DCO. Unless I'm misunderstanding, which, apologies if so |
yeah, GH [platform] fails to reconcile commit signatures when you accept suggestions on a fork; that failing DCO is my accepting your Makefile suggestion - itll go away after squash |
|
@sondavidb anything else you want me to address here? |
|
Nope! Just need to take a look a little closer at the changes before approving. But at a glance they look good. Thanks again! |
sondavidb
left a comment
There was a problem hiding this comment.
LGTM, just a small thing and squash and I'll approve it. I'll also see if I can get someone else on my team to take a look.
Awesome work on this, we've been wanting this for a while :)
| } | ||
| } | ||
|
|
||
| err := fs.rebase(ctx, desc.Digest, imageDigest, mountpoint) |
There was a problem hiding this comment.
Rebasing can probably be put outside of this loop as it's a fully offline-operation so we don't need to try it more than once on a bad registry.
There was a problem hiding this comment.
So, I'm wondering... shouldnt rebase actually remain in the loop because it is where the current host’s async premount failure becomes visible...preloadAllLayers only starts premount work. The actual host-specific failure can surface later when rebase waits on the layer job...wouldnt we lose the retry if we moved it outside the loop? 🤔
There was a problem hiding this comment.
the rename/stat portion of rebase is offline, but rebase currently also waits for the host-specific premount job via layerJob.errCh https://github.com/milosgajdos/soci-snapshotter/blob/9315f827d274b45862a1854c6cacd4a78bf5a17f/fs/fs.go#L629
There was a problem hiding this comment.
The jobs are keyed via digest and not by ref, and if preloading a layer fails then it will cancel the entire image job and return, so we won't be accidentally getting another registry host's image job.
Though, I wonder now if that's a problem. Should fallbacks for the parallel pull mode be per-layer or per-image? I think per-layer would be more consistent with other snapshotter behaviors. If so, we should maybe move this to the preloadAllLayers and rotate through registry hosts per layer.
But I'm also OK taking this as a followup. We've iterated on this plenty already. (You're more than welcome to change this now if you wish, though.)
|
@sondavidb mind poking some maintainers so we can move forward with this? 🙏 |
|
Sure, sorry for the drag @milosgajdos. I've gone ahead and pinged someone but as he's not around right now I don't expect him to look at this immediately. |
henry118
left a comment
There was a problem hiding this comment.
@milosgajdos Thanks for this contribution.
At high level, it occurs to me that the retry granularity here is a bit coarse. I think the only errors that worth retrying the next host are the host I/O related errors. i.e, failing to resolve the host. Other types of errors such as post-processing of downloaded artifacts (decompression/unpacking) should result an immediate failure.
So in other words, I think the logic of iterating the hosts should be implemented at lower level where the actual HTTP requests are initiated.
| if err != nil { | ||
| return fmt.Errorf("failed to preload layers for image manifest digest %s: %w", imageDigest, err) | ||
| hostErr = errors.Join(hostErr, fmt.Errorf("registry host %q: failed to rebase layer %s: %w", hostRef.host.Host, desc.Digest, err)) | ||
| continue |
There was a problem hiding this comment.
rebase failures should have nothing to do with registry mirrors. i don't think we need to try the next host.
| if hostErr != nil { | ||
| return fmt.Errorf("failed to preload layers for image manifest digest %s: %w", imageDigest, hostErr) | ||
| } | ||
| return fmt.Errorf("failed to preload layers for image manifest digest %s", imageDigest) | ||
| } |
There was a problem hiding this comment.
Feel like these can be simplified. esp the last return somehow swallowed the internal errors.
There was a problem hiding this comment.
I completely missed this comment; work has kept me busy for a few weeks, but I'll be picking this up today
|
Interested user here :) I’ve been following this because I’d like to use the feature. The remaining blocker seems to be retry scope rather than the mirror-awareness behavior itself. The maintainer comments make sense to me: retrying the next host for actual registry-host / network-facing failures seems reasonable, while local post-processing failures like unpack/apply/rebase should fail immediately. It also seems important to preserve the underlying error rather than returning a generic fallback error. Would a narrow fix along those lines be enough to unblock this PR, with any finer-grained retry redesign handled as follow-up? Happy to test an updated branch or create a PR to this current branch if useful . |
Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
Issue #, if available:
Fixes #1013
Description of changes:
SOCI artifact discovery/fetch (index + zTOCs) was constructing an ORAS repository from the original image locator (e.g. docker.io/...), causing the snapshotter to ignore configured registry mirrors and talk to the origin registry for SOCI artifacts.
This changes the SOCI index fetch path to use the resolved docker.RegistryHost (mirror host + client) when constructing the ORAS repository locator. This makes SOCI artifact requests follow the same mirror configuration used for layer pulls.
Testing performed:
✅
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.