Skip to content

Honor registry mirrors when fetching SOCI artifacts#1844

Open
milosgajdos wants to merge 7 commits into
awslabs:mainfrom
milosgajdos:honor-mirrors
Open

Honor registry mirrors when fetching SOCI artifacts#1844
milosgajdos wants to merge 7 commits into
awslabs:mainfrom
milosgajdos:honor-mirrors

Conversation

@milosgajdos
Copy link
Copy Markdown

@milosgajdos milosgajdos commented Jan 14, 2026

Issue #, if available:
Fixes #1013

Description of changes:
SOCI artifact discovery/fetch (index + zTOCs) was constructing an ORAS repository from the original image locator (e.g. docker.io/...), causing the snapshotter to ignore configured registry mirrors and talk to the origin registry for SOCI artifacts.

This changes the SOCI index fetch path to use the resolved docker.RegistryHost (mirror host + client) when constructing the ORAS repository locator. This makes SOCI artifact requests follow the same mirror configuration used for layer pulls.

Testing performed:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@milosgajdos milosgajdos requested a review from a team as a code owner January 14, 2026 13:56
@github-actions github-actions Bot added go Pull requests that update Go code testing Unit and/or integration tests labels Jan 14, 2026
Copy link
Copy Markdown
Contributor

@sondavidb sondavidb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for the contribution and sorry for leaving this hanging for so long. I left a comment to make sure the feature is working. Would also love an integration test as well if possible. After changes are made happy to run them through CI.

Comment thread fs/fs.go Outdated
SOCI artifact discovery/fetch (index + zTOCs) was constructing an ORAS repository
from the original image locator (e.g. docker.io/...), causing the snapshotter to
ignore configured registry mirrors and talk to the origin registry for SOCI
artifacts.

This changes the SOCI index fetch path to use the resolved docker.RegistryHost
(mirror host + client) when constructing the ORAS repository locator. This makes
SOCI artifact requests follow the same mirror configuration used for layer pulls.

Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
@milosgajdos
Copy link
Copy Markdown
Author

Hey! Sorry, I was away on hols, but will be picking this up today. Thanks for your patience

Follow-up to c751c6d to fully address mirror behavior for SOCI artifact
discovery/fetch (index + zTOCs).

The original fix moved SOCI artifact lookups away from the image’s origin
locator, but it still effectively depended on the first resolved host and
didn’t fully preserve per-host details. In practice that meant we could still
fail in mirror-heavy setups where the first endpoint is unavailable or where
the mirror is configured with HTTP/path differences.

This change finishes that work so SOCI index and zTOC fetches behave like the
rest of our mirror-aware pull flow. We now carry the full RegistryHost context
through the fetch path, rewrite locators with host path awareness, honor mirror
scheme when constructing ORAS repositories, and retry across all configured
hosts instead of stopping at the first one.

I also tightened error handling around missing host clients and added tests that
cover the new behavior directly: locator rewriting edge cases, host scheme
handling, and fallback from one mirror endpoint to another (including the
all-hosts-fail path). The goal is to make artifact discovery/fetch resilient
and predictable in the same environments where layer pulls already are.

Also adds a simple integration test.

Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
@milosgajdos
Copy link
Copy Markdown
Author

Would also love an integration test as well if possible.

Added some integration tests - not sure if that's what you are after but figured it could be something to at least start a conversation

@milosgajdos milosgajdos requested a review from sondavidb March 6, 2026 23:08
@sondavidb
Copy link
Copy Markdown
Contributor

Ah, I was looking more at testing in the integration/ folder. I think something along the lines of mirroring the image to a local registry, having the SOCI config primarily pull from some bogus source, and then have the mirror point to the local registry. Is that a sensical workflow?

@milosgajdos
Copy link
Copy Markdown
Author

Oh I see! I misunderstood 😬 yeah I can look into that; it seems sensible yeah

Make TestMirror explicitly exercise fallback by configuring two mirrors for
the same origin host: a dead mirror first, then the working local mirror.
This verifies we recover from a bad mirror endpoint and still pull via SOCI.

Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
@milosgajdos
Copy link
Copy Markdown
Author

@sondavidb I smacked in an integration test case in 76fff9b

My thinking was adding a case where we test a fallback to good mirror when trying a dead mirror (bogus) first. Basically that should verify that we recover from a bad mirror endpoint and still pull via SOCI. LMK if this is along the lines you're thinking too.

@sondavidb
Copy link
Copy Markdown
Contributor

sondavidb commented Mar 10, 2026

Yeah that should be fine by me. I'll look closer at it tomorrow but will run our CI for now. Thanks!

Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
Comment thread Dockerfile
@milosgajdos
Copy link
Copy Markdown
Author

I've added some additional build instructions to docs and some improvements for [the speed of] CI.

Comment thread Dockerfile
Comment thread fs/fetch_soci_index_test.go
Comment thread fs/fs.go Outdated
Comment thread Makefile Outdated
Co-authored-by: David Son <55555210+sondavidb@users.noreply.github.com>
@sondavidb
Copy link
Copy Markdown
Contributor

Just FYI for whenever this gets merged, we do rebase and merge so you will have to squash the commits accordingly and have each commit DCO signed. Doesn't have to be now of course, just a heads up 🙂

Retry both MountParallel and MountLocal across the full configured
registry host list instead of only using the first host entry. We build
per-host refspecs up front so each attempt uses the correct mirror host
and path prefix.

For MountLocal, we keep using the real content store during retries so
content fetched from a fallback host can still be reused locally during
apply, and do not delete cached content on local apply failures.

Clear partial contents from the prepared mountpoint between local retry
attempts while preserving the mountpoint itself and its metadata. This
allows later host attempts and containerd fallback to reuse the same
upperdir safely.

We added regression tests for local host fallback, cached blob preservation,
and mountpoint cleanup.

Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
@milosgajdos milosgajdos requested a review from sondavidb March 12, 2026 23:25
@milosgajdos
Copy link
Copy Markdown
Author

CI fail is on the accepted commit suggestion 😬

@sondavidb
Copy link
Copy Markdown
Contributor

CI fail just looks like DCO. Unless I'm misunderstanding, which, apologies if so

@milosgajdos
Copy link
Copy Markdown
Author

CI fail just looks like DCO. Unless I'm misunderstanding, which, apologies if so

yeah, GH [platform] fails to reconcile commit signatures when you accept suggestions on a fork; that failing DCO is my accepting your Makefile suggestion - itll go away after squash

@milosgajdos
Copy link
Copy Markdown
Author

@sondavidb anything else you want me to address here?

@sondavidb
Copy link
Copy Markdown
Contributor

sondavidb commented Mar 17, 2026

Nope! Just need to take a look a little closer at the changes before approving. But at a glance they look good.

Thanks again!

Copy link
Copy Markdown
Contributor

@sondavidb sondavidb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a small thing and squash and I'll approve it. I'll also see if I can get someone else on my team to take a look.

Awesome work on this, we've been wanting this for a while :)

Comment thread fs/fs.go
}
}

err := fs.rebase(ctx, desc.Digest, imageDigest, mountpoint)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebasing can probably be put outside of this loop as it's a fully offline-operation so we don't need to try it more than once on a bad registry.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm wondering... shouldnt rebase actually remain in the loop because it is where the current host’s async premount failure becomes visible...preloadAllLayers only starts premount work. The actual host-specific failure can surface later when rebase waits on the layer job...wouldnt we lose the retry if we moved it outside the loop? 🤔

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rename/stat portion of rebase is offline, but rebase currently also waits for the host-specific premount job via layerJob.errCh https://github.com/milosgajdos/soci-snapshotter/blob/9315f827d274b45862a1854c6cacd4a78bf5a17f/fs/fs.go#L629

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jobs are keyed via digest and not by ref, and if preloading a layer fails then it will cancel the entire image job and return, so we won't be accidentally getting another registry host's image job.

Though, I wonder now if that's a problem. Should fallbacks for the parallel pull mode be per-layer or per-image? I think per-layer would be more consistent with other snapshotter behaviors. If so, we should maybe move this to the preloadAllLayers and rotate through registry hosts per layer.

But I'm also OK taking this as a followup. We've iterated on this plenty already. (You're more than welcome to change this now if you wish, though.)

Comment thread fs/fs.go
@milosgajdos milosgajdos requested a review from sondavidb March 18, 2026 16:56
sondavidb
sondavidb previously approved these changes Mar 18, 2026
@milosgajdos
Copy link
Copy Markdown
Author

@sondavidb mind poking some maintainers so we can move forward with this? 🙏

@sondavidb
Copy link
Copy Markdown
Contributor

Sure, sorry for the drag @milosgajdos. I've gone ahead and pinged someone but as he's not around right now I don't expect him to look at this immediately.

Copy link
Copy Markdown
Member

@henry118 henry118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milosgajdos Thanks for this contribution.

At high level, it occurs to me that the retry granularity here is a bit coarse. I think the only errors that worth retrying the next host are the host I/O related errors. i.e, failing to resolve the host. Other types of errors such as post-processing of downloaded artifacts (decompression/unpacking) should result an immediate failure.

So in other words, I think the logic of iterating the hosts should be implemented at lower level where the actual HTTP requests are initiated.

Comment thread fs/fs.go
if err != nil {
return fmt.Errorf("failed to preload layers for image manifest digest %s: %w", imageDigest, err)
hostErr = errors.Join(hostErr, fmt.Errorf("registry host %q: failed to rebase layer %s: %w", hostRef.host.Host, desc.Digest, err))
continue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase failures should have nothing to do with registry mirrors. i don't think we need to try the next host.

Comment thread fs/fs.go
Comment on lines +474 to 478
if hostErr != nil {
return fmt.Errorf("failed to preload layers for image manifest digest %s: %w", imageDigest, hostErr)
}
return fmt.Errorf("failed to preload layers for image manifest digest %s", imageDigest)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like these can be simplified. esp the last return somehow swallowed the internal errors.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely missed this comment; work has kept me busy for a few weeks, but I'll be picking this up today

@YQ-Wang
Copy link
Copy Markdown

YQ-Wang commented Apr 9, 2026

Interested user here :) I’ve been following this because I’d like to use the feature.

The remaining blocker seems to be retry scope rather than the mirror-awareness behavior itself. The maintainer comments make sense to me: retrying the next host for actual registry-host / network-facing failures seems reasonable, while local post-processing failures like unpack/apply/rebase should fail immediately. It also seems important to preserve the underlying error rather than returning a generic fallback error.

Would a narrow fix along those lines be enough to unblock this PR, with any finer-grained retry redesign handled as follow-up? Happy to test an updated branch or create a PR to this current branch if useful .

Signed-off-by: Milos Gajdos <milosthegajdos@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code testing Unit and/or integration tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Artifact fetcher does not honor mirrors

4 participants