feat(scale-out): support for Artifact GQL with local storage scale out #3074


Open

vrajashkr wants to merge 1 commit into main from feat/gql-scale-out-v2

Conversation

@vrajashkr (Contributor) commented Apr 6, 2025

What type of PR is this?
feature

Which issue does this PR fix:
Towards #2434

What does this PR do / Why do we need it:
Previously, only dist-spec APIs were supported for scale-out: in a shared storage environment, the metadata is shared, so any instance can correctly respond to GQL queries because all of the data is available to it.

In a local scale-out cluster deployment, the metadata store, in addition to the file storage, is isolated to each member of the cluster. Because of this, GQL queries also need to be proxied for UI and client requests to work as expected.

This change introduces a new GQL proxy + a generic fan-out handler for GQL requests.

Testing done on this change:
Unit tests for the supported GQL operations are included as part of the PR.
Manual testing is still TODO.

Will this break upgrades or downgrades?
No

Does this PR introduce any user-facing change?:
Yes

With this change, users will be able to execute supported GQL operations on any member of a scale-out zot cluster with local storage only. This will facilitate GraphQL queries from the UI as well as other zot clients that use GraphQL to query information.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@vrajashkr (Contributor, Author) commented Apr 6, 2025

A previous attempt at this was made in #2588

How is this different?

  • Instead of a dedicated handler for each operation, this PR explores an approach with generic and specific handlers. (Refer to [Feat]: Scale-out cluster support for independent per-instance storage deployments #2434 (comment))
  • Generic handlers cater to common use cases such as a fan-out proxy for GQL or single-target proxying.
  • Specific handlers cater to use cases that do not fit a generic handler.
  • GQL schema merging was done manually in the previous PR; in this PR, a generic approach dynamically merges the responses by treating them as maps instead of dedicated structs. This makes the code reusable across all types of handlers and reduces the amount of code to be maintained.

Note: pagination is still pending.
Note: the code itself is still in a proof-of-concept state; quite a bit of cleanup is needed.

log log.Logger,
gqlSchema *ast.Schema,
) func(handler http.Handler) http.Handler {
proxyFunctionalityMap := map[string]GqlScaleOutHandlerFunc{
vrajashkr (Contributor, Author):

@rchincha this map statically stores which GQL operation needs which kind of handler.

There will be a couple of generic handlers such as fanout and some specific handlers if any of the operations need custom behavior.

What do you think about this approach?

It's better than last time since we don't need to maintain a separate handler for each operation type, but I'm open to more ideas on making this better.
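
A minimal sketch of what such a static routing table could look like (the handler names, the GqlScaleOutHandlerFunc signature, and the exact set of entries here are illustrative assumptions, not the code in this PR):

```go
package main

import (
	"fmt"
	"net/http"
)

// GqlScaleOutHandlerFunc is assumed to serve one GQL operation on behalf of the
// cluster, either by fanning out to every member or by proxying once to a single target.
type GqlScaleOutHandlerFunc func(w http.ResponseWriter, r *http.Request)

func fanOutGqlHandler(w http.ResponseWriter, r *http.Request)    { /* proxy to every member and merge */ }
func proxyOnceGqlHandler(w http.ResponseWriter, r *http.Request) { /* proxy to the repo's owning member */ }

// proxyFunctionalityMap statically routes each supported GQL operation to the
// kind of handler it needs; operations absent from the map are served locally.
var proxyFunctionalityMap = map[string]GqlScaleOutHandlerFunc{
	"GlobalSearch":            fanOutGqlHandler,
	"ImageListForDigest":      fanOutGqlHandler,
	"RepoListWithNewestImage": fanOutGqlHandler,
	"ImageList":               proxyOnceGqlHandler,
	"ExpandedRepoInfo":        proxyOnceGqlHandler,
	"CVEListForImage":         proxyOnceGqlHandler,
}

func main() {
	handler, supported := proxyFunctionalityMap["GlobalSearch"]
	fmt.Println(supported, handler != nil) // true true: GlobalSearch is fanned out
}
```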

}
}

func deepMergeMaps(a, b map[string]any) map[string]any {
vrajashkr (Contributor, Author):

@rchincha this is the new approach for aggregating the data. Since the response is JSON, we can aggregate the data as a map type with individual logic for the embedded types: nested maps, numeric types, and arrays.

This should be common across all handlers and may change when pagination comes into the picture.
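
A minimal sketch of this kind of dynamic merge, assuming the rules implied above: nested maps are merged recursively, numeric values are summed, arrays are concatenated, and anything else is overwritten (the response field names in the example are illustrative):

```go
package main

import "fmt"

// deepMergeMaps merges b into a copy of a. Assumed rules for this sketch:
// nested maps are merged recursively, numeric values are summed, arrays are
// concatenated, and any other type in b overwrites the value in a.
func deepMergeMaps(a, b map[string]any) map[string]any {
	out := make(map[string]any, len(a))
	for k, v := range a {
		out[k] = v
	}

	for k, bv := range b {
		av, exists := out[k]
		if !exists {
			out[k] = bv
			continue
		}

		switch avTyped := av.(type) {
		case map[string]any:
			if bvTyped, ok := bv.(map[string]any); ok {
				out[k] = deepMergeMaps(avTyped, bvTyped)
				continue
			}
		case []any:
			if bvTyped, ok := bv.([]any); ok {
				out[k] = append(append([]any{}, avTyped...), bvTyped...)
				continue
			}
		case float64: // encoding/json decodes JSON numbers as float64
			if bvTyped, ok := bv.(float64); ok {
				out[k] = avTyped + bvTyped
				continue
			}
		}

		out[k] = bv // fall back to overwriting with the newer value
	}

	return out
}

func main() {
	a := map[string]any{"GlobalSearch": map[string]any{"Repos": []any{"r1"}, "Page": map[string]any{"ItemCount": float64(1)}}}
	b := map[string]any{"GlobalSearch": map[string]any{"Repos": []any{"r2"}, "Page": map[string]any{"ItemCount": float64(2)}}}

	fmt.Println(deepMergeMaps(a, b))
}
```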

Contributor:

This sounds good. The only concern is whether we can validate the schema so that we never emit garbage.

@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from a5dbb43 to ac85801 Compare April 6, 2025 20:17
{
"distSpecVersion": "1.1.0",
"storage": {
"rootDirectory": "./workspace/zot/data/mem1",
Contributor:

We have to figure out a scheme to append a member path.
Would like to have a single zot configuration that folks don't have to tweak.

vrajashkr (Contributor, Author) commented Apr 17, 2025:

I agree. The reason for this is mostly because I was starting 2 binaries on the same host for development (so I had to change the path and port). For an actual deployment, the config files would be identical.
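
For reference, a member configuration along these lines could be shared unchanged by every member of the cluster (the paths, member addresses, and hashKey value here are placeholders, not values from this PR):

```json
{
  "distSpecVersion": "1.1.0",
  "storage": {
    "rootDirectory": "/var/lib/zot/data"
  },
  "http": {
    "address": "0.0.0.0",
    "port": "8080"
  },
  "cluster": {
    "members": [
      "zot-member-0:8080",
      "zot-member-1:8080"
    ],
    "hashKey": "loremipsumdolors"
  },
  "extensions": {
    "search": {
      "enable": true
    }
  }
}
```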

Contributor:

Are you able to run a zb benchmark to show things scaling up?

vrajashkr (Contributor, Author):

We did see some results when we developed scale-out for the dist-spec APIs; however, can zb benchmark GQL queries? If I recall correctly, it only covers the dist-spec APIs.

@vrajashkr (Contributor, Author) commented Apr 28, 2025

Latest analysis of GQL queries:
✅ - indicates a GQL query for which local storage scale-out support has been implemented in this PR without any issues.
🚨 - indicates a GQL query for which local storage scale-out support cannot be implemented at this time due to technical limitations.
🏗️ - indicates a GQL query that has some TODOs.
📝 - indicates a GQL query with some notes.
A sketch of the proxy-once routing used by several of these queries follows the list below.

CVEListForImage

  • Proxy-once to target
  • uses "repository:tag" or "repository@digest"

CVEDiffListForImages 🚨

  • image and compared image names in "repository:tag" or "repository@digest" format.
  • challenging as the images may be on different servers.
  • requires metadata of the input images to be available locally on the server that computes the result. Not implementable with this approach.

ImageListForCVE

  • fan-out to all members

ImageListWithCVEFixed

  • use the "image" param which indicates the repository name to proxy once to the target server.

ImageListForDigest

  • fan-out to all members

RepoListWithNewestImage 🏗️

  • fan-out to all members

ImageList

  • use the repo param to proxy once to the target server.

ExpandedRepoInfo

  • use the repo param to proxy once to the target server.

GlobalSearch

  • fan-out to all members
  • future optimization: if the repo names are known, these don't need to be fanned-out.

DerivedImageList 🚨

  • uses "repository:tag" format.
  • however, expected output is all images using the given one so a fan-out is needed.
  • requires metadata of the input image to be available locally on every server. Not implementable with this approach.

BaseImageList 🚨

  • image uses "repository:tag" format.
  • as expectation is to get all images, a fan-out is needed.
  • requires metadata of the input image to be available locally on every server. Not implementable with this approach.

Image

  • image uses "repository:tag" format
  • use repo value to proxy once to target

Referrers

  • has a repo as an argument.
  • use repo value to proxy once to target

StarredRepos 📝

  • fan-out to all members
  • this is part of the userprefs system so has some additional considerations.

BookmarkedRepos 📝

  • fan-out to all members
  • this is part of the userprefs system so has some additional considerations.
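
As referenced above, a rough sketch of the proxy-once routing used by the repo-scoped queries; the hash choice and helper name here are assumptions for illustration, and zot's actual member selection may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashRepoToMember is a hypothetical helper that maps a repository name to the
// index of the cluster member owning it. Because every member computes the same
// owner for a given repo, a repo-scoped GQL query can be proxied exactly once.
func hashRepoToMember(repo string, memberCount int) int {
	h := fnv.New32a()
	h.Write([]byte(repo))

	return int(h.Sum32() % uint32(memberCount))
}

func main() {
	members := []string{"zot-member-0:8080", "zot-member-1:8080", "zot-member-2:8080"}

	// For queries such as ImageList or ExpandedRepoInfo, the repo argument alone
	// determines the single member that must answer the query.
	repo := "library/alpine"
	target := members[hashRepoToMember(repo, len(members))]
	fmt.Printf("proxy %q query once to %s\n", repo, target)
}
```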

@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch 2 times, most recently from 13f6084 to c991fe4 Compare April 29, 2025 15:07

codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 78.00000% with 77 lines in your changes missing coverage. Please review.

Project coverage is 90.48%. Comparing base (100dfec) to head (40086a3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ns/search/gql_proxy/generic_fan_out_gql_handler.go | 38.46% | 20 Missing and 4 partials ⚠️ |
| ...search/gql_proxy/generic_proxy_once_gql_handler.go | 76.31% | 15 Missing and 3 partials ⚠️ |
| pkg/extensions/search/gql_proxy/handler_utils.go | 68.51% | 14 Missing and 3 partials ⚠️ |
| pkg/api/cluster_proxy.go | 79.24% | 9 Missing and 2 partials ⚠️ |
| pkg/extensions/search/gql_proxy/gql_proxy.go | 92.22% | 6 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3074      +/-   ##
==========================================
- Coverage   90.63%   90.48%   -0.15%     
==========================================
  Files         182      187       +5     
  Lines       32909    33177     +268     
==========================================
+ Hits        29826    30020     +194     
- Misses       2319     2379      +60     
- Partials      764      778      +14     



for _, targetMember := range config.Cluster.Members {
proxyResponse, err := proxy.ProxyHTTPRequest(request.Context(), request, targetMember, config)
if err != nil {
Contributor:

Do we want to return a failure even if just one member fails?
I think if at least one member responds, we should return that and swallow the errors, maybe with some indicator somewhere, such as the logs. HTTP status 206 Partial Content could also work since this is our own API.

vrajashkr (Contributor, Author):

That sounds like a good idea. One thought I had is that instead of swallowing the error, perhaps we could append an error to the errors list key in the GQL response and send it to the client, so there is awareness of some error in the system.

The client can choose to ignore the error and use the valid data in the response, or, ideally, show the valid data as well as indicate that there were some errors in processing. With this approach, status 206 could be the return status, as you've suggested.

What do you think?
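
A minimal sketch of that idea, assuming the aggregated response is handled as a generic map and follows the usual GraphQL `data`/`errors` layout (the member names and messages are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// appendGqlError records a failed cluster member in the "errors" list of an
// aggregated GraphQL response map, so the client can tell the data may be
// partial instead of the failure being silently swallowed.
func appendGqlError(response map[string]any, member string, err error) {
	errorsList, _ := response["errors"].([]any)
	errorsList = append(errorsList, map[string]any{
		"message": fmt.Sprintf("failed to query cluster member %s: %v", member, err),
	})
	response["errors"] = errorsList
}

func main() {
	aggregated := map[string]any{
		"data": map[string]any{"GlobalSearch": map[string]any{"Repos": []any{"r1"}}},
	}

	// A member that failed during fan-out is still surfaced to the client,
	// which could be paired with an HTTP 206 Partial Content status.
	appendGqlError(aggregated, "zot-member-1:8080", fmt.Errorf("connection refused"))

	out, _ := json.Marshal(aggregated)
	fmt.Println(string(out))
}
```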

@vrajashkr (Contributor, Author):

Some observations:
DerivedImageList and BaseImageList both need the metadata of the input image available locally to perform their logic, so they can't be supported with local scale-out.

Currently trying out implementing CVEListForImage. Writing tests for it is trickier than expected since Trivy needs to be running to scan the vulnerable layers. Thinking about alternative approaches for this test case (and for future CVE-related GQL test cases).

@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from c991fe4 to 5696b3f Compare May 30, 2025 18:36
@vrajashkr (Contributor, Author):

> Currently trying out implementing CVEListForImage. Working on writing tests for it which is trickier than expected since trivy needs to be running to scan the vulnerable layers. Thinking about some alternative approaches to this test case (and for future CVE related GQL test cases).

Solved this by manually replacing the cveScanner with a mock instance based on other existing examples in the test source code.

@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from 5696b3f to 9601786 Compare May 31, 2025 20:02
@vrajashkr vrajashkr changed the title feat/gql-proxy: new approach for GraphQL local scale out feat(scale-out): support for Artifact GQL with local storage scale out May 31, 2025
@vrajashkr (Contributor, Author):

All the artifact-related GQL queries are supported in the latest commit, except for the operations that are explicitly marked as unsupported.

Things that need to be worked on:

  • Pagination
  • Standardized error handling
  • Handling for Partial responses

@vrajashkr (Contributor, Author):

Looks like an unrelated failure; I'll force-push once again to retrigger the tests.

panic: listen tcp 127.0.0.1:42000: bind: address already in use

goroutine 3750 [running]:
zotregistry.dev/zot/pkg/test/common.(*ControllerManager).RunServer(0xc0020cc170)
	zotregistry.dev/zot/pkg/test/common/utils.go:78 +0xfe
zotregistry.dev/zot/pkg/test/common.(*ControllerManager).StartServer.func1()
	zotregistry.dev/zot/pkg/test/common/utils.go:88 +0x45
created by zotregistry.dev/zot/pkg/test/common.(*ControllerManager).StartServer in goroutine 3611
	zotregistry.dev/zot/pkg/test/common/utils.go:87 +0x125
FAIL	zotregistry.dev/zot/pkg/extensions/sync	28.097s

@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from 9601786 to 7691c11 Compare May 31, 2025 20:25
@vrajashkr (Contributor, Author):

Looks like the re-push had some more intermittent failures:

  • stateless with minio and redis failed
  • Additionally, an error in the extensions tests related to sync:
--- FAIL: TestSignatures (16.25s)
    sync_test.go:4653: 41151
    sync_test.go:5168: 33735
FAIL

Will force push another commit.

@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from 7691c11 to 286cdcd Compare June 1, 2025 05:07
@vrajashkr vrajashkr marked this pull request as ready for review June 1, 2025 05:07
@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from 286cdcd to 3461e7b Compare June 2, 2025 16:41
"zotregistry.dev/zot/pkg/proxy"
)

// ClusterProxy wraps an http.HandlerFunc which requires proxying between zot instances to ensure
Contributor:

This moved from pkg/api/proxy.go to here? Reason?

vrajashkr (Contributor, Author):

This content was moved because the original api/proxy.go was split: the common functionality that the GQL proxy required was moved into a new package called 'proxy' to avoid a cyclic dependency.

The new file name here, cluster_proxy.go, was chosen arbitrarily.

Why is this change needed?
=============================
Currently, only dist-spec APIs have support for scale-out proxying with
cloud-backed storage as well as local storage.
If a user would like to use scale-out with only local storage,
the metadata DB is isolated per instance of zot and so is the
repository data. Accordingly, GQL queries also need to be proxied
to gather all the relevant information for zot clients.

What has changed?
=============================
- Wrapper crafted around the GQL server that checks for the GQL operation
and performs the necessary proxying as required.
- All GQL logic itself is still handled by the gqlgen GQL server.
- New GQL proxy introduced to support the GQL proxying requirements.

How does it work?
=============================
Upon receiving a GQL request, the operation type is checked.
Based on a static map, the GQL query is either proxied out to other
members in a fan-out fashion, or proxied directly to a single target
if the repository information is known.
The results are received and aggregated dynamically based on their data types.

Supported GQL queries
=============================
Currently, the following GQL queries are supported:
GlobalSearch - fanout
ImageList - proxy to target
ExpandedRepoInfo - proxy to target
CVEListForImage - proxy to target
ImageListForCVE - fanout
ImageListWithCVEFixed - proxy to target
ImageListForDigest - fanout
RepoListWithNewestImage - fanout
Image - proxy to target
Referrers - proxy to target

Unsupported GQL queries
==============================
CVEDiffListForImages - needs metadata for both images on the handling server
DerivedImageList - needs metadata for argument image on the handling server
BaseImageList - needs metadata for argument image on the handling server

StarredRepos - part of userprefs and needs additional consideration
BookmarkedRepos - part of userprefs and needs additional consideration

What is yet to be done
==============================
- Pagination is broken entirely: if a client asks for 2 entries and
the request is a fan-out type, the client will get up to 2 entries
from each cluster member.
- Error handling needs to be standardized.
- Support for Partial Response is required.

Signed-off-by: Vishwas Rajashekar <[email protected]>
@vrajashkr vrajashkr force-pushed the feat/gql-scale-out-v2 branch from 3461e7b to 40086a3 Compare June 21, 2025 19:44
@vrajashkr (Contributor, Author):

A couple of updates in the last push:

  1. Rebased to the latest main branch
  2. Part of TestRepoListWithNewestImageWithScaleOutProxyLocalStorage was previously commented out due to an inconsistency in the tag response. After debugging in a single-zot setup, I learned that the Created timestamp in the image config is used to determine the newest tag, not the time at which the tag itself was pushed: it's the data in the Config that matters. This has been fixed in the test code by adding a delay to the Created timestamp, and the test is now passing (see the sketch below).
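
A minimal sketch of that idea using the OCI image-spec types (the helper names and values in zot's actual test code differ):

```go
package main

import (
	"fmt"
	"time"

	ispec "github.com/opencontainers/image-spec/specs-go/v1"
)

func main() {
	// RepoListWithNewestImage ranks tags by the Created field in the image
	// config, not by push time, so the test sets distinct Created values.
	older := time.Now()
	newer := older.Add(1 * time.Minute) // the deliberate delay added in the test

	oldConfig := ispec.Image{Created: &older}
	newConfig := ispec.Image{Created: &newer}

	fmt.Println("older tag created:", oldConfig.Created.Format(time.RFC3339))
	fmt.Println("newer tag created:", newConfig.Created.Format(time.RFC3339))
}
```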

@vrajashkr (Contributor, Author):

While unit tests are working, manual testing hit a snag:
when zot uses auth for the UI (a session and cookie are involved), the session is only available on the zot instance where the initial login was performed.
Since the load balancer sets a sticky cookie, the user will always be sent to the same instance, but internally, the GQL request would be proxied to one or more instances where the session is not present, and this fails auth there.

The error seen is:

"error":"securecookie: the value is not valid","goroutine":729,"caller":"zotregistry.dev/zot/pkg/api/authn.go:858","time":"2025-06-19T17:35:15.281808846Z","message":"failed to decode existing session"

At this point, it would seem that there is a need for some way to distribute the session data across the members as well, but this needs further discussion.

@vrajashkr (Contributor, Author):

Ecosystem client tools tests appear to have failed.

I see # {"level":"error","error":"listen tcp 0.0.0.0:39642: bind: address already in use","time":"2025-06-21T19:50:03.518264715Z","message":"failed to start controller, exiting"}

After that message, a ton of errors show up with failed connections to the server. Will re-trigger the commit another time.

@rchincha (Contributor):

@vrajashkr restarted the failing test.
