feat(scale-out): support for Artifact GQL with local storage scale out #3074
base: main
Conversation
A previous attempt at this was made in #2588. How is this different?
Note: Pagination is still pending.
	log log.Logger,
	gqlSchema *ast.Schema,
) func(handler http.Handler) http.Handler {
	proxyFunctionalityMap := map[string]GqlScaleOutHandlerFunc{
@rchincha this map statically stores which GQL operation needs which kind of handler.
There will be a couple of generic handlers such as fanout and some specific handlers if any of the operations need custom behavior.
What do you think about this approach?
It's better than last time since we don't need to maintain a separate handler for each operation type, but I'm open to more ideas on making this better.
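A minimal sketch of the idea, with hypothetical names (fanOutGqlHandler, proxyToTargetMemberGqlHandler) standing in for the generic handlers described above:

```go
package scaleout

import "net/http"

// GqlScaleOutHandlerFunc is the shape shared by all scale-out GQL handlers
// (signature assumed for illustration).
type GqlScaleOutHandlerFunc func(w http.ResponseWriter, r *http.Request)

// Stubs standing in for the generic handler implementations.
func fanOutGqlHandler(w http.ResponseWriter, r *http.Request)              {}
func proxyToTargetMemberGqlHandler(w http.ResponseWriter, r *http.Request) {}

// proxyFunctionalityMap statically routes each GQL operation to a strategy:
// fan-out queries every member and merges the results, while single-target
// forwards only to the member that owns the repository named in the query.
var proxyFunctionalityMap = map[string]GqlScaleOutHandlerFunc{
	"GlobalSearch":       fanOutGqlHandler,              // fan-out
	"ImageListForDigest": fanOutGqlHandler,              // fan-out
	"ImageList":          proxyToTargetMemberGqlHandler, // single target
	"ExpandedRepoInfo":   proxyToTargetMemberGqlHandler, // single target
}
```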
func deepMergeMaps(a, b map[string]any) map[string]any { |
@rchincha this is the new approach for aggregating the data. Since the response is a JSON, we can aggregate the data as a map type with individual logic for the embedded types - nested map, numeric types, and arrays.
This should be common across all handlers and may change when pagination comes into picture.
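A minimal sketch of such a merge, assuming JSON-decoded values (encoding/json yields map[string]any, []any, and float64); the per-type rules are illustrative:

```go
package scaleout

// deepMergeMaps combines two decoded GQL JSON responses: nested objects are
// merged recursively, arrays are concatenated, and numeric counts are summed.
func deepMergeMaps(a, b map[string]any) map[string]any {
	out := make(map[string]any, len(a))
	for k, v := range a {
		out[k] = v
	}

	for key, bVal := range b {
		aVal, exists := out[key]
		if !exists {
			out[key] = bVal

			continue
		}

		switch aTyped := aVal.(type) {
		case map[string]any:
			// nested objects merge recursively
			if bTyped, ok := bVal.(map[string]any); ok {
				out[key] = deepMergeMaps(aTyped, bTyped)
			}
		case []any:
			// arrays from each member are concatenated
			if bTyped, ok := bVal.([]any); ok {
				out[key] = append(append([]any{}, aTyped...), bTyped...)
			}
		case float64:
			// encoding/json decodes every JSON number as float64; counts add up
			if bTyped, ok := bVal.(float64); ok {
				out[key] = aTyped + bTyped
			}
		default:
			// scalars (strings, bools): keep the first member's value
		}
	}

	return out
}
```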
This sounds good. The only concern is whether we can validate the schema so that we never emit garbage.
Force-pushed from a5dbb43 to ac85801.
{
  "distSpecVersion": "1.1.0",
  "storage": {
    "rootDirectory": "./workspace/zot/data/mem1",
We have to figure out a scheme to append a member path.
Would like to have a single zot configuration that folks don't have to tweak.
I agree. The reason for this is mostly because I was starting 2 binaries on the same host for development (so I had to change the path and port). For an actual deployment, the config files would be identical.
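For context, a sketch of the development-style config in question; only rootDirectory and http.port would differ between the two local members, while the cluster member list stays identical. Values are placeholders, and the cluster block follows the shape of zot's existing scale-out configuration:

```json
{
  "distSpecVersion": "1.1.0",
  "storage": {
    "rootDirectory": "./workspace/zot/data/mem1"
  },
  "http": {
    "address": "127.0.0.1",
    "port": "9001"
  },
  "cluster": {
    "members": [
      "127.0.0.1:9001",
      "127.0.0.1:9002"
    ],
    "hashKey": "loremipsumdolors"
  }
}
```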
Are you able to run a zb benchmark to show things scaling up?
We did see some results when we developed scale-out for the dist-spec APIs; however, can zb benchmark GQL queries? I believe it was only for the dist-spec APIs, if I recall correctly.
Latest analysis of GQL queries:
CVEListForImage ✅
CVEDiffListForImages 🚨
ImageListForCVE ✅
ImageListWithCVEFixed ✅
ImageListForDigest ✅
RepoListWithNewestImage 🏗️
ImageList ✅
ExpandedRepoInfo ✅
GlobalSearch ✅
DerivedImageList 🚨
BaseImageList 🚨
Image ✅
Referrers ✅
StarredRepos 📝
BookmarkedRepos 📝
Force-pushed from 13f6084 to c991fe4.
Codecov Report
@@            Coverage Diff             @@
##             main    #3074      +/-   ##
==========================================
- Coverage   90.63%   90.48%   -0.15%
==========================================
  Files         182      187       +5
  Lines       32909    33177     +268
==========================================
+ Hits        29826    30020     +194
- Misses       2319     2379      +60
- Partials      764      778      +14
	for _, targetMember := range config.Cluster.Members {
		proxyResponse, err := proxy.ProxyHTTPRequest(request.Context(), request, targetMember, config)
		if err != nil {
Do we want to return failure even if just one member fails?
I think if one member responds, we should return that and swallow errors, maybe with some indicator somewhere. Maybe logs? HTTP status 206 Partial Content could also work since this is our own API.
That sounds like a good idea. One thought I had is that instead of swallowing the error, we could append an error to the errors list key in the GQL response and send it to the client so there is awareness of some error in the system.
The client can choose to ignore the error and use the valid data in the response, or, ideally, show the valid data while also indicating that there were some errors in processing. With this approach, status 206 could be the return status as you've suggested.
What do you think?
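A minimal sketch of that shape, assuming the fan-out handler has already merged the member responses into a map and collected the per-member failures (names here are illustrative, not the PR's actual code):

```go
package scaleout

import (
	"encoding/json"
	"net/http"
)

// writePartialGqlResponse returns whatever data was aggregated and surfaces
// per-member failures in the standard GQL "errors" list, using
// 206 Partial Content to signal an incomplete result.
func writePartialGqlResponse(
	w http.ResponseWriter,
	mergedData map[string]any,
	memberErrs []error,
) error {
	response := map[string]any{"data": mergedData}

	w.Header().Set("Content-Type", "application/json")

	if len(memberErrs) > 0 {
		gqlErrs := make([]map[string]any, 0, len(memberErrs))
		for _, err := range memberErrs {
			gqlErrs = append(gqlErrs, map[string]any{"message": err.Error()})
		}

		response["errors"] = gqlErrs
		// our own clients can treat 206 as "usable, but incomplete"
		w.WriteHeader(http.StatusPartialContent)
	}

	return json.NewEncoder(w).Encode(response)
}
```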
Some observations: Currently trying out implementing
Force-pushed from c991fe4 to 5696b3f.
Solved this by manually replacing the cveScanner with a mock instance based on other existing examples in the test source code.
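For readers unfamiliar with the pattern, a generic sketch of this kind of substitution; the interface and field names below are illustrative stand-ins, not zot's actual test helpers:

```go
package scaleout_test

import "context"

// Scanner stands in for the CVE scanner interface the controller depends on.
type Scanner interface {
	ScanImage(ctx context.Context, image string) ([]string, error)
}

// ScannerMock lets each test plug in canned behavior per method.
type ScannerMock struct {
	ScanImageFn func(ctx context.Context, image string) ([]string, error)
}

func (m ScannerMock) ScanImage(ctx context.Context, image string) ([]string, error) {
	if m.ScanImageFn != nil {
		return m.ScanImageFn(ctx, image)
	}

	return nil, nil
}
```

A test would then assign the mock to the controller's scanner field before issuing CVE-related GQL queries, so no real vulnerability scan runs during the test.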
5696b3f
to
9601786
Compare
All the artifact-related GQL queries are supported in the latest commit except for the operations that are explicitly marked as unsupported. Things that still need to be worked on:
Looks like an unrelated failure; I'll force-push once again to retrigger the tests.
Force-pushed from 9601786 to 7691c11.
Looks like the re-push had some more intermittent failures:
Will force-push another commit.
Force-pushed from 7691c11 to 286cdcd.
Force-pushed from 286cdcd to 3461e7b.
"zotregistry.dev/zot/pkg/proxy" | ||
) | ||
|
||
// ClusterProxy wraps an http.HandlerFunc which requires proxying between zot instances to ensure |
This moved from pkg/api/proxy.go to here? Reason?
This content was moved because the original api/proxy.go was split: the common functionality that the GQL proxy requires went into a new package called 'proxy' to avoid a cyclic dependency. The new file name here, cluster_proxy.go, was chosen arbitrarily.
Why is this change needed?
=============================
Currently, only dist-spec APIs support the scale-out proxy, with cloud-backed storage as well as local storage. If a user would like to use scale-out with only local storage, the metadata DB will be isolated per zot instance, and so would the repository data. Accordingly, GQL queries also need to be proxied to gather all the relevant information for zot clients.

What has changed?
=============================
- A wrapper was crafted around the GQL server that checks the GQL operation and performs the necessary proxying as required.
- All GQL logic itself is still handled by the gqlgen GQL server.
- A new GQL proxy was introduced to support the GQL proxying requirements.

How does it work?
=============================
Upon receiving a GQL request, the operation type is checked. Based on a static map, the GQL query is either proxied out to the other members in a fan-out fashion or proxied directly to the target member if the repository information is known. The results are received and aggregated dynamically based on datatypes.

Supported GQL queries
=============================
Currently, the following GQL queries are supported:
- GlobalSearch - fanout
- ImageList - proxy to target
- ExpandedRepoInfo - proxy to target
- CVEListForImage - proxy to target
- ImageListForCVE - fanout
- ImageListWithCVEFixed - proxy to target
- ImageListForDigest - fanout
- RepoListWithNewestImage - fanout
- Image - proxy to target
- Referrers - proxy to target

Unsupported GQL queries
==============================
- CVEDiffListForImages - needs metadata for both images on the handling server
- DerivedImageList - needs metadata for the argument image on the handling server
- BaseImageList - needs metadata for the argument image on the handling server
- StarredRepos - part of userprefs and needs additional consideration
- BookmarkedRepos - part of userprefs and needs additional consideration

What is yet to be done
==============================
- Pagination is entirely broken: if a client asks for 2 entries, it will get up to 2 entries from each cluster member when the request is a fanout type.
- Error handling needs to be standardized.
- Support for Partial Response is required.

Signed-off-by: Vishwas Rajashekar <[email protected]>
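A minimal sketch of the dispatch step described above, assuming the proxyFunctionalityMap sketched earlier; gqlparser (which gqlgen builds on) resolves the top-level field name used as the routing key, and unrecognized operations fall through to the local gqlgen server:

```go
package scaleout

import (
	"net/http"

	"github.com/vektah/gqlparser/v2"
	"github.com/vektah/gqlparser/v2/ast"
)

// dispatchGqlRequest routes a GQL request either to a scale-out handler or
// to the local gqlgen server when the operation is unknown or unsupported.
// Reading the query from the URL is illustrative; a real implementation
// would also handle POST bodies.
func dispatchGqlRequest(
	w http.ResponseWriter,
	r *http.Request,
	gqlSchema *ast.Schema,
	next http.Handler,
) {
	query := r.URL.Query().Get("query")

	doc, errs := gqlparser.LoadQuery(gqlSchema, query)
	if len(errs) > 0 || doc == nil || len(doc.Operations) == 0 {
		next.ServeHTTP(w, r) // let the gqlgen server report the error

		return
	}

	// The top-level field name (e.g. "GlobalSearch") is the routing key.
	for _, selection := range doc.Operations[0].SelectionSet {
		if field, ok := selection.(*ast.Field); ok {
			if handler, ok := proxyFunctionalityMap[field.Name]; ok {
				handler(w, r)

				return
			}
		}
	}

	next.ServeHTTP(w, r)
}
```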
Force-pushed from 3461e7b to 40086a3.
Couple of updates in the last push:
While unit tests are working, manual testing hit a snag. The error seen is:
At this point, it would seem that there is a need for some way to distribute the session data across the members as well, but this needs further discussion.
Ecosystem client tools tests appear to have failed. I see:
After that message, a ton of errors show up with failed connections to the server. Will re-trigger the commit another time.
@vrajashkr restarted the failing test.
What type of PR is this?
feature
Which issue does this PR fix:
Towards #2434
What does this PR do / Why do we need it:
Previously, only dist-spec APIs were supported for scale-out: in a shared storage environment, the metadata is shared, so any instance can correctly respond to GQL queries because all the data is available.
In a local scale-out cluster deployment, the metadata store, in addition to the file storage, is isolated to each member of the cluster. Due to this, GQL queries also need to be proxied for UI and client requests to work as expected.
This change introduces a new GQL proxy + a generic fan-out handler for GQL requests.
Testing done on this change:
Unit testing for the supported GQL operations as part of the PR.
TODO manual testing.
Will this break upgrades or downgrades?
No
Does this PR introduce any user-facing change?:
Yes
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.