
feat/gql-proxy: new approach for GraphQL local scale out #3074


Draft · vrajashkr wants to merge 4 commits into main from feat/gql-scale-out-v2

Conversation

vrajashkr (Contributor)

What type of PR is this?
feature

Which issue does this PR fix:
Towards #2434

What does this PR do / Why do we need it:
Previously, only dist-spec APIs were supported for scale-out: in a shared storage environment, the metadata is shared, so any instance can correctly respond to GQL queries because all of the data is available to it.

In a local scale-out cluster deployment, the metadata store, in addition to the file storage, is isolated to each member of the cluster. Because of this, GQL queries also need to be proxied for UI and client requests to work as expected.

This change introduces a new GQL proxy + a generic fan-out handler for GQL requests.

Testing done on this change:
Manual testing with a local setup so far; automated testing is still TODO.

Will this break upgrades or downgrades?
No

Does this PR introduce any user-facing change?:
TODO


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

vrajashkr (Contributor, Author) commented Apr 6, 2025

A previous attempt at this was made in #2588

How is this different?

  • Instead of a dedicated handler for each operation, this PR explores an approach with generic and specific handlers (refer to [Feat]: Scale-out cluster support for independent per-instance storage deployments #2434 (comment)).
  • Generic handlers cater to common use cases such as a fan-out proxy for GQL or single-target proxying.
  • Specific handlers cater to use cases that do not fit a generic handler.
  • GQL schema merging was done manually in the previous PR. In this PR, a generic approach dynamically merges the responses by treating them as maps instead of dedicated structs. This keeps the code reusable across all handler types and reduces the amount of code to maintain.

Note: pagination is still pending.
Note: the code itself is still in a proof-of-concept state; quite a bit of cleanup is needed.

	log log.Logger,
	gqlSchema *ast.Schema,
) func(handler http.Handler) http.Handler {
	proxyFunctionalityMap := map[string]GqlScaleOutHandlerFunc{
vrajashkr (Contributor, Author):

@rchincha this map statically stores which GQL operation needs which kind of handler.

There will be a couple of generic handlers such as fanout and some specific handlers if any of the operations need custom behavior.

What do you think about this approach?

It's better than last time since we don't need to maintain a separate handler for each operation type, but I'm open to more ideas on making this better.
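
For illustration, a minimal sketch of what such a static routing table could look like, assuming GqlScaleOutHandlerFunc is a plain HTTP handler and that fanOutGqlHandler/proxyOnceGqlHandler are the two generic constructors; the names, signatures, and entries here are assumptions, not the code in this PR:

```go
package gqlproxy

import "net/http"

// GqlScaleOutHandlerFunc is assumed here to be a plain HTTP handler for a
// single GraphQL operation; the real signature in the PR may differ.
type GqlScaleOutHandlerFunc func(w http.ResponseWriter, r *http.Request)

// Hypothetical constructors for the two generic behaviours described above.
func fanOutGqlHandler() GqlScaleOutHandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// query every cluster member and merge the responses
	}
}

func proxyOnceGqlHandler() GqlScaleOutHandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// forward the request to the single member that owns the data
	}
}

// Illustrative routing table: GQL operation name -> handler kind.
var proxyFunctionalityMap = map[string]GqlScaleOutHandlerFunc{
	"GlobalSearch":    fanOutGqlHandler(),    // results must be merged across members
	"StarredRepos":    fanOutGqlHandler(),    // results must be merged across members
	"ImageList":       proxyOnceGqlHandler(), // the repo argument identifies a single target
	"CVEListForImage": proxyOnceGqlHandler(), // the image argument identifies a single target
	// operations that need custom behaviour get dedicated, specific handlers here
}
```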

}
}

func deepMergeMaps(a, b map[string]any) map[string]any {
vrajashkr (Contributor, Author):

@rchincha this is the new approach for aggregating the data. Since the response is JSON, we can aggregate the data as a map with individual logic for the embedded types: nested maps, numeric types, and arrays.

This should be common across all handlers and may change when pagination comes into the picture.
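
A minimal sketch of the merge strategy described here, assuming JSON-decoded responses (map[string]any); how numeric values are combined (summed below) and how mismatched types are resolved are assumptions, not necessarily what this PR does:

```go
package gqlproxy

// deepMergeMaps sketches the described strategy: nested maps merge
// recursively, arrays concatenate, and numeric values (float64 after JSON
// decoding) are summed; anything else falls back to the value from b.
func deepMergeMaps(a, b map[string]any) map[string]any {
	result := make(map[string]any, len(a))
	for key, val := range a {
		result[key] = val
	}

	for key, bVal := range b {
		aVal, exists := result[key]
		if !exists {
			result[key] = bVal

			continue
		}

		switch aTyped := aVal.(type) {
		case map[string]any:
			if bTyped, ok := bVal.(map[string]any); ok {
				result[key] = deepMergeMaps(aTyped, bTyped)
			}
		case []any:
			if bTyped, ok := bVal.([]any); ok {
				result[key] = append(aTyped, bTyped...)
			}
		case float64:
			if bTyped, ok := bVal.(float64); ok {
				result[key] = aTyped + bTyped
			}
		default:
			result[key] = bVal
		}
	}

	return result
}
```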

Contributor:

This sounds good. The only concern is whether we can validate the schema so that we never emit garbage.

vrajashkr force-pushed the feat/gql-scale-out-v2 branch from a5dbb43 to ac85801 on April 6, 2025 at 20:17
{
  "distSpecVersion": "1.1.0",
  "storage": {
    "rootDirectory": "./workspace/zot/data/mem1",
Contributor:

We have to figure out a scheme to append a member path.
Would like to have a single zot configuration that folks don't have to tweak.

vrajashkr (Contributor, Author) commented Apr 17, 2025:

I agree. The reason for this is mostly that I was starting two binaries on the same host for development (so I had to change the path and port). For an actual deployment, the config files would be identical.

vrajashkr (Contributor, Author):

Latest analysis of GQL queries (a sketch of proxy-once target selection follows this list):

CVEListForImage

  • Proxy-once to target
  • uses "repository:tag" or "repository@digest"

CVEDiffListForImages

  • image and compared image names in "repository:tag" or "repository@digest" format.
  • challenging as the images may be on different servers.

ImageListForCVE

  • fan-out to all members

ImageListWithCVEFixed

  • fan-out to all members

ImageListForDigest

  • fan-out to all members

RepoListWithNewestImage

  • fan-out to all members

ImageList

  • use the repo param to proxy once to the target server.

ExpandedRepoInfo

  • use the repo param to proxy once to the target server.

GlobalSearch

  • fan-out to all members
  • future optimization: if the repo names are known, these don't need to be fanned out.

DerivedImageList

  • uses "repository:tag" format.
  • however, the expected output is all images derived from the given one, so a fan-out is needed.

BaseImageList

  • image uses "repository:tag" format.
  • as the expectation is to get all base images, a fan-out is needed.

Image

  • image uses "repository:tag" format
  • use repo value to proxy once to target

Referrers

  • has a repo as an argument.
  • use repo value to proxy once to target

StarredRepos

  • fan-out to all members

BookmarkedRepos

  • fan-out to all members
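
Based on this analysis, the proxy-once cases need a way to map an image or repo argument to its owning member. A minimal sketch under the assumption that the repo name determines the target; the FNV hash below is purely illustrative, and the PR would presumably reuse the member-selection logic of the existing dist-spec proxy:

```go
package gqlproxy

import (
	"hash/fnv"
	"strings"
)

// repoFromImageArg extracts the repository part from a "repository:tag" or
// "repository@digest" argument, as used by the proxy-once operations above.
func repoFromImageArg(image string) string {
	if idx := strings.IndexAny(image, "@:"); idx != -1 {
		return image[:idx]
	}

	return image
}

// targetMemberForRepo picks the cluster member assumed to own a repository.
// FNV-1a here is only for illustration; the actual hash must match whatever
// the existing dist-spec scale-out proxy uses so both paths agree on ownership.
func targetMemberForRepo(repo string, members []string) string {
	hasher := fnv.New32a()
	hasher.Write([]byte(repo))

	return members[int(hasher.Sum32())%len(members)]
}
```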

vrajashkr force-pushed the feat/gql-scale-out-v2 branch from ac85801 to 13f6084 on April 28, 2025 at 19:58
vrajashkr force-pushed the feat/gql-scale-out-v2 branch from 13f6084 to c991fe4 on April 29, 2025 at 15:07

codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 69.03915% with 87 lines in your changes missing coverage. Please review.

Project coverage is 90.64%. Comparing base (293f424) to head (c991fe4).

Files with missing lines | Patch % | Lines
pkg/extensions/search/gql_proxy/gql_proxy.go | 64.47% | 23 Missing and 4 partials ⚠️
...search/gql_proxy/generic_proxy_once_gql_handler.go | 40.00% | 17 Missing and 4 partials ⚠️
...ns/search/gql_proxy/generic_fan_out_gql_handler.go | 50.00% | 12 Missing and 3 partials ⚠️
pkg/extensions/search/gql_proxy/handler_utils.go | 73.46% | 11 Missing and 2 partials ⚠️
pkg/api/cluster_proxy.go | 79.24% | 9 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3074      +/-   ##
==========================================
- Coverage   90.79%   90.64%   -0.15%     
==========================================
  Files         172      177       +5     
  Lines       32385    32584     +199     
==========================================
+ Hits        29404    29536     +132     
- Misses       2242     2298      +56     
- Partials      739      750      +11     


for _, targetMember := range config.Cluster.Members {
	proxyResponse, err := proxy.ProxyHTTPRequest(request.Context(), request, targetMember, config)
	if err != nil {
Contributor:

Do we want to return a failure even if just one member fails?
I think if one member responds, we should return that and swallow the errors, maybe with some indicator somewhere, such as logs. HTTP status 206 Partial Content could also work since this is our own API.

vrajashkr (Contributor, Author):

That sounds like a good idea. One thought I had is that instead of swallowing the error, perhaps we could append an error to the errors list key in the GQL response and send it to the client so there is awareness of an error in the system.

The client can choose to ignore the error and use the valid data in the response, or, ideally, show the valid data while also indicating that there were errors during processing. With this approach, 206 could be the return status, as you've suggested.

What do you think?
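
A minimal sketch of the behaviour discussed here, assuming the fan-out handler collects per-member errors alongside the merged data; the helper name and error shape are illustrative, not the code in this PR:

```go
package gqlproxy

import (
	"encoding/json"
	"net/http"
)

// gqlError mirrors the standard GraphQL error object shape.
type gqlError struct {
	Message string `json:"message"`
}

// writeMergedResponse returns the merged data from the members that answered,
// appends one GraphQL error entry per failed member, and signals partial
// results with HTTP 206 instead of failing the whole request.
func writeMergedResponse(w http.ResponseWriter, mergedData map[string]any, memberErrs []error) {
	response := map[string]any{"data": mergedData}

	if len(memberErrs) > 0 {
		gqlErrs := make([]gqlError, 0, len(memberErrs))
		for _, err := range memberErrs {
			gqlErrs = append(gqlErrs, gqlError{Message: err.Error()})
		}

		response["errors"] = gqlErrs
	}

	w.Header().Set("Content-Type", "application/json")

	status := http.StatusOK
	if len(memberErrs) > 0 {
		status = http.StatusPartialContent // 206: some members failed
	}
	w.WriteHeader(status)

	_ = json.NewEncoder(w).Encode(response)
}
```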
