Skip to content

backend: caching: implemented caching for single-cluster#3499

Closed
upsaurav12 wants to merge 42 commits intokubernetes-sigs:mainfrom
upsaurav12:caching-pagination-search
Closed

backend: caching: implemented caching for single-cluster#3499
upsaurav12 wants to merge 42 commits intokubernetes-sigs:mainfrom
upsaurav12:caching-pagination-search

Conversation

@upsaurav12
Copy link
Contributor

@upsaurav12 upsaurav12 commented Jun 20, 2025

Implementation

I have implemented caching for the single-cluster. For multi-cluster i am trying to check by creating user by changing use-context.

Here how i have implemented:

  1. I am capturing response during ProxyRequest and storing it in cache in string.
  2. Then I am changing the response from gzip to byte and then to string.
  3. If the user is making request for the first time it will first first selfsubjectrulesreviews and if it returns Failure then it is not authenticated. If doesn't then process to acquire from cache.
  4. A new key will be generated for new kind and namespace and clusters and it will be same if the user is making the same requests.
  5. If key is not present in cache then it will go for actual request to k8's and then store in cache.

However more things i have to add like cache should be synced with the latest k8's value but i am finding solution whether i should go for countinuously make requests to k8's after 5s or fixed time oruse watch events from k8's.

Test Results

=== RUN   TestInitialize
=== RUN   TestInitialize/initializes_responseCapture_with_defaults
--- PASS: TestInitialize/initializes_responseCapture_with_defaults (0.00s)
--- PASS: TestInitialize (0.00s)
=== RUN   TestExtractNamespace
=== RUN   TestExtractNamespace/valid_url_with_namespace
--- PASS: TestExtractNamespace/valid_url_with_namespace (0.00s)
=== RUN   TestExtractNamespace/empty_url
--- PASS: TestExtractNamespace/empty_url (0.00s)
=== RUN   TestExtractNamespace/invalid_url_format
--- PASS: TestExtractNamespace/invalid_url_format (0.00s)
--- PASS: TestExtractNamespace (0.00s)
=== RUN   TestGetResponseBody
=== RUN   TestGetResponseBody/valid_response
--- PASS: TestGetResponseBody/valid_response (0.00s)
=== RUN   TestGetResponseBody/empty_Response
--- PASS: TestGetResponseBody/empty_Response (0.00s)
=== RUN   TestGetResponseBody/empty_contentType
--- PASS: TestGetResponseBody/empty_contentType (0.00s)
--- PASS: TestGetResponseBody (0.00s)
=== RUN   TestGenerateKey
=== RUN   TestGenerateKey/url_was_valid_
--- PASS: TestGenerateKey/url_was_valid_ (0.00s)
=== RUN   TestGenerateKey/empty_cluster
--- PASS: TestGenerateKey/empty_cluster (0.00s)
--- PASS: TestGenerateKey (0.00s)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 20, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: upsaurav12
Once this PR has been reviewed and has the lgtm label, please assign joaquimrocha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 20, 2025
@illume
Copy link
Contributor

illume commented Jun 23, 2025

Nice one.

ps. See commit guidelines, and run make backend-lint to see lint issues localy.

Copy link
Contributor

@illume illume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start. Thanks.

Would you mind separating the code out from headlamp.go? Code can be put into a separate module/package, and then headlamp.go can import it and use it. We are in the process of moving as much code out of headlamp.go as possible into package modules. It's all a bit too much code tangled up in there.

Could you please add tests? As well, please add documentation to your functions.

I left a comment about commit message guidelines and running backend-lint.

@upsaurav12
Copy link
Contributor Author

upsaurav12 commented Jun 25, 2025

@illume I have added extracted the cache implementation from headlamp.go. I am sorry i wasn't able to write tests because my system is running slow will write the code and push them

I am not sure that this implementation might work for multiple users however i am finding way to test it with services accounts but in development mode it is directly fetching from kubeconfig file and it loads clusters directly without asking for bearer token

@upsaurav12 upsaurav12 force-pushed the caching-pagination-search branch 4 times, most recently from 2452767 to 9aa03d1 Compare June 26, 2025 12:29
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 27, 2025
@illume
Copy link
Contributor

illume commented Jun 27, 2025

How about making an own package for pkg/k8cache? I think maybe it should not go inside pkg/cache, since it's topic is slightly different than what pkg/cache already does.

@upsaurav12
Copy link
Contributor Author

upsaurav12 commented Jun 27, 2025

How about making an own package for pkg/k8cache? I think maybe it should not go inside pkg/cache, since it's topic is slightly different than what pkg/cache already does.

I was thinking the same but eventually had to impelment in the same package because of cache initialization and ease of importing functions.

I will. modify it.

@upsaurav12
Copy link
Contributor Author

I have moved the code to pkg/k8cache as package.

@upsaurav12
Copy link
Contributor Author

upsaurav12 commented Jun 28, 2025

Hi I have just tested this cache implementation with different serviceaccounts as a users. for example service accounts have no-access to cluster resources, user having limited access and user having full access. From my end it is working fine like showing the data that the user is authorized.

Screen.Recording.-.Jun.28.2025-VEED.mp4

However still there are still some points that is need to be fixed/address.

  1. Initialization of Cache:- Currently i am initializing the cache as a global variable which i am not sure that it would be ideal approach for this project specifically. it is fine to initialize as it?
  2. Synchronization of Cache data :- We can have two options Watch API and Polling to K8;s at a fixed time interval. which approach should be the ideal for the project ?
  3. Use of token :- Would it be good approach to use token in the key generation because here i am already using SSAR request to authorize the user but in K8's dashboard PR they had used it.

Copy link
Contributor

@illume illume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few notes.

@illume illume requested review from yolossn and removed request for joaquimrocha and skoeva July 3, 2025 08:13
@upsaurav12
Copy link
Contributor Author

I left a few notes.

I am sorry i could get these replies. i don't know why but they weren't visible to me?
thanks for suggestion i have made some changes for your requested changes.

@illume
Copy link
Contributor

illume commented Jul 4, 2025

I am sorry i could get these replies. i don't know why but they weren't visible to me?

No worries. I think I wrote them, and then few days came back and noticed they were still pending and submitted the review. I'm not sure why they kept the old date on them, I think it's a github bug. That's why they weren't visible to you - I didn't submit the review yet.

Co-authored-by: S Santhosh Nagaraj <nssvlr@gmail.com>
Signed-off-by: Saurav Upadhyay <116784047+upsaurav12@users.noreply.github.com>
Comment on lines +104 to +108
ctx, span := telemetry.CreateSpan(ctx, r, "cluster-api", "handleClusterAPI",
attribute.String("cluster", mux.Vars(r)["clusterName"]),
)

defer span.End()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this span to be instrumented. @illume wdyt? this function just gets the context key and context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have added because i was getting linting error of length of CacheMiddleware function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU telemetry doesn't have anything to do with linting issue. Can you share the exact error that you faced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, i am talking about funlen linting issue, it was saying that the function length is >60 lines of code, that's why i made a saparate function that will return contextKey , KContext, span and ctx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate function is okay, not every new function needs to have a telemetry span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay understood, should i need to remove telemetry and span for some new functions such as ReturnAuthErrRespone , ReturnAfterAuthError etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now, lets remove this.

// RequestK8ClusterAPIAndStore ensures if the key was not found inside the cache then this will make actual call to k8's
// and this will capture the response body and convert the captured response to string.
// After converting it will store the response with the key and TTL of 10*min.
func RequestK8ClusterAPIAndStore(k8scache cache.Cache[string],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't make a request to K8ClusterAPI, please improve the function name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@illume said that previous one was good, so can we change the name to previous one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would RequestToK8AndStore be good for you @yolossn

Comment on lines 175 to 178
err = k8cache.CheckAndPurge(w, r, k8scache, next, rcw, isAllowed)
if err != nil {
c.handleError(w, ctx, span, err, "error while purging data", http.StatusInternalServerError)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here, It is not clear why the Check is done here.

Comment on lines 486 to 500
// CheckAndPurge check whether the r is POST, if the request was POST then it will let to PURGING DATA.
func CheckAndPurge(w http.ResponseWriter, r *http.Request, k8scache cache.Cache[string],
next http.Handler, rcw *responseCapture, isAllowed bool,
) error {
if r.Method != http.MethodPost {
return nil
}

err := PurgeDataForPostRequest(w, r, k8scache, next, rcw, isAllowed)
if err != nil {
return err
}

return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain the context behind purging post request alone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was adding for more methods but from the last meet you asked for only POST

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is a misunderstanding here, all kind of update request that can make the cache stale should be handled here. ie put, post, delete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now i am pushing the changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also synchronously looping through every request in cache and deleting the request at the end of each update request will make it slower. Also the method type check in the cacheMiddleware is removed. Almost all of the update request (put,post,delete) will be unique and we need not have to check them in the cache.

Copy link
Contributor Author

@upsaurav12 upsaurav12 Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yolossn @illume
would this work ?

// PurgeDataForPostRequest remove the stale data which is related to POST request.
// It uses UnmarshalCacheData to filter the response for purging the resource data.
// It also uses GetAll method from cache.Cache to retrieve all the data that was stored.
func PurgeDataForPostRequest(w http.ResponseWriter, r *http.Request, k8scache cache.Cache[string],
	next http.Handler, rcw *responseCapture, isAllowed bool,
) error {
	lasts, _ := GetKindAndVerb(r)

	last := Singular(lasts)

	go func() error {

		// start := time.Now()
		val, err := k8scache.GetAll(context.Background(), nil)
		if err != nil {
			return err
		}

		for key, v := range val {
			var cachedData CachedResponseData

			_, err := UnmarshalCacheData(v, cachedData)
			if err != nil {
				return err
			}

			err = ProcessCachedItem(key, v, last, k8scache, next, w, r, rcw, isAllowed)
			if err != nil {
				return err
			}
		}

		// elapsed := time.Since(start)
		// fmt.Println("elasped : ", elapsed)

		return nil
	}()

	return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have made some changes in the code

@upsaurav12 upsaurav12 marked this pull request as draft August 4, 2025 05:28
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 4, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

upsaurav12 and others added 4 commits August 10, 2025 12:20
Co-authored-by: S Santhosh Nagaraj <nssvlr@gmail.com>
Signed-off-by: Saurav Upadhyay <116784047+upsaurav12@users.noreply.github.com>
Co-authored-by: S Santhosh Nagaraj <nssvlr@gmail.com>
Signed-off-by: Saurav Upadhyay <116784047+upsaurav12@users.noreply.github.com>
Copy link
Contributor

@skoeva skoeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an updated PR for this, will close it for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Development

Successfully merging this pull request may close these issues.

5 participants