feat: refactor cleanup cronjob to use pagination and reduce memory usage#411
feat: refactor cleanup cronjob to use pagination and reduce memory usage#411davidmogar wants to merge 1 commit into
Conversation
Review Summary by Qodo(Agentic_describe updated until commit f49ddc7)Refactor cleanup cronjob with pagination and memory optimization
WalkthroughsDescription• Replace AWK-based processing with kubectl pagination API for efficient batch handling • Process resources in batches of 100 using --chunk-size parameter • Stream JSON output through jq to minimize memory footprint • Increase memory limit from 256Mi to 512Mi for batch processing • Add deletion and error counters for improved observability Diagramflowchart LR
A["kubectl get with --chunk-size=100"] --> B["JSON output to temp file"]
B --> C["jq stream processing"]
C --> D["Parse name:namespace:completionTime"]
D --> E["Date conversion to epoch"]
E --> F["kubectl delete with counters"]
F --> G["Summary with DELETED_COUNT and ERROR_COUNT"]
File Changes1. components/internal-services/internal-staging/cronjob/cleanup-internal-requests-pipelineruns.yaml
|
Code Review by Qodo
1. Tempfiles on read-only FS
|
|
Where is your Assisted-by line? ;) |
| CONTINUE="" | ||
| while true; do | ||
| # Build kubectl command with limit and continue token | ||
| CMD="kubectl get ${CR_TYPE} -n ${CR_NAMESPACE} -l ${LABELS} --limit=${BATCH_SIZE}" |
There was a problem hiding this comment.
kubectl get does not have a --limit flag unfortunately, it has a --chunk-size which still returns all items, but in chunks, which I don't think is what we want here.
Perhaps we can use something like:
kubectl get --raw "/api/v1/{cr_type}?limit=${BATCH_SIZE}"
and parse the raw response, to fetch the continue token and do the whole pagination manually.
There was a problem hiding this comment.
I changed it to chunk-size.
There was a problem hiding this comment.
Using chunk-size would still keep all items in memory in the $output var, so we would hit the same issue here I think.
There was a problem hiding this comment.
so we would hit the same issue here
I guess that depends on what was eating the resources, right? the only improvement with --chunk-size is that kubectl itself should perform better, but the rest will be exactly the same.
There was a problem hiding this comment.
I am not sure how chunk size works, if it loads everything and just display in chunks, probably the issue will continue. If this does not solve, we could try adding the completionTIme as a customfiled selector to the CR` so we might be able to "pre-filter" the internalrequests: and not need the mechanics we are doing now:
Reference: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/#custom-resources-fields
There was a problem hiding this comment.
The field selector supports ==, != so "maybe" we might be able to filter with the desired date (I did not test, though).
There was a problem hiding this comment.
It seem chunk-size will solve it:
The Mechanism
Instead of making one massive, stressful API call that requests all 100,000 items at once (which could time out or crash the API server), kubectl breaks the request into 100 smaller sequential batches.
kubectl asks the API server for the first 1,000 items.
The server sends them back along with a continue token.
kubectl prints those 1,000 items to your terminal, then immediately uses the token to ask for the next 1,000.
This loops automatically until all 100,000 items are printed.
The Bottom Line: --chunk-size is a tool for performance and reliability, not for truncating your terminal output.
There was a problem hiding this comment.
yeah, so this confirms what I said earlier:
I guess that depends on what was eating the resources, right? the only improvement with --chunk-size is that kubectl itself should perform better, but the rest will be exactly the same.
| CONTINUE="" | ||
| while true; do | ||
| # Build kubectl command with limit and continue token | ||
| CMD="kubectl get ${CR_TYPE} -n ${CR_NAMESPACE} -l ${LABELS} --limit=${BATCH_SIZE}" |
There was a problem hiding this comment.
--limit seems to be an hallucination
➜ kubectl version
Client Version: v1.35.5
Kustomize Version: v5.7.1
Server Version: v1.35.0
➜ kubectl get pods --limit=4
error: unknown flag: --limit
See 'kubectl get --help' for usage.
There was a problem hiding this comment.
Totally! Moved to chunk-size.
| BATCH_OUTPUT=$(mktemp -p /var/tmp) | ||
| if ! eval "$CMD" > "$BATCH_OUTPUT" 2>&1; then | ||
| echo "ERROR: failed to list ${CR_TYPE} resources" | ||
| rm -f "$BATCH_OUTPUT" |
There was a problem hiding this comment.
why bother removing the file? it's a container.
There was a problem hiding this comment.
The code changed but I'd say nothing wrong removing the file here. Happy to remove the cleanup lines if you prefer.
There was a problem hiding this comment.
I'd say it's just redundant - we just never do that. but if you prefer to keep it, fine with me.
|
|
||
| # Check for continuation token | ||
| CONTINUE=$(jq -r '.metadata.continue // empty' "$BATCH_OUTPUT") | ||
| rm -f "$BATCH_OUTPUT" |
There was a problem hiding this comment.
why bother removing the file? it's a container.
There was a problem hiding this comment.
The code changed but I'd say nothing wrong removing the file here. Happy to remove the cleanup lines if you prefer.
There was a problem hiding this comment.
I'd say it's just redundant - we just never do that. but if you prefer to keep it, fine with me.
| fi | ||
|
|
||
| # Process each item in the batch individually to minimize memory | ||
| jq -r '.items[] | select(.status.completionTime != null) | "\(.metadata.name):\(.metadata.namespace):\(.status.completionTime)"' "$BATCH_OUTPUT" | \ |
There was a problem hiding this comment.
jq | while runs the while loop. DELETED_COUNT and ERROR_COUNT will always be 0 in the final log, regardless of actual deletions. Since it runs in a subshell.
Guess something like:
while IFS=: read -r cr_name cr_namespace completion_time; do
done < <(jq -r <> "$BATCH_OUTPUT")
| echo "ERROR: namespace=${namespace} ${output}" | ||
| # Check if old enough to delete | ||
| if [ "$SECONDS_BACK" -gt "$completion_epoch" ]; then | ||
| if kubectl delete "${CR_TYPE}" "${cr_name}" -n "${cr_namespace}" --wait=false --timeout=30s 2>&1; then |
There was a problem hiding this comment.
The old code outputs the delete log. Now we don't show the output, we won't see the reason it failed.
There was a problem hiding this comment.
Now it's displayed again.
The PR can be garbage and I still need to work on it and test it, but the commit contains |
|
Code review by qodo was updated up to the latest commit f49ddc7 |
|
Sean kindly tested this PR (thanks Sean) and got a successful run: |
| KUBECTL_OUTPUT=$(mktemp) | ||
| KUBECTL_ERR=$(mktemp) |
There was a problem hiding this comment.
1. Tempfiles on read-only fs 🐞 Bug ☼ Reliability
The script now uses mktemp without targeting the writable /var/tmp mount while readOnlyRootFilesystem: true, so temp file creation (and the kubectl get redirect) can fail and break the cleanup job. This is a regression vs the internal-production manifest which explicitly places temp files under /var/tmp.
Agent Prompt
### Issue description
The cronjob creates temp files via `mktemp` with no directory, which defaults to `/tmp`. This CronJob runs with `readOnlyRootFilesystem: true` and only mounts an `emptyDir` at `/var/tmp`, so creating temp files in `/tmp` is likely to fail.
### Issue Context
The manifest already mounts `/var/tmp` specifically for scratch space, and the internal-production variant uses `mktemp -p /var/tmp`.
### Fix Focus Areas
- components/internal-services/internal-staging/cronjob/cleanup-internal-requests-pipelineruns.yaml[36-56]
- components/internal-services/internal-staging/cronjob/cleanup-internal-requests-pipelineruns.yaml[80-99]
### Suggested change
- Set `TMPDIR=/var/tmp` or update both tempfiles to `mktemp -p /var/tmp` (for `KUBECTL_OUTPUT` and `KUBECTL_ERR`).
- (Optional hardening) Add a `trap` to delete temp files on exit.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
lgtm :) |
|
lgtm |
I know, but I thought we were specifically required to use Assisted-by: |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: davidmogar, theflockers The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Replace AWK-based processing with kubectl pagination API to handle large resource sets more efficiently. Process resources in batches of 100 using continuation tokens, and stream through jq to minimize memory footprint. Increase memory limit from 256Mi to 512Mi to accommodate batch processing. Add deletion and error counters for better observability. Generated-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: David Moreno García <damoreno@redhat.com>
|
New changes are detected. LGTM label has been removed. |
Replace AWK-based processing with kubectl pagination API to handle large resource sets more efficiently. Process resources in batches of 100 using continuation tokens, and stream through jq to minimize memory footprint. Increase memory limit from 256Mi to 512Mi to accommodate batch processing. Add deletion and error counters for better observability.
Generated-By: Claude Sonnet 4.5 noreply@anthropic.com