
fix: batch-limit stale object cleanup + bump litellm-enterprise to 0.1.37#25264

Open
ishaan-berri wants to merge 4 commits into main from fix/stale-object-cleanup-batch-limit

Conversation

@ishaan-berri
Contributor

Relevant issues

Pre-Submission checklist

  • I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

🐛 Bug Fix
🚄 Infrastructure

Changes

  • Adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000) to litellm/constants.py
  • Rewrites the stale managed object cleanup in check_responses_cost.py to use a single bounded SQL query (UPDATE ... SET status WHERE id IN (SELECT id ... LIMIT batch_size)) instead of an unbounded update_many across every stale row, preventing a single poll cycle's query from touching hundreds of thousands of rows on large tables
  • Bumps litellm-enterprise to 0.1.37 in pyproject.toml and requirements.txt
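Based on the description above (default 1000, env-var override, floored at 1 per the review notes), the new constant presumably looks something like this sketch; the env-var name is an assumption:

```python
import os

# Hypothetical sketch of the constant added to litellm/constants.py:
# default 1000, overridable via environment variable, never allowed below 1.
STALE_OBJECT_CLEANUP_BATCH_SIZE = max(
    1, int(os.getenv("STALE_OBJECT_CLEANUP_BATCH_SIZE", "1000"))
)
```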

Configurable batch limit (default 1000) for stale managed object cleanup,
preventing unbounded UPDATE queries from hitting 300K+ rows at once.
Two fixes to _cleanup_stale_managed_objects:

1. Replace unbounded update_many with a single execute_raw using a
   subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE
   rows. Zero rows loaded into Python memory — everything stays in Postgres.
   Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py
   (the proxy requires PostgreSQL per schema.prisma).

2. Extract _expire_stale_rows as a separate method for testability.

Keeps the file_purpose='response' filter to avoid incorrectly expiring
long-running batch or fine-tune jobs that legitimately exceed the
staleness cutoff.
@vercel

vercel bot commented Apr 7, 2026

The latest updates on your projects.

Project | Deployment | Actions | Updated (UTC)
litellm | Ready | Preview, Comment | Apr 7, 2026 4:02am


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@codspeed-hq
Contributor

codspeed-hq bot commented Apr 7, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing fix/stale-object-cleanup-batch-limit (caa4b96) with main (7a9a9f0)


@codecov

codecov bot commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@greptile-apps
Contributor

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR fixes an unbounded stale-object cleanup query that could lock or overwhelm the database when a large backlog of LiteLLM_ManagedObjectTable rows exists. It replaces the previous update_many (no row limit) with a bounded cleanup capped at STALE_OBJECT_CLEANUP_BATCH_SIZE rows per invocation, adds a configurable constant (default 1,000), and bumps litellm-enterprise to 0.1.37.

  • The performance motivation is valid: the old update_many had no LIMIT and could touch hundreds of thousands of rows in a single query
  • The new _expire_stale_rows() method is correctly isolated for testability, and STALE_OBJECT_CLEANUP_BATCH_SIZE is properly bounded with max(1, ...) and configurable via env var
  • The implementation uses execute_raw with hand-written SQL, which regresses from the Prisma model-methods rule that this file previously followed — a conforming two-query approach (find_many(..., take=batch_size, select={"id": True}) + update_many) is viable and avoids raw SQL
  • No unit tests were added for the new method despite this being a stated hard requirement in CLAUDE.md

Confidence Score: 4/5

Safe to merge with minor concerns — SQL logic is correct and the performance fix is valid, but raw SQL regresses coding standards and no tests were added

All findings are P2, but execute_raw directly regresses a rule this same file previously followed correctly, and the missing tests violate a stated hard requirement in CLAUDE.md — warranting a 4 rather than 5

enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py

Important Files Changed

Filename Overview
enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py Replaces update_many with bounded execute_raw SQL to add LIMIT support; regresses no-raw-SQL rule and lacks unit tests
litellm/constants.py Adds configurable STALE_OBJECT_CLEANUP_BATCH_SIZE constant with env-var override and lower bound of 1
pyproject.toml Bumps litellm-enterprise optional dependency from 0.1.36 to 0.1.37
requirements.txt Bumps litellm-enterprise pin from 0.1.36 to 0.1.37

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[check_responses_cost] --> B[_cleanup_stale_managed_objects]
    B --> C[cutoff = now - STALENESS_CUTOFF_DAYS]
    C --> D[_expire_stale_rows\ncutoff, BATCH_SIZE]
    D --> E[execute_raw\nUPDATE ... WHERE id IN\nSELECT id ... LIMIT N]
    E --> F{rows updated > 0?}
    F -- yes --> G[Log warning with count]
    F -- no --> H[Return]
    G --> I
    H --> I[find_many\nLIMIT MAX_OBJECTS_PER_POLL_CYCLE]
    I --> J[For each job:\naget_responses]
    J --> K{Terminal status?}
    K -- completed/failed/cancelled --> L[Append to completed_jobs]
    K -- still active --> M[Skip]
    L --> N[update_many\ncompleted_jobs as 'completed']

Comments Outside Diff (1)

  1. enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py, lines 36-83

    P2 Missing unit tests for the new cleanup logic

    CLAUDE.md states "Adding at least 1 test is a hard requirement," and the PR's pre-submission checklist has this item unchecked. No tests for _expire_stale_rows or the modified _cleanup_stale_managed_objects appear in tests/test_litellm/.

    The _expire_stale_rows isolation boundary was designed exactly for mocking — a minimal test suite should cover:

    1. _expire_stale_rows is called with the correct cutoff and STALE_OBJECT_CLEANUP_BATCH_SIZE.
    2. When the return value is > 0, the warning is logged.
    3. _cleanup_stale_managed_objects does not raise when _expire_stale_rows returns 0.
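    A self-contained sketch of such a suite, using a stand-in for the real class (the `FakeCleaner` stub, the 7-day cutoff, and the logger shape are illustrative assumptions, not the project's actual code):

    ```python
    import asyncio
    import datetime
    from unittest.mock import AsyncMock, MagicMock

    STALE_OBJECT_CLEANUP_BATCH_SIZE = 1000  # assumed to mirror litellm/constants.py


    class FakeCleaner:
        """Stand-in mirroring the _cleanup/_expire split described in this PR."""

        def __init__(self, logger):
            self.logger = logger
            self._expire_stale_rows = AsyncMock(return_value=0)

        async def _cleanup_stale_managed_objects(self):
            # The 7-day value is illustrative; the real code derives the cutoff
            # from STALENESS_CUTOFF_DAYS.
            cutoff = datetime.datetime.now(
                datetime.timezone.utc
            ) - datetime.timedelta(days=7)
            rows = await self._expire_stale_rows(cutoff, STALE_OBJECT_CLEANUP_BATCH_SIZE)
            if rows > 0:
                self.logger.warning("expired %d stale managed objects", rows)


    def test_expire_called_with_cutoff_and_batch_size():
        cleaner = FakeCleaner(logger=MagicMock())
        asyncio.run(cleaner._cleanup_stale_managed_objects())
        cutoff, batch_size = cleaner._expire_stale_rows.call_args.args
        assert isinstance(cutoff, datetime.datetime)
        assert batch_size == STALE_OBJECT_CLEANUP_BATCH_SIZE


    def test_warning_logged_when_rows_expired():
        cleaner = FakeCleaner(logger=MagicMock())
        cleaner._expire_stale_rows.return_value = 5
        asyncio.run(cleaner._cleanup_stale_managed_objects())
        cleaner.logger.warning.assert_called_once()


    def test_no_raise_when_zero_rows_expired():
        cleaner = FakeCleaner(logger=MagicMock())
        asyncio.run(cleaner._cleanup_stale_managed_objects())  # returns without error
        cleaner.logger.warning.assert_not_called()
    ```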

    Rule Used: What: Ensure that any PR claiming to fix an issue ... (source)

Reviews (1): Last reviewed commit: "Revert "bump litellm-enterprise to 0.1.3..."

Comment on lines +49 to +64
return await self.prisma_client.db.execute_raw(
    """
    UPDATE "LiteLLM_ManagedObjectTable"
    SET "status" = 'stale_expired'
    WHERE "id" IN (
        SELECT "id" FROM "LiteLLM_ManagedObjectTable"
        WHERE "file_purpose" = 'response'
          AND "status" NOT IN ('completed', 'complete', 'failed', 'expired', 'cancelled', 'stale_expired')
          AND "created_at" < $1::timestamptz
        ORDER BY "created_at" ASC
        LIMIT $2
    )
    """,
    cutoff,
    batch_size,
)

P2 execute_raw regresses the no-raw-SQL coding standard

CLAUDE.md is explicit: "Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw." The previous code in this same file correctly used Prisma's update_many model method — this PR regresses a file that was already compliant.

The PR cites spend_log_cleanup.py as precedent, but that file predates the rule and is not a justification for new code to follow the same pattern.

Conforming two-query approach — fetch a bounded set of IDs with find_many, then update_many on that set:

stale_rows = await self.prisma_client.db.litellm_managedobjecttable.find_many(
    where={
        "file_purpose": "response",
        "status": {"not_in": ["completed", "complete", "failed", "expired", "cancelled", "stale_expired"]},
        "created_at": {"lt": cutoff},
    },
    take=batch_size,
    select={"id": True},
    order={"created_at": "asc"},
)
stale_ids = [row.id for row in stale_rows]
if not stale_ids:
    return 0
result = await self.prisma_client.db.litellm_managedobjecttable.update_many(
    where={"id": {"in": stale_ids}},
    data={"status": "stale_expired"},
)
return result

This preserves the batch-size bound, avoids raw SQL, and is mockable with simple Prisma stubs.

Context Used: CLAUDE.md (source)

