[Fix] Prevent atexit flush hangs and guard proxy_server_request header lookup#26470
yuneng-berri wants to merge 7 commits into `litellm_internal_staging`
Conversation
[Infra] Promote internal staging to main
* feat(openai): day-0 support for GPT-5.5 and GPT-5.5 Pro
Add pricing + capability entries for the new GPT-5.5 family launched by
OpenAI on 2026-04-24:
- gpt-5.5 / gpt-5.5-2026-04-23 (chat): $5/$30/$0.50 per 1M
input/output/cached input
- gpt-5.5-pro / gpt-5.5-pro-2026-04-23 (responses-only): $60/$360/$6
per 1M input/output/cached input
Other fees (long-context >272k, flex, batches, priority, cache
discounts) follow the same ratios as GPT-5.4, with context window
retained at 1.05M input / 128K output.
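The per-1M rates above imply straightforward per-token costs. A quick sanity-check sketch for the base gpt-5.5 tier (the function name and signature are illustrative, not LiteLLM's cost-calculator API; cached prompt tokens are assumed to bill at the cached-input rate):

```python
# Base gpt-5.5 rates from the entry above, USD per 1M tokens.
RATES_PER_1M = {"input": 5.00, "output": 30.00, "cached_input": 0.50}

def estimate_cost(prompt_tokens: int, completion_tokens: int, cached_tokens: int = 0) -> float:
    """Rough cost estimate; cached prompt tokens bill at the discounted rate."""
    billable_prompt = prompt_tokens - cached_tokens
    usd = (
        billable_prompt * RATES_PER_1M["input"]
        + cached_tokens * RATES_PER_1M["cached_input"]
        + completion_tokens * RATES_PER_1M["output"]
    )
    return usd / 1_000_000
```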
No transformation / classifier code changes are required:
OpenAIGPT5Config.is_model_gpt_5_4_plus_model() already matches 5.5+ via
numeric version parsing, and model registration is driven from the
JSON. The existing responses-API bridge for tools + reasoning_effort
(litellm/main.py:970) already covers gpt-5.5-pro.
Tests:
- GPT5_MODELS regression list now covers gpt-5.5-pro and dated variants
- New test_generic_cost_per_token_gpt55_pro cost-calc test
- Updated test_generic_cost_per_token_gpt55 for long-context fields
* fix(openai): mirror reasoning_effort flags onto gpt-5.5 dated variants
gpt-5.5-2026-04-23 and gpt-5.5-pro-2026-04-23 were missing the
supports_none_reasoning_effort, supports_xhigh_reasoning_effort, and
supports_minimal_reasoning_effort flags that their non-dated
counterparts define. Reasoning-effort routing in OpenAIGPT5Config is
fully capability-driven from these JSON flags — since an absent flag
is treated as False for opt-in levels (xhigh), users pinning to a
dated snapshot would silently lose xhigh support and diverge from the
base alias on logprobs + flexible temperature handling.
Copy the flags onto both dated variants so every dated snapshot
inherits the base model's reasoning-effort capability profile.
Adds a parametrized regression test that asserts
supports_{none,minimal,xhigh}_reasoning_effort parity between each
dated variant and its non-dated counterpart, preventing future drift
when new snapshots are added.
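A parity check along those lines might look like the following sketch. The real test reads LiteLLM's model-cost JSON; here a stub registry dict stands in, and the helper names are hypothetical:

```python
# Flags named in the commit; an absent flag counts as False (opt-in semantics).
FLAGS = (
    "supports_none_reasoning_effort",
    "supports_minimal_reasoning_effort",
    "supports_xhigh_reasoning_effort",
)

def reasoning_effort_profile(model_info: dict) -> tuple:
    """Normalize a model entry to a (none, minimal, xhigh) capability tuple."""
    return tuple(bool(model_info.get(flag)) for flag in FLAGS)

def check_dated_variant_parity(registry: dict, dated: str, base: str) -> bool:
    """True when a dated snapshot matches its base alias flag-for-flag."""
    return reasoning_effort_profile(registry[dated]) == reasoning_effort_profile(registry[base])
```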
- get_proxy_server_request_headers raised AttributeError when
litellm_params["proxy_server_request"] was set to None (the default
for SDK callers). Guard the helper so it returns {} for None or
non-dict values.
- LoggingWorker._flush_on_exit now wraps each queued coroutine in
asyncio.wait_for(..., timeout=2.0) so a single hung coroutine cannot
block the atexit handler past the outer 5s budget.
- Drop _turn_on_debug() from test_vertex_schema_test to reduce noise on
a network-dependent test.
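The guard in the first bullet can be sketched roughly as follows. This is a simplified stand-in for the helper in `litellm/litellm_core_utils/llm_request_utils.py`; the real implementation may differ in detail:

```python
def get_proxy_server_request_headers(litellm_params: dict) -> dict:
    """Return request headers, or {} when proxy_server_request is None or not a dict."""
    proxy_server_request = litellm_params.get("proxy_server_request")
    if not isinstance(proxy_server_request, dict):
        return {}  # SDK callers set this key to literal None
    headers = proxy_server_request.get("headers")
    return headers if isinstance(headers, dict) else {}
```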
Low: No security issues found. This PR adds defensive type-checking to proxy server request header lookup, adds a per-task timeout to the atexit flush to prevent hangs, and adjusts CI test retry settings. None of the changes introduce new attack surface or weaken existing security controls. Status: 0 open. Posted by Veria AI · 2026-04-25T06:00:22.256Z
Greptile Summary: This PR addresses three independent issues contributing to CI flakiness: an `AttributeError` in the proxy-header lookup, an unbounded atexit flush, and missing xdist restart/retry settings. Confidence Score: 5/5. Safe to merge; all P-level findings are P2 (misleading PR description, no runtime regressions introduced). Both code-level bug fixes are correct and narrowly scoped. The CI config change uses an already-declared dependency. The only concern is that the PR description overstates test hermeticity — a P2 documentation gap that does not affect runtime behavior. `tests/local_testing/test_amazing_vertex_completion.py` still makes real Vertex AI calls contrary to the PR description.
| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/llm_request_utils.py | Correctly fixes AttributeError when proxy_server_request is None by guarding the chained .get() with a dict check before accessing headers. |
| litellm/litellm_core_utils/logging_worker.py | Wraps each queued coroutine in asyncio.wait_for(..., timeout=2.0) to prevent a single hung task from blocking the entire atexit flush past the outer budget; hardcoded 2.0s was flagged in a prior comment. |
| .circleci/config.yml | Adds --retries 2 --retry-delay 5 --max-worker-restart=5 to both local_testing_part1 and local_testing_part2 jobs; pytest-retry==1.7.0 is a declared dependency so the new flags are valid. |
| tests/local_testing/test_amazing_vertex_completion.py | Only removes litellm._turn_on_debug() from test_vertex_schema_test; despite the PR description claiming tests are mocked, both target tests still make real Vertex AI network calls. |
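The `llm_request_utils.py` failure mode is easy to reproduce: the chained `.get` default only applies when the key is absent, not when its value is `None`. A minimal repro:

```python
# SDK callers set proxy_server_request to literal None in litellm_params.
litellm_params = {"proxy_server_request": None}

try:
    # Old pattern: the {} default never applies because the key exists,
    # so .get("headers") is called on None.
    litellm_params.get("proxy_server_request", {}).get("headers", {})
    raised = False
except AttributeError:
    raised = True  # 'NoneType' object has no attribute 'get'
```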
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[atexit triggered] --> B[_flush_on_exit]
    B --> C{Queue empty?}
    C -- Yes --> Z[return]
    C -- No --> D[new event loop]
    D --> E{time < MAX_TIME_TO_CLEAR_QUEUE AND iterations < MAX_ITERATIONS}
    E -- No --> F[log warning & break]
    E -- Yes --> G[queue.get_nowait]
    G --> H[asyncio.wait_for coroutine timeout=2.0]
    H -- success --> I[processed += 1]
    H -- TimeoutError/Exception --> J[silent failure]
    I --> E
    J --> E
    subgraph llm_request_utils fix
        K[litellm_params.get proxy_server_request] --> L{is None or not dict?}
        L -- Yes --> M[return empty dict]
        L -- No --> N[.get headers or empty]
    end
```
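The flush path in the flowchart can be sketched as below. This is a simplified stand-in for `LoggingWorker._flush_on_exit`, assuming a thread-safe queue of coroutine objects; the constant names mirror the PR, but the structure is illustrative, not LiteLLM's actual code:

```python
import asyncio
import queue
import time

MAX_TIME_TO_CLEAR_QUEUE = 5.0  # outer budget; env-overridable in the real worker
PER_TASK_TIMEOUT = 2.0         # new per-coroutine cap added by this PR

def flush_on_exit(task_queue: "queue.Queue", per_task_timeout: float = PER_TASK_TIMEOUT) -> int:
    """Best-effort drain at atexit: bound each coroutine individually."""
    loop = asyncio.new_event_loop()
    start = time.monotonic()
    processed = 0
    try:
        while time.monotonic() - start < MAX_TIME_TO_CLEAR_QUEUE:
            try:
                coro = task_queue.get_nowait()
            except queue.Empty:
                break
            try:
                # A hung coroutine is cancelled after per_task_timeout instead
                # of eating the whole outer budget.
                loop.run_until_complete(asyncio.wait_for(coro, timeout=per_task_timeout))
                processed += 1
            except Exception:
                pass  # timeouts and task errors are swallowed; flush is best-effort
    finally:
        loop.close()
    return processed
```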
Reviews (4): Last reviewed commit: "[Infra] add worker restart and retry fla..."
```diff
 try:
-    loop.run_until_complete(task["coroutine"])
+    loop.run_until_complete(
+        asyncio.wait_for(task["coroutine"], timeout=2.0)
+    )
```
Hardcoded per-task timeout may exceed configurable outer budget
The per-task timeout of 2.0 s is a magic number that doesn't track the configurable MAX_TIME_TO_CLEAR_QUEUE constant (defaults to 5.0 s, overridable via env). If an operator sets MAX_TIME_TO_CLEAR_QUEUE=1.0, every single task would consume more than the entire outer budget before the loop-level time check could fire, effectively turning the outer limit into dead code for any queue with at least one slow coroutine.
Consider deriving the per-task limit from the outer budget or at least capping it:
```python
_per_task_timeout = min(2.0, MAX_TIME_TO_CLEAR_QUEUE)
loop.run_until_complete(
    asyncio.wait_for(task["coroutine"], timeout=_per_task_timeout)
)
```
Codecov Report❌ Patch coverage is
…itellm_fix_atexit_flush_attribute_error
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 31539530 | Triggered | Generic Password | 88faae2 | .github/workflows/_test-unit-services-base.yml | View secret |
test_vertex_schema_test and test_gemini_nullable_object_tool_schema_httpx were calling vertex_ai/gemini-2.5-flash for real. When Vertex was slow, the calls hung past pytest's 300s timeout, pytest-timeout fired os._exit(1), and the xdist worker disappeared mid-run as "Not properly terminated". The actual purpose of these tests is to exercise the schema-conversion code path (anyOf with null types, nullable object tool params). Mocking the HTTP layer and the auth token retrieval makes them hermetic and removes the worker-crash failure mode entirely.
…rk calls" This reverts commit eddae66.
Match the resilience config that the llm_translation job already uses: --max-worker-restart=5 lets xdist replace workers killed by pytest-timeout's os._exit(1) instead of failing the whole job, and --retries 2 --retry-delay 5 self-heals transient network flakes. Preserves real integration test coverage; no test code changes needed.
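In `.circleci/config.yml` terms, the change amounts to appending the flags to the pytest invocation of both jobs. A hedged sketch only; the surrounding job structure is abbreviated and does not match the repo's exact config:

```yaml
# Illustrative fragment; real job layout and test command differ.
local_testing_part1:
  steps:
    - run:
        name: Run tests
        command: |
          python -m pytest tests/local_testing \
            --timeout=300 \
            --max-worker-restart=5 \
            --retries 2 --retry-delay 5
```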
Relevant issues
Summary
Reduce flakiness in `local_testing_part1`/`part2`. Workers were intermittently dying with `[gw3] node down: Not properly terminated` while running `tests/local_testing/test_amazing_vertex_completion.py`, causing the whole 8-minute job to fail and need re-running.
Failure path (before fix)
- `vertex_ai/gemini-2.5-flash` calls in the test suite occasionally hung past CI's `--timeout=300`. `pytest-timeout` in `thread` mode then calls `os._exit(1)` (see `pytest_timeout.py:542`), bypassing every atexit handler and removing the xdist worker mid-run. xdist had no restart budget, so the job failed.
- `get_proxy_server_request_headers` did `litellm_params.get("proxy_server_request", {}).get("headers", {})`. The chained `.get` only falls back to `{}` when the key is missing; every SDK call sets `proxy_server_request` to literal `None`, so the second `.get` raised `AttributeError: 'NoneType' object has no attribute 'get'`. The exception was caught and silently logged on every non-proxy completion.
- `LoggingWorker._flush_on_exit` ran each queued coroutine with `loop.run_until_complete(task["coroutine"])` and no per-coroutine timeout. The outer 5s budget was only checked between iterations, so one hung coroutine could block the whole atexit handler.
Fix
- Added `--max-worker-restart=5 --retries 2 --retry-delay 5` to `local_testing_part1` and `local_testing_part2` in `.circleci/config.yml`. xdist now replaces crashed workers instead of failing the job, and the affected tests get re-run automatically. Matches the resilience config the `llm_translation` job already uses. Preserves all real integration test coverage.
- `get_proxy_server_request_headers` now treats `None` and non-dict values as empty and always returns a dict.
- `_flush_on_exit` wraps each task in `asyncio.wait_for(coro, timeout=2.0)` so no single coroutine can stall the flush past the outer budget.
Testing
- Ran `tests/test_litellm/litellm_core_utils/test_logging_worker.py` and `tests/test_litellm/enterprise/enterprise_callbacks/test_callback_controls.py`: 30 passed.
- Exercised `get_proxy_server_request_headers` across 6 input shapes (None, missing key, explicit None, headers None, non-dict, real headers): all return sensible dicts, no exceptions.
- Queued a `sleep(60)` coroutine plus two fast ones: flush returned in 2.0s instead of blocking.
Type
🐛 Bug Fix
🚄 Infrastructure
Screenshots