feat: Improve client perf and error handling #247
LukeAVanDrie wants to merge 1 commit into kubernetes-sigs:main
Conversation
Reuses `aiohttp.ClientSession` across requests in `openAIModelServerClient` to reduce connection overhead. This change improves client-side throughput and latency.

Additional improvements:
- Refines error handling to distinguish between network errors (like `aiohttp.ClientError`), non-200 HTTP status codes, and errors during response processing.
- Ensures non-200 responses with text bodies are captured.
- Guarantees the response body is always consumed to release connections.
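The three-way error split described above can be sketched roughly as follows. `FakeResponse` and the tuple-based return values are hypothetical stand-ins (not the PR's actual code) so the sketch runs without a live server; a real client would get the response from `aiohttp` and catch `aiohttp.ClientError` around the request itself:

```python
import asyncio
import json

class FakeResponse:
    """Hypothetical stand-in for an aiohttp response so this runs offline."""
    def __init__(self, status, body):
        self.status = status
        self._body = body

    async def text(self):
        return self._body

async def process_response(resp):
    # Always read the body, even on non-200, so the connection is released
    # and the error text is captured.
    body = await resp.text()
    if resp.status != 200:
        return ("http_error", resp.status, body)
    try:
        return ("ok", json.loads(body))
    except json.JSONDecodeError as exc:
        # Errors during response processing are reported separately from
        # transport errors and bad status codes.
        return ("processing_error", str(exc))

async def demo():
    ok = await process_response(FakeResponse(200, '{"usage": 1}'))
    bad_status = await process_response(FakeResponse(503, "overloaded"))
    bad_body = await process_response(FakeResponse(200, "not json"))
    return ok, bad_status, bad_body

results = asyncio.run(demo())
print(results)
```

In the real client, a fourth branch would wrap the request in `except aiohttp.ClientError` to cover network-level failures before a response exists at all.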
jjk-g
left a comment
Thanks for adding! One nit
/lgtm
achandrasekar
left a comment
Can you add how the change was tested? If you have any numbers on the improvements, that'd be great too.
Please address the linting and type check issues above.
@LukeAVanDrie friendly ping for linting and type check errors
elif not tokenizer_config:
    tokenizer_config = CustomTokenizerConfig(pretrained_model_name_or_path=self.model_name)
self.tokenizer = CustomTokenizer(tokenizer_config)
self.session = aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=self.max_tcp_connections))
Please correct me if I'm wrong, but isn't openAIModelServerClient shared across multiple asyncio event loops because of multiprocessing? Creating a single ClientSession here might cause issues if the same instance is also being shared to all the multiprocessing workers.
Relevant link: https://stackoverflow.com/questions/62707369/one-aiohttp-clientsession-per-thread
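One way to address this concern, sketched here rather than taken from the PR itself, is to create at most one session per running event loop, since an `aiohttp.ClientSession` is bound to the loop it was created on. `Session` below is a hypothetical stand-in for `aiohttp.ClientSession` so the sketch runs offline:

```python
import asyncio

class Session:
    """Hypothetical stand-in for aiohttp.ClientSession; a real client would
    create aiohttp.ClientSession(connector=...) here instead."""
    instances = 0

    def __init__(self):
        Session.instances += 1

_sessions = {}

def get_session():
    # One session per running event loop: sharing a single instance across
    # multiprocessing workers (each with its own loop) is unsafe, so the
    # session is created lazily, keyed by the current loop.
    loop = asyncio.get_running_loop()
    if loop not in _sessions:
        _sessions[loop] = Session()
    return _sessions[loop]

async def worker():
    # Two calls on the same loop reuse the same session instance.
    return get_session() is get_session()

same = asyncio.run(worker())
count_after_one_loop = Session.instances
asyncio.run(worker())  # a second asyncio.run uses a new loop -> new session
print(same, count_after_one_loop, Session.instances)
```

A production version would also need to close each session when its loop shuts down; the cache here is deliberately minimal.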
Slightly refactor `openAIModelServerClient` to accept a custom
`aiohttp.ClientSession` per request, which allows us to use exactly one
client session per worker.

Prior to this commit, a new `aiohttp.ClientSession` was created for each
request. Not only is this inefficient, lowering throughput; in certain
environments it also leads to inotify watch issues:

    aiodns - WARNING - Failed to create DNS resolver channel with
    automatic monitoring of resolver configuration changes. This usually
    means the system ran out of inotify watches. Falling back to socket
    state callback. Consider increasing the system inotify watch limit:
    Failed to initialize c-ares channel

Indeed, because a new DNS resolver is created for each `ClientSession`,
creating large numbers of `ClientSession`s eventually exhausts the
inotify watch limit. Sharing `ClientSession`s solves this issue.

Relevant links:
- https://docs.aiohttp.org/en/stable/http_request_lifecycle.html
- https://stackoverflow.com/questions/62707369/one-aiohttp-clientsession-per-thread
- home-assistant/core#144457 (comment)

Relevant PR: kubernetes-sigs#247
(doesn't address the issue of worker sharing).
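The per-worker session pattern this commit message describes might look roughly like the following; `FakeSession`, `process_request`, and `worker` are illustrative stand-ins, not the repository's code:

```python
import asyncio

class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession so this runs offline;
    a real worker would use `async with aiohttp.ClientSession() as session`."""
    created = 0

    def __init__(self):
        FakeSession.created += 1

    async def post(self, payload):
        return {"echo": payload}

async def process_request(session, payload):
    # The session is supplied by the caller rather than created here,
    # so the caller decides how widely it is shared.
    return await session.post(payload)

async def worker(payloads):
    # Exactly one session for the whole worker, reused for every request.
    session = FakeSession()
    return [await process_request(session, p) for p in payloads]

results = asyncio.run(worker(["a", "b", "c"]))
print(results, FakeSession.created)
```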
Slightly refactor `openAIModelServerClient` to add a new method,
`process_request_with_session`, that accepts a custom
`ReusableHTTPClientSession` per request, which allows the caller to
reuse one HTTP client session per worker.

The previous method, `process_request`, now creates a fresh HTTP client
session and calls `process_request_with_session`, preserving the
previous behavior.

Prior to this commit, a new `aiohttp.ClientSession` was created for each
request. Not only is this inefficient, lowering throughput; in certain
environments it also leads to inotify watch issues:

    aiodns - WARNING - Failed to create DNS resolver channel with
    automatic monitoring of resolver configuration changes. This usually
    means the system ran out of inotify watches. Falling back to socket
    state callback. Consider increasing the system inotify watch limit:
    Failed to initialize c-ares channel

Indeed, because a new DNS resolver is created for each `ClientSession`,
creating large numbers of `ClientSession`s eventually exhausts the
inotify watch limit. Sharing `ClientSession`s solves this issue.

Relevant links:
- https://docs.aiohttp.org/en/stable/http_request_lifecycle.html
- https://stackoverflow.com/questions/62707369/one-aiohttp-clientsession-per-thread
- home-assistant/core#144457 (comment)

Relevant PR: kubernetes-sigs#247
(doesn't address the issue of worker sharing).
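A minimal sketch of the two-method split this commit message describes, with `FakeSession` as a hypothetical stand-in for the PR's `ReusableHTTPClientSession` so it runs offline:

```python
import asyncio

class FakeSession:
    """Hypothetical stand-in for the PR's ReusableHTTPClientSession."""
    def __init__(self):
        self.closed = False

    async def post(self, payload):
        return {"echo": payload}

    async def close(self):
        self.closed = True

class Client:
    async def process_request_with_session(self, session, payload):
        # Core path: the caller owns the session's lifecycle, so one
        # session can serve every request a worker issues.
        return await session.post(payload)

    async def process_request(self, payload):
        # Backward-compatible path: create a fresh session per request,
        # delegate, and close it afterwards (the pre-commit behavior).
        session = FakeSession()
        try:
            return await self.process_request_with_session(session, payload)
        finally:
            await session.close()

result = asyncio.run(Client().process_request("hi"))
print(result)
```

Keeping `process_request` as a thin wrapper means existing callers are untouched while new callers can opt into session reuse.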
@LukeAVanDrie Thanks for the contribution. Can you please rebase this PR?
Yes, apologies for the long delay here. I will make sure to update the description with my testing results and verify @diamondburned's concern regarding multiprocessing.
@LukeAVanDrie any updates on this PR?