Skip to content

[refactor] Serving into proper modules#44796

Draft
SunMarc wants to merge 11 commits intomainfrom
refactor-serving
Draft

[refactor] Serving into proper modules#44796
SunMarc wants to merge 11 commits intomainfrom
refactor-serving

Conversation

@SunMarc
Copy link
Member

@SunMarc SunMarc commented Mar 17, 2026

What does this PR do?

This PR refactors transformers serve so that it is not in a single file. We split it into multiple files with clear responsabilities. For now, this is a POC and I will add the rest of the current features back into it once we agree on the refactor.

  • serve_refactored.py — only CLI args + wiring
  • model_manager.py — only model loading/caching
  • chat_completion.py — only the /v1/chat/completions logic
  • protocol.py — only API types and format conversion
  • app.py — only FastAPI routes and middleware

To be added:

  • KV cache reuse, request queue for model.generate()
  • Continuous batching
  • Tool calling
  • /v1/responses endpoint
  • /v1/audio/transcriptions endpoint
  • ModelBehavior (replace GPT-OSS hacks)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants