This script automatically switches Qwen3 between its thinking and non-thinking modes through an OpenAI-compatible API. The serving framework is sglang by default, and it can be adapted to vLLM with minor changes.
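Because both backends expose an OpenAI-compatible endpoint, only the base URL differs. A minimal sketch (the ports, launch commands, and model path are assumptions; adjust them to your deployment):

```python
from openai import OpenAI

# Assumed local endpoints; only base_url changes between backends.
# sglang: python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000
# vLLM:   vllm serve Qwen/Qwen3-8B --port 8000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
# For vLLM, use base_url="http://localhost:8000/v1" instead.
```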
The idea and part of the code come from https://github.com/AaronFeng753/Better-Qwen3.
Thanks @AaronFeng753
A single Qwen3 model handles both query classification and answering. The classification step itself runs with thinking disabled to save tokens and reduce latency, with negligible impact on the final response. Based on the incoming query, the classifier then selects 'think' or 'no think' mode for the actual answer. The output includes the debug status, whether thinking mode is active, and the full response content; each of these can be shown or hidden as needed. A sketch of this routing flow follows below.
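The sketch below illustrates the flow: the model first classifies the query with thinking suppressed (via Qwen3's `/no_think` soft switch), then answers with `/think` or `/no_think` appended accordingly. The prompt wording, model name, port, and `debug` flag are assumptions for illustration, not the exact implementation.

```python
from openai import OpenAI

# Assumed endpoint and model name; adjust for your sglang or vLLM deployment.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3-8B"

# Hypothetical classifier prompt; /no_think keeps this step cheap.
CLASSIFIER_PROMPT = (
    "Decide whether the following query needs step-by-step reasoning "
    "(math, coding, multi-step logic) or can be answered directly. "
    "Reply with exactly one word: THINK or NO_THINK.\n\nQuery: {query} /no_think"
)

def needs_thinking(query: str) -> bool:
    """Classify the query using the same model, with thinking disabled."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        temperature=0.0,
        max_tokens=8,
    )
    return "NO_THINK" not in resp.choices[0].message.content.upper()

def answer(query: str, debug: bool = True) -> str:
    """Route the query to 'think' or 'no think' mode via Qwen3's soft switch."""
    thinking = needs_thinking(query)
    suffix = " /think" if thinking else " /no_think"
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": query + suffix}],
    )
    if debug:
        # Debug status can be shown or hidden as needed.
        print(f"[debug] thinking={'on' if thinking else 'off'}")
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(answer("What is the capital of France?"))
    print(answer("Prove that the sum of the first n odd numbers is n^2."))
```

Note that when thinking is enabled, the response content may include a `<think>...</think>` block depending on how the server renders the chat template; strip or display it according to your needs.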



