The analytics service runs two expensive ML computations:
- Topic derivation
- Language identification
Both are CPU-bound.
Both take the full text of each incoming chat completion or embedding request as input.
To reduce the CPU load and make it more predictable, we suggest analyzing only the tail of each request, for example its last 10K characters.
The final messages of a chat completion request, along with its system message, usually carry the most weight for the LLM, so the analytics service should prioritize them as well.
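The truncation could be sketched as follows. This is a minimal illustration, not the service's actual code: the function names `tail_text` and `select_text_for_analysis`, the `MAX_ANALYSIS_CHARS` constant, and the OpenAI-style `role`/`content` message dictionaries are all assumptions made for the example. Because the system message normally comes first, a plain tail cut could drop it, so the second helper keeps the system message and spends the remaining budget on the end of the conversation.

```python
from typing import Dict, List

# Illustrative budget taken from the "10K characters" suggestion; not a tuned value.
MAX_ANALYSIS_CHARS = 10_000


def tail_text(text: str, max_chars: int = MAX_ANALYSIS_CHARS) -> str:
    """Return at most the last max_chars characters of text."""
    if max_chars <= 0:
        # Guard needed because text[-0:] would return the whole string.
        return ""
    return text[-max_chars:]


def select_text_for_analysis(
    messages: List[Dict[str, str]], max_chars: int = MAX_ANALYSIS_CHARS
) -> str:
    """Hypothetical helper: keep the system message, then fill the rest of
    the budget with the tail of the remaining messages."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = "\n".join(m["content"] for m in messages if m["role"] != "system")

    # Cap the system text as well, so the result never exceeds max_chars.
    system = system[:max_chars]
    separator = 1 if system else 0  # one newline between system text and tail
    tail = tail_text(rest, max_chars - len(system) - separator)

    return "\n".join(part for part in (system, tail) if part)
```

For embedding requests, which have no message roles, `tail_text` alone would be enough. Bounding the input this way caps the per-request cost of both computations, regardless of how long the original request is.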