To enable running guardrails on chat completion use cases, we want to provide a chat completion API that allows users to run various types of detectors in their chat completion flow.
The OpenAI chat completions API has become a popular way of using generative models to respond in chat conversations. With conversation use cases such as customer service in mind, we want to expand our API beyond applying detectors to text in general, to applying them specifically to chat messages.
- We will use the OpenAI chat completions API as the base API for chat completions. Parameters detailed there can be referenced.
- Different chat completions server solutions like vllm may provide extra parameters. For the orchestrator API to remain agnostic of any particular server implementation, the orchestrator can pass through any additional parameters to the generation server, which can perform its own parameter validation (see the request sketch after this list).
- Guardrails-related API updates will be additive; we will not modify any OpenAI API elements, like `messages` or `choices`.
- We will add a `detectors` block in the request, containing `input` and `output` blocks. This will provide flexibility and control, allowing users to provide a list of detectors with parameters for both chat completions input and output separately (also shown in the request sketch after this list).
    - Here, `input` and `output` do not strictly refer to whether only the input to chat completions is provided to the detectors or only the output of chat completions is provided to the detectors. Detectors specified in `output` can potentially take both the input and output of a chat completions model.
- We will add a `detections` block in the response.
    - If no input (or output) detectors were requested, then the corresponding `detections.input` (or `detections.output`) element would not be present in the response.
    - If neither input nor output detectors are requested, then we will return `422` (similar behavior to other standalone `v2` endpoints).
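As a hedged sketch of such a request: the detector IDs, their parameters, and the `top_k` field are illustrative assumptions (`top_k` is a vllm extension rather than an OpenAI parameter), and whether each of `input` and `output` is a map of detector IDs to parameters or a list is left to the concrete schema.

```json
{
  "model": "model",
  "messages": [
    {"role": "user", "content": "How do I pick a strong password?"}
  ],
  "temperature": 0.7,
  "top_k": 40,
  "detectors": {
    "input": {
      "example_pii_detector": {}
    },
    "output": {
      "example_hap_detector": {"threshold": 0.8}
    }
  }
}
```

Under this sketch, the orchestrator would strip the `detectors` block, forward the remaining body (including the pass-through `top_k`) to the generation server for validation, and run the listed detectors on the request messages and the generated `choices`.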
This is particular to the `choices` array on the chat completion response.
- Each choice is independent.
- Each choice is generated with the same input.
- When output detectors are requested, if the LLM (the model used to generate chat completions) does not generate `content` in any `message` in `choices`, then we will ignore the `choices` and return a warning about `detections.output`.
    - This was written based on the detectors available at this time, which are designed to work on `content`, such as text detectors. This may change if new detectors work on tool calls, for example.
- We will run all detectors included in `detectors.output` of the user request on all `choices` (that have `message.content`), unless there are certain detectors that are supposed to be run across choices.
- We will order all detections that include spans by `start` and move all the others toward the end (in no particular order), grouping the remaining ones by `detector_id` (see the sketch after this list).
    - An alternate consideration was to not order the detections and just include `detector_id`. However, since detectors are called in parallel, simply collecting results may lead to a different ordering on each call. This ordering allows detections with spans to be processed in order by `start`.
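To illustrate that ordering, a hedged sketch of a single choice's `results` array (the detector IDs, `end`, and `score` fields are assumptions): span-bearing detections come first, ordered by `start`, with span-less detections grouped by `detector_id` at the end.

```json
[
  {"detector_id": "example_pii_detector", "detection_type": "pii", "start": 5, "end": 16, "score": 0.97},
  {"detector_id": "example_hap_detector", "detection_type": "hap", "start": 42, "end": 58, "score": 0.81},
  {"detector_id": "example_topic_detector", "detection_type": "topic", "score": 0.65},
  {"detector_id": "example_topic_detector", "detection_type": "topic", "score": 0.33}
]
```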
The streaming response with guardrails is built off the chat completion chunk responses. This applies when both streaming and output detectors are requested.
- Each choice's `delta.content`, if present, will be the `content` used for `detections`. Like the unary case, if there is no `content` overall, a warning will be returned.
- Since detector use will cause content to be "chunked" (e.g. collected into sentences for sentence detectors), to prevent ambiguity, `delta.role` will be provided on each stream event, and `delta.content` will include the "chunked" text. This will apply to each `choice` in `choices`. Like previous text generation cases, this API will also only return fully processed chunks, to ensure that users receive text back only when all detectors have been run.
- A `detections` block will be added to each stream event (see the sketch after this list). Each processed `choice` will be noted by including the choice's index in the original `choices` array in a `choice_index` field, even if there are no particular `results` from running the output detectors.
- For output detectors that run on the contents of an entire output stream, results will be included as `detections` on the second-to-last event (the last one before `[DONE]`) of the stream, regardless of the presence of `usage`.
    - Note that the overall `usage` is returned as-is to the user, with `detections` simply added. Likewise, any other optional response parameters will be unchanged.
    - An alternate consideration was the currently implemented aggregation strategy, where if the user requested any output detectors that required the entire output, the stream events with detection results would only be returned all together. The problem was that the results would then not really be "streamed" back.
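A hedged sketch of one such stream event, following the chunk format used elsewhere in this document; the chunked `delta.content`, detector ID, and result fields are illustrative assumptions:

```
data: {"id":"chat-abc123","object":"chat.completion.chunk","created":1727139047,"model":"model","choices":[{"index":0,"delta":{"role":"assistant","content":"This sentence is a fully processed chunk."},"logprobs":null,"finish_reason":null}],"detections":{"output":[{"choice_index":0,"results":[{"detector_id":"example_hap_detector","detection_type":"hap","start":0,"end":41,"score":0.02}]}]}}
```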
This part gives a brief overview of how we expect a user to access information from the response object.
- Identifying which choice, out of potentially multiple choices, results exist for: `detections.output[0].choice_index` (gives the index of the choice in the original `choices`)
- Result info of a given choice: `detections.output[0].results[0].detection_type`
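For reference, a hedged sketch of the unary response's `detections` block that those paths index into (the detector ID and result values are illustrative assumptions):

```json
{
  "detections": {
    "output": [
      {
        "choice_index": 0,
        "results": [
          {"detector_id": "example_hap_detector", "detection_type": "hap", "start": 10, "end": 24, "score": 0.91}
        ]
      }
    ]
  }
}
```

Here `detections.output[0].choice_index` identifies the choice, and `detections.output[0].results[0].detection_type` gives the type of the first result, as described above.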
[This part is an amendment with the addition of detector types, documented in ADR 006]

Since OpenAI chat completions API requests can contain multiple messages in one user request, detectors with different detector types could potentially be applied (e.g. the whole chat history can be analyzed with a /chat detector, while a single message can be analyzed with a /contents detector).

As support for various detector types is added, the detector types supported by detections on chat completions will be documented. Unsupported detector types will produce errors on usage.
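As a hedged sketch of this amendment (the detector IDs are illustrative, and each detector's type is assumed to be resolved from orchestrator configuration rather than specified in the request), one request could invoke detectors of different types on the same chat history:

```json
{
  "model": "model",
  "messages": [
    {"role": "user", "content": "My account number is 12345."},
    {"role": "assistant", "content": "Thanks, how can I help?"},
    {"role": "user", "content": "Close my account."}
  ],
  "detectors": {
    "input": {
      "example_chat_detector": {},
      "example_contents_detector": {}
    }
  }
}
```

A /chat-type detector would receive the whole `messages` history, while a /contents-type detector would receive individual message `content`, with its results carrying `message_index` to identify which message was analyzed.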
- We will add a `warnings` element to show cases where the response is not 4xx, but there are issues in processing (see the sketch at the end of this section).
- Only additive changes to the OpenAI chat completions results will be made in the unary case. In the streaming case, streamed events may need to have updated text in `delta.content`. This can affect token counts per event in `usage`, as well as which `tokens` are included for `logprobs.content` or similar fields.
- Validation and error handling will have to be provided for cases where there is no expected `content` to run detectors on, raising warnings to the user when appropriate.
- Since the results of various detectors can be included in the results of each `choice`, the inclusion of `detector_id` in each result will help distinguish which detector produced which result.
- Results from input detectors will include `message_index` to indicate which message was processed. In cases where only the last message of the chat history may be processed to limit resource usage, this will make it obvious to the user which message was processed.
- Similarly, since output detectors will be run on all `choices` provided by the chat completions model, `choice_index` will be provided to inform the user which `choice` the detector result(s) apply to. This will apply regardless of whether the detector itself takes full or partial input to the chat completions model.
- A new result aggregation strategy will need to be implemented to put any detector results that require the full output text from a chat completions model (e.g. all the text of a particular `choice`) on the second-to-last event.
- For the streaming response, implementation-wise, aggregating the streaming responses will be complicated, since currently, for the OpenAI chat completions API, each stream event returns content for only one choice, e.g.
data: {"id":"chat-72643462c1164571b8c8a2207dd89dc9","object":"chat.completion.chunk","created":1727139047,"model":"model","choices":[{"index":1,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":32,"completion_tokens":0}}
data: {"id":"chat-72643462c1164571b8c8a2207dd89dc9","object":"chat.completion.chunk","created":1727139047,"model":"model,"choices":[{"index":2,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":32,"completion_tokens":0}}
data: {"id":"chat-72643462c1164571b8c8a2207dd89dc9","object":"chat.completion.chunk","created":1727139047,"model":"model","choices":[{"index":1,"delta":{"content":"MO"},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":33,"completion_tokens":1}}
This means that the orchestrator will have to be able to track chunking and detection for each choice, potentially with multiple detectors and their respective chunkers.
- Based on `detector_type`, detectors may "filter", i.e. work on partial information from, the full input to chat completions. For example, some detectors may only work on individual `content` messages, while other detectors may work on the entire history of chat messages.
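To make the `warnings` element mentioned at the start of this section concrete, a hedged sketch of a non-4xx response where no `message.content` was available for output detectors (the warning `type` value and message text are illustrative assumptions):

```json
{
  "warnings": [
    {
      "type": "unsuitable_output",
      "message": "No message content was generated in choices, so output detections were not run."
    }
  ]
}
```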
Accepted