
feat: introduce APIs for retrieving chat completion requests #2145


Merged
merged 1 commit into meta-llama:main from pr2145 on May 19, 2025

Conversation

ehhuang
Contributor

@ehhuang ehhuang commented May 12, 2025

What does this PR do?

This PR introduces APIs to retrieve past chat completion requests, which will be used in the LS UI.

Our current Telemetry is ill-suited for this purpose as it's untyped so we'd need to filter by obscure attribute names, making it brittle.

Since these APIs are 'provided by stack' and don't need to be implemented by inference providers, we introduce a new InferenceProvider class, containing the existing inference protocol, which is implemented by inference providers.

The APIs are OpenAI-compliant, with an additional input_messages field.
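For illustration, a minimal sketch of the returned record shape, assuming OpenAI's chat completion object plus the extra field (the class name and simplified types here are illustrative, not the exact ones in the diff):

from pydantic import BaseModel


class OpenAIChatCompletionWithInputMessages(BaseModel):
    # standard OpenAI chat completion fields (simplified: choices and
    # messages are structured models in the real API, not plain dicts)
    id: str
    object: str = "chat.completion"
    created: int  # unix timestamp (seconds)
    model: str
    choices: list[dict]
    # the extra field this PR adds: the messages the completion was
    # generated from, so the LS UI can render the full conversation
    input_messages: list[dict]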

Test Plan

This PR just adds the API and marks them provided_by_stack. Start stack server -> doesn't crash

@facebook-github-bot facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on May 12, 2025
@ehhuang ehhuang changed the title from "apis" to "feat: RFC: introduce log APIs" on May 12, 2025
@ehhuang ehhuang marked this pull request as ready for review May 12, 2025 17:46
@rhuss
Contributor

rhuss commented May 12, 2025

Sorry for my ignorance, but how is this supposed to work in a multi-user environment to return only user-specific completions and not every log for all users? Or is the assumption that only one user per llama-stack server is supported? (sorry if this is a stupid question, still making my way through lls)

@ashwinb
Contributor

ashwinb commented May 12, 2025

Relying on untyped attributes also hurts reliability.

I think this is the key argument and it makes sense. There are a few things to discuss here:

  • Would you have similar endpoints for Responses? There's a /responses/{:id} endpoint for Responses also; how would this (log) namespace cohabit with that pre-existing one?
  • Continuing on that thread, we have /agents/{:id} retrieval APIs also for listing and retrieving information about Sessions, Turns, etc.
  • So in a certain way, this isn't quite telemetry actually but something more opinionated and higher level -- namely, for each completion you pass a store: boolean value and then state is kept by the Stack. And then, you would situate these (query) APIs within Api.inference itself.

An alternate design could be to double down on Telemetry itself and say (in the spec, somehow) what each API call must do as Telemetry side-effects with standardized attributes. Then, you can lean on the more general Telemetry APIs for querying.

====
My inclination is to do what you are proposing: APIs which serve a clear purpose work better and can both be relied upon and correctly deprecated. Adding very generic requirements to very general APIs (prematurely, while the set of use cases isn't mature) is a recipe for disaster.

@ehhuang
Contributor Author

ehhuang commented May 12, 2025

Sorry for my ignorance, but how is this supposed to work in a multi-user environment to return only user-specific completions and not every log for all users? Or is the assumption that only one user per llama-stack server is supported? (sorry if this is a stupid question, still making my way through lls)

no stupid questions :)

We'll follow up with attribute based auth support, a la what was done in agents API: #1737

@ehhuang
Contributor Author

ehhuang commented May 12, 2025

Relying on untyped attributes also hurts reliability.

I think this is the key argument and it makes sense. There are a few things to discuss here:

  • Would you have similar endpoints for Responses? There's a /responses/{:id} endpoint for Responses also; how would this (log) namespace cohabit with that pre-existing one?
  • Continuing on that thread, we have /agents/{:id} retrieval APIs also for listing and retrieving information about Sessions, Turns, etc.
  • So in a certain way, this isn't quite telemetry actually but something more opinionated and higher level -- namely, for each completion you pass a store: boolean value and then state is kept by the Stack. And then, you would situate these (query) APIs within Api.inference itself.

I used the log namespace mainly to avoid confusion:

  1. There's no list-responses (not in the OpenAI API either), so that would have to be added. Adding it to the same OpenAI path could be confusing.
  2. Similarly, while OpenAI chat completions does support a 'list chat completions' op, it doesn't return input messages anywhere. I don't think changing the behavior of the OpenAI-compat endpoint (GET /openai/v1/chat/completions) is a good idea.

We could have it under the corresponding API's namespace, but I guess the path wouldn't be RESTful given the above? I.e., for inference, we could have GET /inference/openai_chat_completions for 'list openai chat completions'? Maybe this is fine, as it's not that different from GET /log/openai_chat_completions anyway. I can move it.

An alternate design could be to double down on Telemetry itself and say (in the spec, somehow) what each API call must do as Telemetry side-effects with standardized attributes. Then, you can lean on the more general Telemetry APIs for querying.

==== My inclination is do what you are proposing: APIs which serve a clear purpose work better and can both be relied upon and correctly deprecated. Adding very generic requirements to very general APIs (prematurely while the set of use cases isn't mature) is a recipe for disaster.

Yes, agreed.

@ashwinb
Contributor

ashwinb commented May 13, 2025

@ehhuang yeah I don't think we should use a new namespace (log) because this isn't quite generic logging and we already have something for that (namely, telemetry). This is basically some new stateful APIs we want to have for a well-defined purpose, so let's locate them in the most specific place we can.

Relatedly, assuming we standardize on to the OpenAI inference APIs, do we need openai_ prefix in the URL path?

@ehhuang
Contributor Author

ehhuang commented May 13, 2025

@ehhuang yeah I don't think we should use a new namespace (log) because this isn't quite generic logging and we already have something for that (namely, telemetry). This is basically some new stateful APIs we want to have for a well-defined purpose, so let's locate them in the most specific place we can.

Relatedly, assuming we standardize on to the OpenAI inference APIs, do we need openai_ prefix in the URL path?

OK, I can move it to /inference and get rid of the openai_ prefix, assuming the stack chat_completions will soon be deprecated.

@rhuss
Contributor

rhuss commented May 14, 2025

Sorry for my ignorance, but how is this supposed to work in a multi-user environment to return only user-specific completions and not every log for all users? Or is the assumption that only one user per llama-stack server is supported? (sorry if this is a stupid question, still making my way through lls)

no stupid questions :)

We'll follow up with attribute based auth support, a la what was done in agents API: #1737

Thanks for getting back to me :-). I understand the authentication part. What I still don't get is how a user gets back only her own chat completions and not chat completions from every user. Looking at

cursor.execute("SELECT * FROM chat_completions")

for list_chat_completions, I don't see how this is filtered by the current user. Or does every user get their own backend DB? 🤔
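(A per-user variant would presumably need something like the sketch below, with a hypothetical owner_id column and current-user lookup, neither of which I see in the PR:)

# hypothetical: restrict results to the authenticated caller;
# the owner_id column and current_user_id do not exist in this PR
cursor.execute(
    "SELECT * FROM chat_completions WHERE owner_id = ?",
    (current_user_id,),
)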

The difference between telemetry and conversational logs, IMO, is that telemetry is a global concern, with access at an admin level, whereas conversational logs target the end-user persona and need to be specific to each user. For that (and other) reasons, I wouldn't put user-specific logs under the telemetry category. Global logs, though, are definitely observability features (like traces and metrics).

Collaborator

@leseb leseb left a comment

In general, the name “log” makes me think of server logs, but that might just be my perspective. Because of that, this feels more like an admin-level API. While I fully understand the UI use case, could we also explore other scenarios that would justify making this API “public” rather than keeping it as an internal-only interface?

I get that using SQLite to store logs is a simple starting point, but in a real-world scenario, this new API would be a strong candidate for Redis or MongoDB, as the data will likely grow quickly.

Also, I thought we had agreed to have one PR for the API design and a separate PR for the implementation. :)

Lastly, we still need tests! 😄 Thanks!



class LogConfig(BaseModel):
log_db_path: str = Field(
Collaborator

Suggested change
log_db_path: str = Field(
log_db_path: Path = Field(

Use Path instead of str?

async def store_chat_completion(self, chat_completion: ChatCompletion) -> None: ...

@webmethod(route="/log/openai_chat_completions", method="GET")
async def list_chat_completions(self) -> ListChatCompletionsResponse:
Collaborator

I believe this should return a PaginatedResponse.
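Something along these lines (a sketch only; treat the field names here as assumptions and check the actual PaginatedResponse definition):

from typing import Any

from pydantic import BaseModel


class PaginatedResponse(BaseModel):
    data: list[dict[str, Any]]  # one page of results
    has_more: bool  # whether another page can be fetched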

def convert_response_to_chat_completion(response: ChatCompletionResponse) -> ChatCompletion:
from rich.pretty import pprint

pprint(response)
Collaborator

leftover?

conn.commit()
cursor.close()

async def list_chat_completions(self) -> ListChatCompletionsResponse:
Collaborator

We should paginate the output

Contributor

+100

conn.row_factory = sqlite3.Row
cursor = conn.cursor()

cursor.execute("SELECT * FROM chat_completions")
Collaborator

Use a range with pagination?
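For example, a keyset-pagination sketch (the created cursor column and parameter names are hypothetical):

# fetch one page at a time instead of the whole table; assumes a
# monotonically increasing `created` column (names hypothetical)
cursor.execute(
    "SELECT * FROM chat_completions WHERE created > ? ORDER BY created LIMIT ?",
    (after_created, page_size),
)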

Contributor

Not sure a particular user should be able to see chat_completions from all other users, too, except when LLS is operated in a single-user setup (but see my comments above, too).

@ehhuang ehhuang changed the title from "feat: RFC: introduce log APIs" to "feat: RFC: introduce APIs for retrieving chat completion requests" on May 14, 2025
@ehhuang
Contributor Author

ehhuang commented May 14, 2025

Thank you all for the feedback so far. As discussed, this will not be a separate /log namespace, and I will move this into /inference instead. Will put it up again when ready.

@ehhuang ehhuang marked this pull request as draft May 14, 2025 20:14
@ehhuang ehhuang force-pushed the pr2145 branch 5 times, most recently from 24b46cc to 966fc32 on May 16, 2025 05:03
@ehhuang ehhuang force-pushed the pr2145 branch 2 times, most recently from 96194a1 to dfdd9ef on May 16, 2025 05:07
@ehhuang
Contributor Author

ehhuang commented May 16, 2025

I've updated the PR to only include the API definition, and updated the description to reflect that. The API implementation will be in a follow-up PR.

@ehhuang ehhuang marked this pull request as ready for review May 16, 2025 05:10
@@ -759,7 +759,7 @@ def _build_operation(self, op: EndpointOperation) -> Operation:
)

return Operation(
tags=[op.defining_class.__name__],
tags=[op.defining_class.__name__ if op.defining_class.__name__ != "InferenceProvider" else "Inference"],
Contributor

is there a way to make this general somehow?

Contributor

could we stick some attribute in the protocol perhaps?

@bbrowning
Collaborator

Just a note as we think about OpenAI-API compatibility for inference, we'll need the following API endpoints along with the parameters and such from the API spec (for example https://platform.openai.com/docs/api-reference/chat/list) if we're aiming for 100% compatibility:

  • get a chat completion: GET v1/chat/completions/{completion_id}
  • list messages for a completion: GET v1/chat/completions/{completion_id}/messages
  • list chat completions: GET v1/chat/completions/
  • update a chat completion: POST v1/chat/completions/{completion_id}
  • delete a chat completion: DELETE v1/chat/completions/{completion_id}
  • get a response: GET v1/responses/{response_id} - already implemented at
    @webmethod(route="/openai/v1/responses/{id}", method="GET")
  • delete a response: DELETE v1/responses/{response_id} - PR at feat: Add webmethod for deleting openai responses #2160
  • list input items in a response: GET v1/responses/{response_id}/input_items

If we're following that API spec, then it also makes decisions for us on things like pagination and sort keys. This isn't to say we have to follow it exactly, as these other methods are not as widely used in the wild, from what I've seen, compared to the endpoints that create chat completions and responses. It may be worth at least considering, since we want overlapping functionality.

@ehhuang
Contributor Author

ehhuang commented May 16, 2025

Just a note as we think about OpenAI-API compatibility for inference, we'll need the following API endpoints along with the parameters and such from the API spec (for example https://platform.openai.com/docs/api-reference/chat/list) if we're aiming for 100% compatibility:

  • get a chat completion: GET v1/chat/completions/{completion_id}
  • list messages for a completion: GET v1/chat/completions/{completion_id}/messages
  • list chat completions: GET v1/chat/completions/
  • update a chat completion: POST v1/chat/completions/{completion_id}
  • delete a chat completion: DELETE v1/chat/completions/{completion_id}
  • get a response: GET v1/responses/{response_id} - already implemented at
    @webmethod(route="/openai/v1/responses/{id}", method="GET")
  • delete a response: DELETE v1/responses/{response_id} - PR at feat: Add webmethod for deleting openai responses #2160
  • list input items in a response: GET v1/responses/{response_id}/input_items

If we're following that API spec, then it also makes decisions for us on things like pagination and sort keys. This isn't to say we have to follow it exactly, as these other methods are not as widely used in the wild, from what I've seen, compared to the endpoints that create chat completions and responses. It may be worth at least considering, since we want overlapping functionality.

This is indeed a discussion point.

Note the APIs here are not under /openai/v1/, and hence they're not compatible. I opted to have it not under openai since 1) OpenAI's list chat completions doesn't return input messages, and 2) it has fields that aren't immediately applicable to us (e.g. system_fingerprint, service_tier, etc.). An alternative is to make this OpenAI-compliant, add another field input_messages, and leave some fields empty/unimplemented.

Thoughts? cc @ashwinb

@ashwinb
Contributor

ashwinb commented May 16, 2025

@bbrowning @ehhuang

Yes, let's just go for OpenAI compatibility at this point for the inference-related APIs. If they can make it work, we can make it work too. I had missed that angle when I looked at it first.

@ehhuang ehhuang force-pushed the pr2145 branch 2 times, most recently from c2ab26c to 6093af5 on May 16, 2025 18:25
@ehhuang
Contributor Author

ehhuang commented May 16, 2025

I updated the PR to make it fully OAI-compliant and under openai/v1/. The returned data has an additional input_messages field. Note that OAI has a separate v1/dashboard/chat/completions endpoint, called from their UI, which has the additional messages field compared to the non-dashboard version. Not sure if there's any concern with just having one API.
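Roughly, the retrieval surface now looks like this (a sketch; the signatures and response type names here are illustrative, and the query parameters mirror OpenAI's list endpoint):

# illustrative signatures; response types are defined elsewhere in the PR
@webmethod(route="/openai/v1/chat/completions", method="GET")
async def list_chat_completions(
    self,
    after: str | None = None,  # cursor: completion ID to page from
    limit: int | None = 20,  # page size (OpenAI defaults to 20)
    model: str | None = None,  # filter by model
    order: str | None = None,  # sort by creation time: "asc" or "desc"
) -> ListOpenAIChatCompletionResponse: ...

@webmethod(route="/openai/v1/chat/completions/{completion_id}", method="GET")
async def get_chat_completion(
    self, completion_id: str
) -> OpenAICompletionWithInputMessages: ...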

@@ -759,7 +759,7 @@ def _build_operation(self, op: EndpointOperation) -> Operation:
)

return Operation(
tags=[op.defining_class.__name__],
tags=[getattr(op.defining_class, "API_NAMESPACE", op.defining_class.__name__)],
Contributor

❤️

"""

API_NAMESPACE: str = "Inference"
Contributor

yeah this is perfect
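In other words, the pattern boils down to something like this sketch:

from typing import Protocol


class InferenceProvider(Protocol):
    # the OpenAPI generator reads this tag instead of the class name,
    # so provider-implemented routes still show up under "Inference"
    API_NAMESPACE: str = "Inference"

    # ... methods implemented by inference providers ...


class Inference(InferenceProvider):
    # stack-provided retrieval APIs (list/get chat completions) live here
    ...


# in the generator (matches the diff above):
tag = getattr(InferenceProvider, "API_NAMESPACE", InferenceProvider.__name__)
assert tag == "Inference"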

@ehhuang ehhuang changed the title from "feat: RFC: introduce APIs for retrieving chat completion requests" to "feat: introduce APIs for retrieving chat completion requests" on May 19, 2025
@ehhuang ehhuang merged commit 047303e into meta-llama:main May 19, 2025
30 of 52 checks passed