
feat: introduce APIs for retrieving chat completion requests #2145


Merged
merged 1 commit into meta-llama:main from pr2145 on May 19, 2025

Conversation

ehhuang
Contributor

@ehhuang ehhuang commented May 12, 2025

What does this PR do?

This PR introduces APIs to retrieve past chat completion requests, which will be used in the LS UI.

Our current Telemetry is ill-suited for this purpose as it's untyped so we'd need to filter by obscure attribute names, making it brittle.

Since these APIs are 'provided by stack' and don't need to be implemented by inference providers, we introduce a new InferenceProvider class, containing the existing inference protocol, which is implemented by inference providers.

The APIs are OpenAI-compliant, with an additional input_messages field.
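For illustration, a minimal sketch of the returned record shape, assuming OpenAI's chat completion object plus the extra field (the class name and simplified types here are illustrative, not the exact ones in the diff):

from pydantic import BaseModel


class OpenAIChatCompletionWithInputMessages(BaseModel):
    # standard OpenAI chat completion fields (simplified: choices and
    # messages are structured models in the real API, not plain dicts)
    id: str
    object: str = "chat.completion"
    created: int  # unix timestamp (seconds)
    model: str
    choices: list[dict]
    # the extra field this PR adds: the messages the completion was
    # generated from, so the LS UI can render the full conversation
    input_messages: list[dict]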

Test Plan

This PR just adds the API and marks them provided_by_stack. Start stack server -> doesn't crash

@facebook-github-bot facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on May 12, 2025
@ehhuang ehhuang changed the title from "apis" to "feat: RFC: introduce log APIs" on May 12, 2025
@ehhuang ehhuang marked this pull request as ready for review May 12, 2025 17:46
@rhuss
Contributor

rhuss commented May 12, 2025

Sorry for my ignorance, but how is this supposed to work in a multi-user environment to return only user-specific completions and not every log for all users? Or is the assumption that only one user per llama-stack server is supported? (sorry if this is a stupid question, still making my way through lls)

@ashwinb
Contributor

ashwinb commented May 12, 2025

Relying on untyped attributes also hurts reliability.

I think this is the key argument and it makes sense. There are a few things to discuss here:

  • Would you have similar endpoints for Responses? There's a /responses/{:id} endpoint for Responses also; how would this (log) namespace cohabit with that pre-existing one?
  • Continuing on that thread, we have /agents/{:id} retrieval APIs also for listing and retrieving information about Sessions, Turns, etc.
  • So in a certain way, this isn't quite telemetry actually but something more opinionated and higher level -- namely, for each completion you pass a store: boolean value and then state is kept by the Stack. And then, you would situate these (query) APIs within Api.inference itself.

An alternate design could be to double down on Telemetry itself and say (in the spec, somehow) what each API call must do as Telemetry side-effects with standardized attributes. Then, you can lean on the more general Telemetry APIs for querying.

====
My inclination is to do what you are proposing: APIs which serve a clear purpose work better and can both be relied upon and correctly deprecated. Adding very generic requirements to very general APIs (prematurely, while the set of use cases isn't mature) is a recipe for disaster.

@ehhuang
Contributor Author

ehhuang commented May 12, 2025

Sorry for my ignorance, but how is this supposed to work in a multi-user environment to return only user-specific completions and not every log for all users? Or is the assumption that only one user per llama-stack server is supported? (sorry if this is a stupid question, still making my way through lls)

no stupid questions :)

We'll follow up with attribute based auth support, a la what was done in agents API: #1737

@ehhuang
Contributor Author

ehhuang commented May 12, 2025

Relying on untyped attributes also hurts reliability.

I think this is the key argument and it makes sense. There are a few things to discuss here:

  • Would you have similar endpoints for Responses? There's a /responses/{:id} endpoint for Responses also; how would this (log) namespace cohabit with that pre-existing one?
  • Continuing on that thread, we have /agents/{:id} retrieval APIs also for listing and retrieving information about Sessions, Turns, etc.
  • So in a certain way, this isn't quite telemetry actually but something more opinionated and higher level -- namely, for each completion you pass a store: boolean value and then state is kept by the Stack. And then, you would situate these (query) APIs within Api.inference itself.

I used the log namespace mainly to avoid confusion:

  1. There's no list-responses (not in the OpenAI API either), so that would have to be added. Adding it to the same OpenAI path could be confusing.
  2. Similarly, while OpenAI chat completions does support a 'list chat completions' op, it doesn't return input messages anywhere. I don't think changing the behavior of the OpenAI-compat endpoint (GET /openai/v1/chat/completions) is a good idea.

We could have it under the corresponding API's namespace, but I guess the path wouldn't be RESTful given the above? I.e., for inference, we could have GET /inference/openai_chat_completions for 'list openai chat completions'? Maybe this is fine, as it's not that different from GET /log/openai_chat_completions anyway. I can move it.

An alternate design could be to double down on Telemetry itself and say (in the spec, somehow) what each API call must do as Telemetry side-effects with standardized attributes. Then, you can lean on the more general Telemetry APIs for querying.

==== My inclination is do what you are proposing: APIs which serve a clear purpose work better and can both be relied upon and correctly deprecated. Adding very generic requirements to very general APIs (prematurely while the set of use cases isn't mature) is a recipe for disaster.

Yes, agreed.

@ashwinb
Contributor

ashwinb commented May 13, 2025

@ehhuang yeah I don't think we should use a new namespace (log) because this isn't quite generic logging and we already have something for that (namely, telemetry). This is basically some new stateful APIs we want to have for a well-defined purpose, so let's locate them in the most specific place we can.

Relatedly, assuming we standardize on to the OpenAI inference APIs, do we need openai_ prefix in the URL path?

@ehhuang
Contributor Author

ehhuang commented May 13, 2025

@ehhuang yeah I don't think we should use a new namespace (log) because this isn't quite generic logging and we already have something for that (namely, telemetry). This is basically some new stateful APIs we want to have for a well-defined purpose, so let's locate them in the most specific place we can.

Relatedly, assuming we standardize on to the OpenAI inference APIs, do we need openai_ prefix in the URL path?

OK, I can move it to /inference and get rid of the openai_ prefix, assuming the stack chat_completions will soon be deprecated.

@rhuss
Contributor

rhuss commented May 14, 2025

Sorry for my ignorance, but how is this supposed to work in a multi-user environment to return only user-specific completions and not every log for all users? Or is the assumption that only one user per llama-stack server is supported? (sorry if this is a stupid question, still making my way through lls)

no stupid questions :)

We'll follow up with attribute based auth support, a la what was done in agents API: #1737

Thanks for getting back to me :-). I understand the authentication part. What I still don't get is how a user gets back only her own chat completions and not chat completions from every user. Looking at

cursor.execute("SELECT * FROM chat_completions")

for list_chat_completions, I don't see how this is filtered by the current user. Or does every user get their own backend DB? 🤔
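(A per-user variant would presumably need something like the sketch below, with a hypothetical owner_id column and current-user lookup, neither of which I see in the PR:)

# hypothetical: restrict results to the authenticated caller;
# the owner_id column and current_user_id do not exist in this PR
cursor.execute(
    "SELECT * FROM chat_completions WHERE owner_id = ?",
    (current_user_id,),
)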

The difference between telemetry and conversational logs, IMO, is that telemetry is a global concern, with access at an admin level, whereas conversational logs target the end-user persona and need to be specific to each user. For that (and other) reasons, I wouldn't put user-specific logs under the telemetry category. Global logs, though, are definitely observability features (like traces and metrics).

Collaborator

@leseb leseb left a comment

In general, the name “log” makes me think of server logs, but that might just be my perspective. Because of that, this feels more like an admin-level API. While I fully understand the UI use case, could we also explore other scenarios that would justify making this API “public” rather than keeping it as an internal-only interface?

I get that using SQLite to store logs is a simple starting point, but in a real-world scenario, this new API would be a strong candidate for Redis or MongoDB, as the data will likely grow quickly.

Also, I thought we had agreed to have one PR for the API design and a separate PR for the implementation. :)

Lastly, we still need tests! 😄 Thanks!



class LogConfig(BaseModel):
log_db_path: str = Field(
Collaborator

Suggested change
log_db_path: str = Field(
log_db_path: Path = Field(

Use Path instead of str?

async def store_chat_completion(self, chat_completion: ChatCompletion) -> None: ...

@webmethod(route="/log/openai_chat_completions", method="GET")
async def list_chat_completions(self) -> ListChatCompletionsResponse:
Collaborator

I believe this should return a PaginatedResponse.
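Something along these lines (a sketch only; treat the field names here as assumptions and check the actual PaginatedResponse definition):

from typing import Any

from pydantic import BaseModel


class PaginatedResponse(BaseModel):
    data: list[dict[str, Any]]  # one page of results
    has_more: bool  # whether another page can be fetched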

def convert_response_to_chat_completion(response: ChatCompletionResponse) -> ChatCompletion:
from rich.pretty import pprint

pprint(response)
Collaborator

leftover?

conn.commit()
cursor.close()

async def list_chat_completions(self) -> ListChatCompletionsResponse:
Collaborator

We should paginate the output

Contributor

+100

conn.row_factory = sqlite3.Row
cursor = conn.cursor()

cursor.execute("SELECT * FROM chat_completions")
Collaborator

Use a range with pagination?
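For example, a keyset-pagination sketch (the created cursor column and parameter names are hypothetical):

# fetch one page at a time instead of the whole table; assumes a
# monotonically increasing `created` column (names hypothetical)
cursor.execute(
    "SELECT * FROM chat_completions WHERE created > ? ORDER BY created LIMIT ?",
    (after_created, page_size),
)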

Contributor

Not sure a particular user should be able to see chat_completions from all other users, too, except when LLS is operated in a single-user setup (but see my comments above, too).

@ehhuang ehhuang changed the title from "feat: RFC: introduce log APIs" to "feat: RFC: introduce APIs for retrieving chat completion requests" on May 14, 2025
@ehhuang
Contributor Author

ehhuang commented May 14, 2025

Thank you all for the feedback so far. As discussed, this will not be a separate /log namespace, and I will move this into /inference instead. Will put it up again when ready.

@ehhuang ehhuang marked this pull request as draft May 14, 2025 20:14
@ehhuang ehhuang force-pushed the pr2145 branch 5 times, most recently from 24b46cc to 966fc32 on May 16, 2025 05:03
@ehhuang ehhuang force-pushed the pr2145 branch 2 times, most recently from 96194a1 to dfdd9ef on May 16, 2025 05:07
@ehhuang
Contributor Author

ehhuang commented May 16, 2025

I've updated the PR to only include the API definition, and updated the description to reflect that. The API implementation will be in a follow-up PR.

@ehhuang ehhuang marked this pull request as ready for review May 16, 2025 05:10
@@ -759,7 +759,7 @@ def _build_operation(self, op: EndpointOperation) -> Operation:
)

return Operation(
tags=[op.defining_class.__name__],
tags=[op.defining_class.__name__ if op.defining_class.__name__ != "InferenceProvider" else "Inference"],
Contributor

is there a way to make this general somehow?

Contributor

could we stick some attribute in the protocol perhaps?

@bbrowning
Collaborator

Just a note as we think about OpenAI-API compatibility for inference, we'll need the following API endpoints along with the parameters and such from the API spec (for example https://platform.openai.com/docs/api-reference/chat/list) if we're aiming for 100% compatibility:

  • get a chat completion: GET v1/chat/completions/{completion_id}
  • list messages for a completion: GET v1/chat/completions/{completion_id}/messages
  • list chat completions: GET v1/chat/completions/
  • update a chat completion: POST v1/chat/completions/{completion_id}
  • delete a chat completion: DELETE v1/chat/completions/{completion_id}
  • get a response: GET v1/responses/{response_id} - already implemented at
    @webmethod(route="/openai/v1/responses/{id}", method="GET")
  • delete a response: DELETE v1/responses/{response_id} - PR at feat: Add webmethod for deleting openai responses #2160
  • list input items in a response: GET v1/responses/{response_id}/input_items

If we're following that API spec, then it also makes decisions for us on things like pagination and sort keys. This isn't to say we have to follow it exactly, as these other methods are not as widely used in the wild, from what I've seen, compared to the endpoints that create chat completions and responses. It may be worth at least considering, since we want overlapping functionality.

@ehhuang
Contributor Author

ehhuang commented May 16, 2025

Just a note as we think about OpenAI-API compatibility for inference, we'll need the following API endpoints along with the parameters and such from the API spec (for example https://platform.openai.com/docs/api-reference/chat/list) if we're aiming for 100% compatibility:

  • get a chat completion: GET v1/chat/completions/{completion_id}
  • list messages for a completion: GET v1/chat/completions/{completion_id}/messages
  • list chat completions: GET v1/chat/completions/
  • update a chat completion: POST v1/chat/completions/{completion_id}
  • delete a chat completion: DELETE v1/chat/completions/{completion_id}
  • get a response: GET v1/responses/{response_id} - already implemented at
    @webmethod(route="/openai/v1/responses/{id}", method="GET")
  • delete a response: DELETE v1/responses/{response_id} - PR at feat: Add webmethod for deleting openai responses #2160
  • list input items in a response: GET v1/responses/{response_id}/input_items

If we're following that API spec, then it also makes decisions for us on things like pagination and sort keys. This isn't to say we have to follow it exactly, as these other methods are not as widely used in the wild, from what I've seen, compared to the endpoints that create chat completions and responses. It may be worth at least considering, since we want overlapping functionality.

This is indeed a discussion point.

Note the APIs here are not under /openai/v1/, and hence they're not compatible. I opted to have it not under openai since 1) OpenAI's list chat completions doesn't return input messages, and 2) it has fields that aren't immediately applicable to us (e.g. system_fingerprint, service_tier, etc.). An alternative is to make this OpenAI-compliant, add another field input_messages, and leave some fields empty/unimplemented.

Thoughts? cc @ashwinb

@ashwinb
Contributor

ashwinb commented May 16, 2025

@bbrowning @ehhuang

Yes, let's just go for OpenAI compatibility at this point for the inference-related APIs. If they can make it work, we can make it work too. I had missed that angle when I looked at it first.

@ehhuang ehhuang force-pushed the pr2145 branch 2 times, most recently from c2ab26c to 6093af5 on May 16, 2025 18:25
@ehhuang
Contributor Author

ehhuang commented May 16, 2025

I updated the PR to make it fully OAI-compliant and under openai/v1/. The returned data has an additional input_messages field. Note that OAI has a separate v1/dashboard/chat/completions endpoint, called from their UI, which has the additional messages field compared to the non-dashboard version. Not sure if there's any concern with just having one API.
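Roughly, the retrieval surface now looks like this (a sketch; the signatures and response type names here are illustrative, and the query parameters mirror OpenAI's list endpoint):

# illustrative signatures; response types are defined elsewhere in the PR
@webmethod(route="/openai/v1/chat/completions", method="GET")
async def list_chat_completions(
    self,
    after: str | None = None,  # cursor: completion ID to page from
    limit: int | None = 20,  # page size (OpenAI defaults to 20)
    model: str | None = None,  # filter by model
    order: str | None = None,  # sort by creation time: "asc" or "desc"
) -> ListOpenAIChatCompletionResponse: ...

@webmethod(route="/openai/v1/chat/completions/{completion_id}", method="GET")
async def get_chat_completion(
    self, completion_id: str
) -> OpenAICompletionWithInputMessages: ...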

@@ -759,7 +759,7 @@ def _build_operation(self, op: EndpointOperation) -> Operation:
)

return Operation(
tags=[op.defining_class.__name__],
tags=[getattr(op.defining_class, "API_NAMESPACE", op.defining_class.__name__)],
Contributor

❤️

"""

API_NAMESPACE: str = "Inference"
Contributor

yeah this is perfect
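In other words, the pattern boils down to something like this sketch:

from typing import Protocol


class InferenceProvider(Protocol):
    # the OpenAPI generator reads this tag instead of the class name,
    # so provider-implemented routes still show up under "Inference"
    API_NAMESPACE: str = "Inference"

    # ... methods implemented by inference providers ...


class Inference(InferenceProvider):
    # stack-provided retrieval APIs (list/get chat completions) live here
    ...


# in the generator (matches the diff above):
tag = getattr(InferenceProvider, "API_NAMESPACE", InferenceProvider.__name__)
assert tag == "Inference"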

@ehhuang ehhuang changed the title from "feat: RFC: introduce APIs for retrieving chat completion requests" to "feat: introduce APIs for retrieving chat completion requests" on May 19, 2025
@ehhuang ehhuang merged commit 047303e into meta-llama:main May 19, 2025
30 of 52 checks passed