Commit 2890243

liangwen12year (Wen Liang) and Wen Liang authored
feat(quota): add server‑side per‑client request quotas (requires auth) (#2096)
# What does this PR do?

Unrestricted usage can lead to runaway costs and fragmented client-side workarounds. This commit introduces a native quota mechanism to the server, giving operators a unified, centrally managed throttle for per-client requests, without needing extra proxies or custom client logic. This helps contain cloud-compute expenses, enables fine-grained usage control, and simplifies deployment and monitoring of Llama Stack services.

Quotas are fully opt-in and have no effect unless explicitly configured.

Notice that Quotas are fully opt-in and require authentication to be enabled. The 'sqlite' is the only supported quota `type` at this time, any other `type` will be rejected. And the only supported `period` is 'day'.

Highlights:

- Adds `QuotaMiddleware` to enforce per-client request quotas:
  - Uses `Authorization: Bearer <client_id>` (from AuthenticationMiddleware)
  - Tracks usage via a SQLite-based KV store
  - Returns 429 when the quota is exceeded
- Extends `ServerConfig` with a `quota` section (type + config)
- Enforces strict coupling: quotas require authentication or the server will fail to start

Behavior changes:

- Quotas are disabled by default unless explicitly configured
- SQLite defaults to `./quotas.db` if no DB path is set
- The server requires authentication when quotas are enabled

To enable per-client request quotas in `run.yaml`, add:

```
server:
  port: 8321
  auth:
    provider_type: "custom"
    config:
      endpoint: "https://auth.example.com/validate"
  quota:
    type: sqlite
    config:
      db_path: ./quotas.db
      limit:
        max_requests: 1000
        period: day
```

Closes #2093

Signed-off-by: Wen Liang <[email protected]>
Co-authored-by: Wen Liang <[email protected]>
1 parent 5a3d777 commit 2890243

6 files changed: +363 -1 lines changed

docs/source/distributions/configuration.md

Lines changed: 74 additions & 0 deletions
@@ -208,6 +208,80 @@ And must respond with:

 If no access attributes are returned, the token is used as a namespace.

+### Quota Configuration
+
+The `quota` section allows you to enable server-side request throttling for both
+authenticated and anonymous clients. This is useful for preventing abuse, enforcing
+fairness across tenants, and controlling infrastructure costs without requiring
+client-side rate limiting or external proxies.
+
+Quotas are disabled by default. When enabled, each client is tracked using either:
+
+* Their authenticated `client_id` (derived from the Bearer token), or
+* Their IP address (fallback for anonymous requests)
+
+Quota state is stored in a SQLite-backed key-value store, and rate limits are applied
+within a configurable time window (currently only `day` is supported).
+
+#### Example
+
+```yaml
+server:
+  quota:
+    kvstore:
+      type: sqlite
+      db_path: ./quotas.db
+    anonymous_max_requests: 100
+    authenticated_max_requests: 1000
+    period: day
+```
+
+#### Configuration Options
+
+| Field                        | Description                                                                |
+| ---------------------------- | -------------------------------------------------------------------------- |
+| `kvstore`                    | Required. Backend storage config for tracking request counts.              |
+| `kvstore.type`               | Must be `"sqlite"` for now. Other backends may be supported in the future. |
+| `kvstore.db_path`            | File path to the SQLite database.                                          |
+| `anonymous_max_requests`     | Max requests per period for unauthenticated clients.                       |
+| `authenticated_max_requests` | Max requests per period for authenticated clients.                         |
+| `period`                     | Time window for quota enforcement. Only `"day"` is supported.              |
+
+> Note: if `authenticated_max_requests` is set but no authentication provider is
+configured, the server will fall back to applying `anonymous_max_requests` to all
+clients.
+
+#### Example with Authentication Enabled
+
+```yaml
+server:
+  port: 8321
+  auth:
+    provider_type: custom
+    config:
+      endpoint: https://auth.example.com/validate
+  quota:
+    kvstore:
+      type: sqlite
+      db_path: ./quotas.db
+    anonymous_max_requests: 100
+    authenticated_max_requests: 1000
+    period: day
+```
+
+If a client exceeds their limit, the server responds with:
+
+```http
+HTTP/1.1 429 Too Many Requests
+Content-Type: application/json
+
+{
+  "error": {
+    "message": "Quota exceeded"
+  }
+}
+```
+
 ## Extending to handle Safety

 Configuring Safety can be a little involved so it is instructive to go through an example.
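
For client authors, the 429 response documented in this hunk can be recognised with a few lines of standard-library Python. This is a minimal sketch and not part of the commit: `is_quota_exceeded` is a hypothetical helper name, and it assumes the body has exactly the `{"error": {"message": ...}}` shape shown in the docs.

```python
import json


def is_quota_exceeded(status: int, body: bytes) -> bool:
    """Return True if a response looks like the documented quota error.

    Hypothetical helper: assumes the JSON shape shown in the docs hunk.
    """
    if status != 429:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return False
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    return isinstance(error, dict) and error.get("message") == "Quota exceeded"
```

A caller would typically stop retrying once this returns True, since the counter only resets when the daily window rolls over.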

llama_stack/distribution/datatypes.py

Lines changed: 18 additions & 1 deletion
@@ -25,7 +25,7 @@
 from llama_stack.apis.vector_dbs import VectorDB, VectorDBInput
 from llama_stack.apis.vector_io import VectorIO
 from llama_stack.providers.datatypes import Api, ProviderSpec
-from llama_stack.providers.utils.kvstore.config import KVStoreConfig
+from llama_stack.providers.utils.kvstore.config import KVStoreConfig, SqliteKVStoreConfig

 LLAMA_STACK_BUILD_CONFIG_VERSION = "2"
 LLAMA_STACK_RUN_CONFIG_VERSION = "2"
@@ -235,6 +235,19 @@ class AuthenticationConfig(BaseModel):
     )


+class QuotaPeriod(str, Enum):
+    DAY = "day"
+
+
+class QuotaConfig(BaseModel):
+    kvstore: SqliteKVStoreConfig = Field(description="Config for KV store backend (SQLite only for now)")
+    anonymous_max_requests: int = Field(default=100, description="Max requests for unauthenticated clients per period")
+    authenticated_max_requests: int = Field(
+        default=1000, description="Max requests for authenticated clients per period"
+    )
+    period: QuotaPeriod = Field(default=QuotaPeriod.DAY, description="Quota period to set")
+
+
 class ServerConfig(BaseModel):
     port: int = Field(
         default=8321,
@@ -262,6 +275,10 @@ class ServerConfig(BaseModel):
         default=None,
         description="The host the server should listen on",
     )
+    quota: QuotaConfig | None = Field(
+        default=None,
+        description="Per client quota request configuration",
+    )


 class StackRunConfig(BaseModel):
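
To see how the defaults in `QuotaConfig` play out, here is a rough standard-library stand-in. It is only a sketch: the commit uses pydantic models, whereas `QuotaSettings` and `parse_quota` are hypothetical names, and requiring `db_path` is a simplification of this example rather than a statement about `SqliteKVStoreConfig`.

```python
from dataclasses import dataclass
from enum import Enum


class QuotaPeriod(str, Enum):
    DAY = "day"


@dataclass
class QuotaSettings:
    """Stand-in for QuotaConfig (hypothetical; the real model is a pydantic BaseModel)."""

    db_path: str
    anonymous_max_requests: int = 100
    authenticated_max_requests: int = 1000
    period: QuotaPeriod = QuotaPeriod.DAY


def parse_quota(raw: dict) -> QuotaSettings:
    """Build settings from a dict shaped like the `server.quota` YAML section."""
    kvstore = raw["kvstore"]
    if kvstore.get("type") != "sqlite":
        raise ValueError("only the sqlite kvstore type is supported")
    return QuotaSettings(
        db_path=kvstore["db_path"],  # required here; an assumption of this sketch
        anonymous_max_requests=raw.get("anonymous_max_requests", 100),
        authenticated_max_requests=raw.get("authenticated_max_requests", 1000),
        # Enum lookup raises ValueError for anything other than "day".
        period=QuotaPeriod(raw.get("period", "day")),
    )
```

Omitting both limits yields 100 and 1000 requests per day, and an unsupported period such as `"week"` is rejected at parse time.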

llama_stack/distribution/server/auth.py

Lines changed: 4 additions & 0 deletions
@@ -113,6 +113,10 @@ async def __call__(self, scope, receive, send):
                     "roles": [token],
                 }

+            # Store the client ID in the request scope so that downstream middleware (like QuotaMiddleware)
+            # can identify the requester and enforce per-client rate limits.
+            scope["authenticated_client_id"] = token
+
             # Store attributes in request scope
             scope["user_attributes"] = user_attributes
             scope["principal"] = validation_result.principal
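
The new scope entry is what lets a later middleware tell authenticated and anonymous callers apart. The selection rule can be sketched as a plain function over an ASGI-style scope dict. `pick_key_and_limit` is a hypothetical name used only for illustration, and the example assumes `scope["client"]` is a `(host, port)` pair as in the ASGI HTTP scope.

```python
def pick_key_and_limit(scope: dict, anonymous_max: int, authenticated_max: int) -> tuple[str, int]:
    """Choose the quota identity and limit for one request (illustrative only)."""
    auth_id = scope.get("authenticated_client_id")
    if auth_id:
        # Set by the authentication middleware when a Bearer token was validated.
        return auth_id, authenticated_max
    # No authenticated identity: fall back to the peer IP, or a shared bucket.
    client = scope.get("client")
    return (client[0] if client else "anonymous"), anonymous_max
```

Note that every request without a known peer address lands in the single `"anonymous"` bucket and shares one counter.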
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+import json
+import time
+from datetime import datetime, timedelta, timezone
+
+from starlette.types import ASGIApp, Receive, Scope, Send
+
+from llama_stack.log import get_logger
+from llama_stack.providers.utils.kvstore.api import KVStore
+from llama_stack.providers.utils.kvstore.config import KVStoreConfig, SqliteKVStoreConfig
+from llama_stack.providers.utils.kvstore.kvstore import kvstore_impl
+
+logger = get_logger(name=__name__, category="quota")
+
+
+class QuotaMiddleware:
+    """
+    ASGI middleware that enforces separate quotas for authenticated and anonymous clients
+    within a configurable time window.
+
+    - For authenticated requests, it reads the client ID from the
+      `Authorization: Bearer <client_id>` header.
+    - For anonymous requests, it falls back to the IP address of the client.
+    Requests are counted in a KV store (e.g., SQLite), and HTTP 429 is returned
+    once a client exceeds its quota.
+    """
+
+    def __init__(
+        self,
+        app: ASGIApp,
+        kv_config: KVStoreConfig,
+        anonymous_max_requests: int,
+        authenticated_max_requests: int,
+        window_seconds: int = 86400,
+    ):
+        self.app = app
+        self.kv_config = kv_config
+        self.kv: KVStore | None = None
+        self.anonymous_max_requests = anonymous_max_requests
+        self.authenticated_max_requests = authenticated_max_requests
+        self.window_seconds = window_seconds
+
+        if isinstance(self.kv_config, SqliteKVStoreConfig):
+            logger.warning(
+                "QuotaMiddleware: Using SQLite backend. Expiry/TTL is not enforced; cleanup is manual. "
+                f"window_seconds={self.window_seconds}"
+            )
+
+    async def _get_kv(self) -> KVStore:
+        if self.kv is None:
+            self.kv = await kvstore_impl(self.kv_config)
+        return self.kv
+
+    async def __call__(self, scope: Scope, receive: Receive, send: Send):
+        if scope["type"] == "http":
+            # pick key & limit based on auth
+            auth_id = scope.get("authenticated_client_id")
+            if auth_id:
+                key_id = auth_id
+                limit = self.authenticated_max_requests
+            else:
+                # fallback to IP
+                client = scope.get("client")
+                key_id = client[0] if client else "anonymous"
+                limit = self.anonymous_max_requests
+
+            current_window = int(time.time() // self.window_seconds)
+            key = f"quota:{key_id}:{current_window}"
+
+            try:
+                kv = await self._get_kv()
+                prev = await kv.get(key) or "0"
+                count = int(prev) + 1
+
+                if int(prev) == 0:
+                    # Set with expiration datetime when it is the first request in the window.
+                    expiration = datetime.now(timezone.utc) + timedelta(seconds=self.window_seconds)
+                    await kv.set(key, str(count), expiration=expiration)
+                else:
+                    await kv.set(key, str(count))
+            except Exception:
+                logger.exception("Failed to access KV store for quota")
+                return await self._send_error(send, 500, "Quota service error")
+
+            if count > limit:
+                logger.warning(
+                    "Quota exceeded for client %s: %d/%d",
+                    key_id,
+                    count,
+                    limit,
+                )
+                return await self._send_error(send, 429, "Quota exceeded")
+
+        return await self.app(scope, receive, send)
+
+    async def _send_error(self, send: Send, status: int, message: str):
+        await send(
+            {
+                "type": "http.response.start",
+                "status": status,
+                "headers": [[b"content-type", b"application/json"]],
+            }
+        )
+        body = json.dumps({"error": {"message": message}}).encode()
+        await send({"type": "http.response.body", "body": body})
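
The counting scheme in `QuotaMiddleware` is a fixed-window counter keyed by client and window index. The sketch below reproduces just that logic with an in-memory dict so the window arithmetic can be tried in isolation. `FixedWindowCounter` is a hypothetical class, not part of the commit, and it deliberately ignores the KV store, expiry, and concurrency.

```python
class FixedWindowCounter:
    """In-memory stand-in for the KV-backed counter (hypothetical class)."""

    def __init__(self, limit: int, window_seconds: int = 86400):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts: dict[str, int] = {}

    def key_for(self, client_id: str, now: float) -> str:
        # Same key layout as the middleware: quota:<client>:<window index>
        window = int(now // self.window_seconds)
        return f"quota:{client_id}:{window}"

    def allow(self, client_id: str, now: float) -> bool:
        key = self.key_for(client_id, now)
        count = self.counts.get(key, 0) + 1
        # Rejected requests still increment the counter, as in the middleware.
        self.counts[key] = count
        return count <= self.limit
```

Because the window index comes from integer division of epoch seconds, a `day` window rolls over at 00:00 UTC rather than 24 hours after a client's first request. The middleware's separate `get` and `set` calls also mean the increment is not atomic, so concurrent requests from one client could in principle undercount.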

llama_stack/distribution/server/server.py

Lines changed: 30 additions & 0 deletions
@@ -60,6 +60,7 @@

 from .auth import AuthenticationMiddleware
 from .endpoints import get_all_api_endpoints
+from .quota import QuotaMiddleware

 REPO_ROOT = Path(__file__).parent.parent.parent.parent

@@ -434,6 +435,35 @@ def main(args: argparse.Namespace | None = None):
     if config.server.auth:
         logger.info(f"Enabling authentication with provider: {config.server.auth.provider_type.value}")
         app.add_middleware(AuthenticationMiddleware, auth_config=config.server.auth)
+    else:
+        if config.server.quota:
+            quota = config.server.quota
+            logger.warning(
+                "Configured authenticated_max_requests (%d) but no auth is enabled; "
+                "falling back to anonymous_max_requests (%d) for all the requests",
+                quota.authenticated_max_requests,
+                quota.anonymous_max_requests,
+            )
+
+    if config.server.quota:
+        logger.info("Enabling quota middleware for authenticated and anonymous clients")
+
+        quota = config.server.quota
+        anonymous_max_requests = quota.anonymous_max_requests
+        # if auth is disabled, use the anonymous max requests
+        authenticated_max_requests = quota.authenticated_max_requests if config.server.auth else anonymous_max_requests
+
+        kv_config = quota.kvstore
+        window_map = {"day": 86400}
+        window_seconds = window_map[quota.period.value]
+
+        app.add_middleware(
+            QuotaMiddleware,
+            kv_config=kv_config,
+            anonymous_max_requests=anonymous_max_requests,
+            authenticated_max_requests=authenticated_max_requests,
+            window_seconds=window_seconds,
+        )

     try:
         impls = asyncio.run(construct_stack(config))
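
The wiring in `main()` boils down to resolving three values before the middleware is registered. A small sketch of that resolution, pulled out as a pure function for clarity; `resolve_quota_args` and `WINDOW_SECONDS` are hypothetical names that do not appear in the commit.

```python
WINDOW_SECONDS = {"day": 86400}  # mirrors window_map in main()


def resolve_quota_args(
    auth_enabled: bool,
    anonymous_max: int,
    authenticated_max: int,
    period: str = "day",
) -> dict:
    """Compute the keyword arguments that would be handed to the quota middleware."""
    return {
        "anonymous_max_requests": anonymous_max,
        # Without an auth provider, every client is held to the anonymous limit.
        "authenticated_max_requests": authenticated_max if auth_enabled else anonymous_max,
        "window_seconds": WINDOW_SECONDS[period],
    }
```

With authentication disabled and limits of 100 and 1000, both resolved limits come out as 100, which matches the warning logged in the `else` branch.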
