
Commit 4a6b42d

liangwen12year (Wen Liang) authored and Wen Liang committed
feat(quota): add server‑side per‑client request quotas (requires auth)
Unrestricted usage can lead to runaway costs and fragmented client-side workarounds. This commit introduces a native quota mechanism to the server, giving operators a unified, centrally managed throttle for per-client requests, without needing extra proxies or custom client logic. This helps contain cloud-compute expenses, enables fine-grained usage control, and simplifies deployment and monitoring of Llama Stack services.

Quotas are fully opt-in: they have no effect unless explicitly configured, and they require authentication to be enabled.

Highlights:
- Add `QuotaMiddleware` (llama_stack/distribution/server/quota.py):
  - Reads `Authorization: Bearer <client_id>` via AuthenticationMiddleware
  - Tracks usage via a pluggable KV store (SQLite or Redis)
  - Enforces `quota_requests_per_day` within a `quota_window_seconds` window
  - Returns HTTP 429 when the quota is exceeded
  - Returns HTTP 500 if no `authenticated_client_id` is found
- Extend `ServerConfig` with:
  - `quota_store: KVStoreConfig | None` (nullable; disables quotas if unset)
  - `quota_requests_per_day`
  - `quota_window_seconds`
- Enforce strict auth+quota coupling:
  - The server refuses to start if quotas are enabled but no auth config is present
  - Logs a clear error and exits on misconfiguration
- Wire middleware into server startup (`server.py`) and CLI entrypoint (`llama_stack/cli/stack/run.py`).
- Add CLI flags:
  - `--quota-store-type` (sqlite or redis)
  - `--quota-store-db-path` (for SQLite)
  - `--quota-requests-per-day`
  - `--quota-window-seconds`

Behavior changes:
- Quotas are disabled by default unless `quota_store` is explicitly set in the YAML config or via CLI.
- If `quota_store` is set but no DB path is specified, SQLite defaults to `./quotas.db`.
- The server requires authentication when quotas are enabled; startup will fail if quotas are configured but auth is missing.

To enable per-client request quotas in `run.yaml`, add:
```
server:
  port: 8321
  auth:
    provider_type: "custom"
    config:
      endpoint: "https://auth.example.com/validate"
  quota_store:
    type: sqlite
    db_path: ./quotas.db
  quota_requests_per_day: 1000
  quota_window_seconds: 86400
```
To enable quotas via CLI:
```
llama stack run --quota-store-type sqlite --quota-requests-per-day=1000 --quota-window-seconds=86400
```
Signed-off-by: Wen Liang <[email protected]>
1 parent a57985e commit 4a6b42d
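
The commit message describes a per-client limit enforced within a time window. A minimal sketch of that idea, assuming a fixed-window counter and an in-memory dict in place of the KV store, is shown here. `FixedWindowQuota` and its `check` method are hypothetical names for illustration and are not part of the commit.

```python
from dataclasses import dataclass, field


@dataclass
class FixedWindowQuota:
    """In-memory stand-in for a KV-backed counter (hypothetical helper)."""

    limit: int = 1000            # plays the role of quota_requests_per_day
    window_seconds: int = 86400  # plays the role of quota_window_seconds
    counts: dict[str, int] = field(default_factory=dict)

    def check(self, client_id: str, now: float) -> int:
        """Record one request and return the HTTP status it would receive."""
        bucket = int(now // self.window_seconds)  # index of the current window
        key = f"quota:{client_id}:{bucket}"
        self.counts[key] = self.counts.get(key, 0) + 1
        return 429 if self.counts[key] > self.limit else 200


quota = FixedWindowQuota(limit=2, window_seconds=60)
# Three requests inside the first 60 s window, then one in the next window.
statuses = [quota.check("client1", now=t) for t in (0, 10, 20, 61)]
print(statuses)  # [200, 200, 429, 200]
```

Keying the counter on the window index means a new window starts from zero without any explicit reset; old keys would still need expiry or cleanup in a persistent store.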

11 files changed: +480 -2 lines changed

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+name: Integration Quota Tests
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+    paths:
+      - 'llama_stack/**'
+      - 'tests/integration/**'
+      - '.github/workflows/integration-quota-tests.yml'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+
+jobs:
+  quota:
+    runs-on: ubuntu-latest
+    env:
+      LLAMA_STACK_PORT: 8321
+
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v4
+
+      - name: Set up Python & dependencies
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.10"
+      - run: |
+          uv sync --extra dev --extra test
+          uv pip install -e .
+
+      - name: Build a venv-based stack
+        run: llama stack build --template ollama --image-type venv
+
+      - name: Start the Llama Stack server
+        run: |
+          nohup uv run llama stack run \
+            --image-type venv \
+            --quota-store-type sqlite \
+            --quota-requests-per-day=2 \
+            --quota-window-seconds=60 \
+            llama_stack/templates/ollama/run.yaml \
+            > server.log 2>&1 &
+          echo "Waiting for health…"
+          for i in {1..30}; do
+            if curl -s http://localhost:${LLAMA_STACK_PORT}/v1/health | grep -q OK; then
+              echo "Server is healthy"
+              break
+            fi
+            sleep 1
+            if [ $i -eq 30 ]; then
+              echo "Server never came up:"
+              cat server.log
+              exit 1
+            fi
+          done
+
+      - name: Test quota enforcement
+        run: |
+          # 1st and 2nd requests must succeed:
+          for n in 1 2; do
+            status=$(curl -s -o /dev/null -w "%{http_code}" \
+              -H "Authorization: Bearer client1" \
+              http://localhost:${LLAMA_STACK_PORT}/test || true)
+            if [ "$status" != "200" ]; then
+              echo "Request #$n returned $status, expected 200"
+              exit 1
+            fi
+          done
+
+          # 3rd request must be throttled:
+          status=$(curl -s -o /dev/null -w "%{http_code}" \
+            -H "Authorization: Bearer client1" \
+            http://localhost:${LLAMA_STACK_PORT}/test || true)
+          if [ "$status" != "429" ]; then
+            echo "3rd request returned $status, expected 429"
+            exit 1
+          fi
+
+          echo "Quota behavior is correct"
+
+      - name: Test quotas fail without auth
+        run: |
+          echo "Starting server with quotas enabled but NO auth (should fail)..."
+          set +e
+          nohup uv run llama stack run \
+            --image-type venv \
+            --quota-store-type sqlite \
+            --quota-requests-per-day=2 \
+            --quota-window-seconds=60 \
+            llama_stack/templates/ollama/run.yaml \
+            > fail_server.log 2>&1 &
+          PID=$!
+          sleep 5
+
+          # Check if the server exited
+          if ps -p $PID > /dev/null; then
+            echo "Server did not fail as expected when quotas are enabled without auth."
+            kill $PID
+            cat fail_server.log
+            exit 1
+          else
+            echo "Server failed as expected when quotas are enabled without auth."
+            cat fail_server.log
+          fi
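
The workflow's enforcement step sends `limit` requests that must return 200 and one more that must return 429. The same check could be written in Python, sketched here against a fake in-memory server so that it runs without a live stack. `verify_quota` and `make_fake_server` are hypothetical helpers, not part of the commit; against a real deployment, `send_request` would wrap an HTTP call that sends the `Authorization: Bearer client1` header.

```python
from typing import Callable


def verify_quota(send_request: Callable[[], int], limit: int) -> None:
    """Raise AssertionError unless `limit` requests pass and the next is throttled."""
    for n in range(1, limit + 1):
        status = send_request()
        assert status == 200, f"Request #{n} returned {status}, expected 200"
    status = send_request()
    assert status == 429, f"Request #{limit + 1} returned {status}, expected 429"


def make_fake_server(limit: int) -> Callable[[], int]:
    """Hypothetical stand-in for the real server: counts calls in memory."""
    seen = 0

    def send() -> int:
        nonlocal seen
        seen += 1
        return 200 if seen <= limit else 429

    return send


verify_quota(make_fake_server(limit=2), limit=2)
print("Quota behavior is correct")
```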

docs/source/distributions/building_distro.md

Lines changed: 24 additions & 2 deletions
@@ -269,10 +269,20 @@ After this step is successful, you should be able to find the built container im
 ### Running your Stack server
 Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.

-```
+```bash
 llama stack run -h
-usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
+usage: llama stack run [-h]
+                       [--port PORT]
+                       [--image-name IMAGE_NAME]
+                       [--disable-ipv6]
+                       [--env KEY=VALUE]
+                       [--tls-keyfile TLS_KEYFILE]
+                       [--tls-certfile TLS_CERTFILE]
                        [--image-type {conda,container,venv}]
+                       [--quota-store-type {sqlite,redis}]
+                       [--quota-store-db-path QUOTA_STORE_DB_PATH]
+                       [--quota-requests-per-day QUOTA_REQUESTS_PER_DAY]
+                       [--quota-window-seconds QUOTA_WINDOW_SECONDS]
                        config

 Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
@@ -293,8 +303,20 @@ options:
                         Path to TLS certificate file for HTTPS (default: None)
   --image-type {conda,container,venv}
                         Image Type used during the build. This can be either conda or container or venv. (default: conda)
+  --quota-store-type {sqlite,redis}
+                        KV‑store backend for per‑client quotas.
+                        `sqlite` (default): stores counts in a local SQLite file
+                        `redis`: stores counts in Redis
+  --quota-store-db-path QUOTA_STORE_DB_PATH
+                        Filesystem path to the SQLite DB file (only used when `--quota-store-type=sqlite`; default: `./quotas.db`)
+  --quota-requests-per-day QUOTA_REQUESTS_PER_DAY
+                        Max requests each client may make per window (default: 1000).
+  --quota-window-seconds QUOTA_WINDOW_SECONDS
+                        Quota window length in seconds (default: 86400 = 24 h).

 ```
+**Note:** Quota enforcement requires authentication to be enabled. If you configure quotas (via YAML or CLI) but do not enable an authentication provider, the server will fail to start with a clear error.
+

 ```
 # Start using template name

llama_stack/cli/stack/run.py

Lines changed: 39 additions & 0 deletions
@@ -11,6 +11,11 @@
 from llama_stack.cli.stack.utils import ImageType
 from llama_stack.cli.subcommand import Subcommand
 from llama_stack.log import get_logger
+from llama_stack.providers.utils.kvstore.config import (
+    KVStoreType,
+    RedisKVStoreConfig,
+    SqliteKVStoreConfig,
+)

 REPO_ROOT = Path(__file__).parent.parent.parent.parent

@@ -75,6 +80,31 @@ def _add_arguments(self):
             help="Image Type used during the build. This can be either conda or container or venv.",
             choices=[e.value for e in ImageType],
         )
+        self.parser.add_argument(
+            "--quota-store-type",
+            type=str,
+            choices=[KVStoreType.sqlite.value, KVStoreType.redis.value],
+            default=KVStoreType.sqlite.value,
+            help="KV store type to back per-client quotas",
+        )
+        self.parser.add_argument(
+            "--quota-store-db-path",
+            type=str,
+            default=None,
+            help="If using sqlite KV store, filesystem path to the database file",
+        )
+        self.parser.add_argument(
+            "--quota-requests-per-day",
+            type=int,
+            default=None,
+            help="Max requests per client per day",
+        )
+        self.parser.add_argument(
+            "--quota-window-seconds",
+            type=int,
+            default=None,
+            help="Time window for the daily quota, in seconds",
+        )

     # If neither image type nor image name is provided, but at the same time
     # the current environment has conda breadcrumbs, then assume what the user
@@ -144,6 +174,15 @@ def _run_stack_run_cmd(self, args: argparse.Namespace) -> None:

             # Build the server args from the current args passed to the CLI
             server_args = argparse.Namespace()
+            # Construct a quota_store config from the CLI flags
+            if args.quota_store_type == KVStoreType.sqlite.value:
+                server_args.quota_store = SqliteKVStoreConfig(db_path=args.quota_store_db_path or "./quotas.db")
+            else:
+                server_args.quota_store = RedisKVStoreConfig()
+
+            server_args.quota_requests_per_day = args.quota_requests_per_day
+            server_args.quota_window_seconds = args.quota_window_seconds
+
             for arg in vars(args):
                 # If this is a function, avoid passing it
                 # "args" contains:
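
In the hunks above, `--quota-store-type` defaults to `sqlite`, so `server_args.quota_store` is populated on every run of this code path. A hypothetical variant that leaves quotas off unless a quota flag is actually passed might look like the sketch below. Plain dicts stand in for `SqliteKVStoreConfig` and `RedisKVStoreConfig`, and `build_parser` and `quota_store_from_args` are illustrative names that do not appear in the commit.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the two store flags; both default to None (assumption)."""
    parser = argparse.ArgumentParser(prog="run-sketch")
    parser.add_argument("--quota-store-type", choices=["sqlite", "redis"], default=None)
    parser.add_argument("--quota-store-db-path", default=None)
    return parser


def quota_store_from_args(args: argparse.Namespace) -> Optional[dict]:
    """Return a plain-dict store config, or None when no quota flag was given."""
    if args.quota_store_type is None and args.quota_store_db_path is None:
        return None
    if args.quota_store_type == "redis":
        return {"type": "redis"}
    # sqlite, either requested explicitly or implied by a DB path
    return {"type": "sqlite", "db_path": args.quota_store_db_path or "./quotas.db"}


parser = build_parser()
print(quota_store_from_args(parser.parse_args([])))  # None
print(quota_store_from_args(parser.parse_args(["--quota-store-type", "sqlite"])))
```

With `None` as the default, an absent flag stays distinguishable from an explicit `sqlite`, which is what lets the caller keep quotas disabled.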

llama_stack/distribution/datatypes.py

Lines changed: 15 additions & 0 deletions
@@ -253,6 +253,21 @@ class ServerConfig(BaseModel):
         default=None,
         description="Authentication configuration for the server",
     )
+    quota_store: KVStoreConfig | None = Field(
+        default=None,
+        description=(
+            "KV store configuration for per-client quota tracking. "
+            "Use type: sqlite or redis. If unset or null, quotas are disabled."
+        ),
+    )
+    quota_requests_per_day: int | None = Field(
+        default=None,
+        description="Maximum number of requests allowed per client per day (None disables limit).",
+    )
+    quota_window_seconds: int | None = Field(
+        default=None,
+        description="Quota window in seconds (None disables limit).",
+    )


 class StackRunConfig(BaseModel):
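
All three quota fields default to `None`, and `server.py` forwards the two numeric ones to the middleware unchanged, so a caller may want to resolve them before use. The sketch below shows one way to do that with a plain dataclass instead of the pydantic model. `ServerConfigSketch` and `quota_settings` are hypothetical names; the fallback values 1000 and 86400 are taken from the `QuotaMiddleware` constructor defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerConfigSketch:
    """Dataclass stand-in for the quota-related part of ServerConfig."""

    auth: Optional[dict] = None
    quota_store: Optional[dict] = None
    quota_requests_per_day: Optional[int] = None
    quota_window_seconds: Optional[int] = None

    def quota_settings(self) -> Optional[tuple[int, int]]:
        """Return (limit, window) when quotas are on, None when they are off."""
        if self.quota_store is None:
            return None  # quotas disabled
        if self.auth is None:
            raise RuntimeError("Quota middleware requires authentication middleware to be active.")
        # Resolve unset fields to the middleware's constructor defaults.
        limit = self.quota_requests_per_day if self.quota_requests_per_day is not None else 1000
        window = self.quota_window_seconds if self.quota_window_seconds is not None else 86400
        return limit, window


cfg = ServerConfigSketch(auth={"provider_type": "custom"}, quota_store={"type": "sqlite"})
print(cfg.quota_settings())  # (1000, 86400)
```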

llama_stack/distribution/server/auth.py

Lines changed: 2 additions & 0 deletions
@@ -113,6 +113,8 @@ async def __call__(self, scope, receive, send):
                     "namespaces": [token],
                 }

+            scope["authenticated_client_id"] = token
+
             # Store attributes in request scope
             scope["user_attributes"] = user_attributes
             logger.debug(f"Authentication successful: {len(scope['user_attributes'])} attributes")
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+# llama_stack/distribution/server/quota.py
+
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+import json
+from datetime import datetime, timezone
+
+from starlette.types import ASGIApp, Receive, Scope, Send
+
+from llama_stack.log import get_logger
+from llama_stack.providers.utils.kvstore.api import KVStore
+from llama_stack.providers.utils.kvstore.config import KVStoreConfig, SqliteKVStoreConfig
+from llama_stack.providers.utils.kvstore.kvstore import kvstore_impl
+
+logger = get_logger(name=__name__, category="quota")
+
+
+class QuotaMiddleware:
+    """
+    ASGI middleware enforcing per client daily request quotas.
+
+    Expects Authorization: Bearer <client_id> header.
+    Tracks counts in a KV store (SQLite by default); returns HTTP 429 when limit is exceeded.
+    """
+
+    def __init__(
+        self,
+        app: ASGIApp,
+        kv_config: KVStoreConfig | None = None,
+        default_requests_per_day: int = 1000,
+        window_seconds: int = 86400,
+    ):
+        self.app = app
+        # if no config passed, default to on disk SQLite
+        self._kv_config = kv_config or SqliteKVStoreConfig(db_path="./quotas.db")
+        self._kv: KVStore | None = None
+        self.default_limit = default_requests_per_day
+        self.window = window_seconds
+
+    async def _get_kv(self) -> KVStore:
+        if self._kv is None:
+            self._kv = await kvstore_impl(self._kv_config)
+        return self._kv
+
+    async def __call__(self, scope: Scope, receive: Receive, send: Send):
+        if scope["type"] == "http":
+            client_id = scope.get("authenticated_client_id")
+            if not client_id:
+                logger.error(
+                    "QuotaMiddleware requires an authenticated client_id but none was found in the scope. "
+                    "This likely means AuthenticationMiddleware is not installed or failed."
+                )
+                return await self._send_error(
+                    send, 500, "Quota system misconfigured: missing authenticated client identity"
+                )
+
+            key = f"quota:{client_id}:{datetime.now(timezone.utc).date().isoformat()}"
+
+            try:
+                kv = await self._get_kv()
+                prev = await kv.get(key) or "0"
+                count = int(prev) + 1
+                await kv.set(key, str(count))
+                # Note: TTL/expire is only supported on backends that implement it;
+                # for SQLite we ignore expire.
+            except Exception:
+                logger.exception("Error accessing KV store for quota")
+                return await self._send_error(send, 500, "Quota service error")
+
+            if count > self.default_limit:
+                logger.warning(
+                    "Quota exceeded for client %s: %d/%d",
+                    client_id,
+                    count,
+                    self.default_limit,
+                )
+                return await self._send_error(send, 429, "Quota exceeded")
+
+        # Pass through to downstream application
+        return await self.app(scope, receive, send)
+
+    async def _send_error(self, send: Send, status: int, message: str):
+        await send(
+            {
+                "type": "http.response.start",
+                "status": status,
+                "headers": [[b"content-type", b"application/json"]],
+            }
+        )
+        body = json.dumps({"error": {"message": message}}).encode()
+        await send({"type": "http.response.body", "body": body})
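
An ASGI middleware like this can be exercised without a running server by calling it directly with a hand-built scope and a `send` callable that records messages. The sketch below does that for a simplified stand-in, assuming an in-memory dict instead of the KV store and no date-based key. `TinyQuotaMiddleware`, `ok_app`, and `call` are hypothetical names and not part of the commit.

```python
import asyncio
import json


class TinyQuotaMiddleware:
    """Simplified stand-in for QuotaMiddleware with an in-memory counter."""

    def __init__(self, app, limit: int = 1000):
        self.app = app
        self.limit = limit
        self.counts: dict[str, int] = {}

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        client_id = scope.get("authenticated_client_id")
        if not client_id:
            return await self._send_error(send, 500, "missing authenticated client identity")
        self.counts[client_id] = self.counts.get(client_id, 0) + 1
        if self.counts[client_id] > self.limit:
            return await self._send_error(send, 429, "Quota exceeded")
        return await self.app(scope, receive, send)

    async def _send_error(self, send, status: int, message: str):
        await send({"type": "http.response.start", "status": status,
                    "headers": [[b"content-type", b"application/json"]]})
        body = json.dumps({"error": {"message": message}}).encode()
        await send({"type": "http.response.body", "body": body})


async def ok_app(scope, receive, send):
    """Downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call(app, scope) -> int:
    """Drive one request through `app` and return the response status."""
    messages = []

    async def send(message):
        messages.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    await app(scope, receive, send)
    return messages[0]["status"]


async def main() -> list[int]:
    app = TinyQuotaMiddleware(ok_app, limit=2)
    scope = {"type": "http", "authenticated_client_id": "client1"}
    statuses = [await call(app, dict(scope)) for _ in range(3)]
    statuses.append(await call(app, {"type": "http"}))  # no client id in scope
    return statuses


print(asyncio.run(main()))  # [200, 200, 429, 500]
```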

llama_stack/distribution/server/server.py

Lines changed: 26 additions & 0 deletions
@@ -58,6 +58,7 @@

 from .auth import AuthenticationMiddleware
 from .endpoints import get_all_api_endpoints
+from .quota import QuotaMiddleware

 REPO_ROOT = Path(__file__).parent.parent.parent.parent

@@ -401,6 +402,13 @@ def main(args: argparse.Namespace | None = None):
         config = replace_env_vars(config_contents)
         config = StackRunConfig(**config)

+    if getattr(args, "quota_store", None):
+        config.server.quota_store = args.quota_store
+    if getattr(args, "quota_requests_per_day", None) is not None:
+        config.server.quota_requests_per_day = args.quota_requests_per_day
+    if getattr(args, "quota_window_seconds", None) is not None:
+        config.server.quota_window_seconds = args.quota_window_seconds
+
     # now that the logger is initialized, print the line about which type of config we are using.
     logger.info(log_line)

@@ -421,6 +429,24 @@ def main(args: argparse.Namespace | None = None):
     if config.server.auth:
         logger.info(f"Enabling authentication with provider: {config.server.auth.provider_type.value}")
         app.add_middleware(AuthenticationMiddleware, auth_config=config.server.auth)
+    else:
+        # NEW: Ensure quotas can't be enabled without authentication
+        if config.server.quota_store:
+            logger.error(
+                "Quota enforcement requires authentication to be enabled, but no auth config is present. "
+                "Disable quotas or configure authentication."
+            )
+            raise RuntimeError("Quota middleware requires authentication middleware to be active.")
+
+    # Enforce per-client quota (only if configured and require authentication)
+    if config.server.quota_store:
+        logger.info("Enabling per-client quota middleware")
+        app.add_middleware(
+            QuotaMiddleware,
+            kv_config=config.server.quota_store,
+            default_requests_per_day=config.server.quota_requests_per_day,
+            window_seconds=config.server.quota_window_seconds,
+        )

     try:
         impls = asyncio.run(construct_stack(config))
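
Middleware order matters here because `QuotaMiddleware` reads `scope["authenticated_client_id"]`, which `AuthenticationMiddleware` writes. Starlette is generally understood to place the most recently added middleware outermost, so the effective order of the two `add_middleware` calls is worth confirming against the installed version. The sketch below involves no Starlette at all: it uses hand-rolled wrappers (`auth_layer`, `quota_layer`, and `endpoint` are hypothetical names) to show how wrapping order decides whether the quota layer sees a client id.

```python
import asyncio


def auth_layer(app):
    """Hypothetical auth wrapper: writes the client id into the scope."""
    async def wrapped(scope):
        scope["authenticated_client_id"] = "client1"
        return await app(scope)
    return wrapped


def quota_layer(app):
    """Hypothetical quota wrapper: answers 500 when no client id is present."""
    async def wrapped(scope):
        if not scope.get("authenticated_client_id"):
            return 500
        return await app(scope)
    return wrapped


async def endpoint(scope):
    return 200


# Auth outermost: it runs first, so the quota layer finds the client id.
auth_outside = auth_layer(quota_layer(endpoint))
# Quota outermost: it runs before auth has populated the scope.
quota_outside = quota_layer(auth_layer(endpoint))

print(asyncio.run(auth_outside({})), asyncio.run(quota_outside({})))  # 200 500
```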

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ dependencies = [
     "pillow",
     "h11>=0.16.0",
     "kubernetes",
+    "redis>=4.4.0",
 ]

 [project.optional-dependencies]
