feat(quota): add server-side per-client request quotas (requires auth)

liangwen12year · Wen Liang · commit 4a6b42da3a0e · 2025-05-06T13:04:30.000-04:00
Unrestricted usage can lead to runaway costs and fragmented client-side workarounds. This commit introduces a native quota mechanism to the server, giving operators a unified, centrally managed throttle for per-client requests without needing extra proxies or custom client logic. This helps contain cloud-compute expenses, enables fine-grained usage control, and simplifies deployment and monitoring of Llama Stack services.

Quotas are fully opt-in: they have no effect unless explicitly configured, and they require authentication to be enabled.

Highlights:

- Add `QuotaMiddleware` (llama_stack/distribution/server/quota.py):
  - Reads `Authorization: Bearer <client_id>` via AuthenticationMiddleware
  - Tracks usage via a pluggable KV store (SQLite or Redis)
  - Enforces `quota_requests_per_day` within a `quota_window_seconds` window
  - Returns HTTP 429 when the quota is exceeded
  - Returns HTTP 500 if no `authenticated_client_id` is found
- Extend `ServerConfig` with:
  - `quota_store: KVStoreConfig | None` (nullable; disables quotas if unset)
  - `quota_requests_per_day`
  - `quota_window_seconds`
- Enforce strict auth+quota coupling:
  - The server refuses to start if quotas are enabled but no auth config is present
  - Logs a clear error and exits on misconfiguration
- Wire middleware into server startup (`server.py`) and CLI entrypoint (`llama_stack/cli/stack/run.py`).
- Add CLI flags:
  - `--quota-store-type` (sqlite or redis)
  - `--quota-store-db-path` (for SQLite)
  - `--quota-requests-per-day`
  - `--quota-window-seconds`

Behavior changes:

- Quotas are disabled by default unless `quota_store` is explicitly set in the YAML config or via CLI.
- If `quota_store` is set but no DB path is specified, SQLite defaults to `./quotas.db`.
- The server requires authentication when quotas are enabled; startup will fail if quotas are configured but auth is missing.

To enable per-client request quotas in `run.yaml`, add:

```
server:
  port: 8321
  auth:
    provider_type: "custom"
    config:
      endpoint: "https://auth.example.com/validate"
  quota_store:
    type: sqlite
    db_path: ./quotas.db
  quota_requests_per_day: 1000
  quota_window_seconds: 86400
```

To enable quotas via CLI:

```
llama stack run --quota-store-type sqlite --quota-requests-per-day=1000 --quota-window-seconds=86400
```

Signed-off-by: Wen Liang <wenliang@redhat.com>
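The counting rule described in the commit message can be sketched in a few lines. This is a minimal illustration, not the middleware itself: `check_quota` and the in-memory `counts` dict are hypothetical stand-ins for the KV store, and the key format mirrors the `quota:<client_id>:<UTC date>` scheme that `quota.py` uses in this commit.

```python
from __future__ import annotations

from datetime import datetime, timezone


def check_quota(
    counts: dict[str, int],
    client_id: str,
    limit: int,
    now: datetime | None = None,
) -> bool:
    """Increment the client's counter for the current UTC day.

    Returns True if the request is within the limit, False if it should be
    rejected (the server would answer HTTP 429 in that case).
    """
    now = now or datetime.now(timezone.utc)
    # One counter per client per UTC calendar day (assumed key scheme).
    key = f"quota:{client_id}:{now.date().isoformat()}"
    counts[key] = counts.get(key, 0) + 1
    return counts[key] <= limit


counts: dict[str, int] = {}
day = datetime(2025, 5, 6, 12, 0, tzinfo=timezone.utc)
results = [check_quota(counts, "client1", limit=2, now=day) for _ in range(3)]
print(results)  # [True, True, False]
```

With a limit of 2, the first two calls pass and the third is rejected, which is the same sequence the integration workflow in this PR checks against the running server.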
diff --git a/.github/workflows/integration-quota-tests.yml b/.github/workflows/integration-quota-tests.yml
@@ -0,0 +1,106 @@
+name: Integration Quota Tests
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+    paths:
+      - 'llama_stack/**'
+      - 'tests/integration/**'
+      - '.github/workflows/integration-quota-tests.yml'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+
+jobs:
+  quota:
+    runs-on: ubuntu-latest
+    env:
+      LLAMA_STACK_PORT: 8321
+
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v4
+
+      - name: Set up Python & dependencies
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.10"
+      - run: |
+          uv sync --extra dev --extra test
+          uv pip install -e .
+
+      - name: Build a venv-based stack
+        run: llama stack build --template ollama --image-type venv
+
+      - name: Start the Llama Stack server
+        run: |
+          nohup uv run llama stack run \
+            --image-type venv \
+            --quota-store-type sqlite \
+            --quota-requests-per-day=2 \
+            --quota-window-seconds=60 \
+            llama_stack/templates/ollama/run.yaml \
+            > server.log 2>&1 &
+          echo "Waiting for health…"
+          for i in {1..30}; do
+            if curl -s http://localhost:${LLAMA_STACK_PORT}/v1/health | grep -q OK; then
+              echo "Server is healthy"
+              break
+            fi
+            sleep 1
+            if [ $i -eq 30 ]; then
+              echo "Server never came up:"
+              cat server.log
+              exit 1
+            fi
+          done
+
+      - name: Test quota enforcement
+        run: |
+          # 1st and 2nd requests must succeed:
+          for n in 1 2; do
+            status=$(curl -s -o /dev/null -w "%{http_code}" \
+              -H "Authorization: Bearer client1" \
+              http://localhost:${LLAMA_STACK_PORT}/test || true)
+            if [ "$status" != "200" ]; then
+              echo "Request #$n returned $status, expected 200"
+              exit 1
+            fi
+          done
+
+          # 3rd request must be throttled:
+          status=$(curl -s -o /dev/null -w "%{http_code}" \
+            -H "Authorization: Bearer client1" \
+            http://localhost:${LLAMA_STACK_PORT}/test || true)
+          if [ "$status" != "429" ]; then
+            echo "3rd request returned $status, expected 429"
+            exit 1
+          fi
+
+          echo "Quota behavior is correct"
+
+      - name: Test quotas fail without auth
+        run: |
+          echo "Starting server with quotas enabled but NO auth (should fail)..."
+          set +e
+          nohup uv run llama stack run \
+            --image-type venv \
+            --quota-store-type sqlite \
+            --quota-requests-per-day=2 \
+            --quota-window-seconds=60 \
+            llama_stack/templates/ollama/run.yaml \
+            > fail_server.log 2>&1 &
+          PID=$!
+          sleep 5
+
+          # Check if the server exited
+          if ps -p $PID > /dev/null; then
+            echo "Server did not fail as expected when quotas are enabled without auth."
+            kill $PID
+            cat fail_server.log
+            exit 1
+          else
+            echo "Server failed as expected when quotas are enabled without auth."
+            cat fail_server.log
+          fi
diff --git a/docs/source/distributions/building_distro.md b/docs/source/distributions/building_distro.md
@@ -269,10 +269,20 @@ After this step is successful, you should be able to find the built container im
 ### Running your Stack server
 Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.
 
-```
+```bash
 llama stack run -h
-usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
+usage: llama stack run [-h]
+                       [--port PORT]
+                       [--image-name IMAGE_NAME]
+                       [--disable-ipv6]
+                       [--env KEY=VALUE]
+                       [--tls-keyfile TLS_KEYFILE]
+                       [--tls-certfile TLS_CERTFILE]
                        [--image-type {conda,container,venv}]
+                       [--quota-store-type {sqlite,redis}]
+                       [--quota-store-db-path QUOTA_STORE_DB_PATH]
+                       [--quota-requests-per-day QUOTA_REQUESTS_PER_DAY]
+                       [--quota-window-seconds QUOTA_WINDOW_SECONDS]
                        config
 
 Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
@@ -293,8 +303,20 @@ options:
                         Path to TLS certificate file for HTTPS (default: None)
   --image-type {conda,container,venv}
                         Image Type used during the build. This can be either conda or container or venv. (default: conda)
+  --quota-store-type {sqlite,redis}
+    KV‑store backend for per‑client quotas.
+    • `sqlite` (default): stores counts in a local SQLite file
+    • `redis`: stores counts in Redis
+  --quota-store-db-path QUOTA_STORE_DB_PATH
+    Filesystem path to the SQLite DB file (only used when `--quota-store-type=sqlite`; default: `./quotas.db`)
+  --quota-requests-per-day QUOTA_REQUESTS_PER_DAY
+                        Max requests each client may make per window (default: 1000).
+  --quota-window-seconds QUOTA_WINDOW_SECONDS
+                        Quota window length in seconds (default: 86400 = 24 h).
 
 ```
+**Note:** Quota enforcement requires authentication to be enabled. If you configure quotas (via YAML or CLI) but do not enable an authentication provider, the server will fail to start with a clear error.
+
 
 ```
 # Start using template name
diff --git a/llama_stack/cli/stack/run.py b/llama_stack/cli/stack/run.py
@@ -11,6 +11,11 @@
 from llama_stack.cli.stack.utils import ImageType
 from llama_stack.cli.subcommand import Subcommand
 from llama_stack.log import get_logger
+from llama_stack.providers.utils.kvstore.config import (
+    KVStoreType,
+    RedisKVStoreConfig,
+    SqliteKVStoreConfig,
+)
 
 REPO_ROOT = Path(__file__).parent.parent.parent.parent
 
@@ -75,6 +80,31 @@ def _add_arguments(self):
             help="Image Type used during the build. This can be either conda or container or venv.",
             choices=[e.value for e in ImageType],
         )
+        self.parser.add_argument(
+            "--quota-store-type",
+            type=str,
+            choices=[KVStoreType.sqlite.value, KVStoreType.redis.value],
+            default=KVStoreType.sqlite.value,
+            help="KV store type to back per-client quotas",
+        )
+        self.parser.add_argument(
+            "--quota-store-db-path",
+            type=str,
+            default=None,
+            help="If using sqlite KV store, filesystem path to the database file",
+        )
+        self.parser.add_argument(
+            "--quota-requests-per-day",
+            type=int,
+            default=None,
+            help="Max requests per client per day",
+        )
+        self.parser.add_argument(
+            "--quota-window-seconds",
+            type=int,
+            default=None,
+            help="Time window for the daily quota, in seconds",
+        )
 
     # If neither image type nor image name is provided, but at the same time
     # the current environment has conda breadcrumbs, then assume what the user
@@ -144,6 +174,15 @@ def _run_stack_run_cmd(self, args: argparse.Namespace) -> None:
 
             # Build the server args from the current args passed to the CLI
             server_args = argparse.Namespace()
+            # Construct a quota_store config from the CLI flags
+            if args.quota_store_type == KVStoreType.sqlite.value:
+                server_args.quota_store = SqliteKVStoreConfig(db_path=args.quota_store_db_path or "./quotas.db")
+            else:
+                server_args.quota_store = RedisKVStoreConfig()
+
+            server_args.quota_requests_per_day = args.quota_requests_per_day
+            server_args.quota_window_seconds = args.quota_window_seconds
+
             for arg in vars(args):
                 # If this is a function, avoid passing it
                 # "args" contains:
diff --git a/llama_stack/distribution/datatypes.py b/llama_stack/distribution/datatypes.py
@@ -253,6 +253,21 @@ class ServerConfig(BaseModel):
         default=None,
         description="Authentication configuration for the server",
     )
+    quota_store: KVStoreConfig | None = Field(
+        default=None,
+        description=(
+            "KV store configuration for per-client quota tracking. "
+            "Use type: sqlite or redis. If unset or null, quotas are disabled."
+        ),
+    )
+    quota_requests_per_day: int | None = Field(
+        default=None,
+        description="Maximum number of requests allowed per client per day (None disables limit).",
+    )
+    quota_window_seconds: int | None = Field(
+        default=None,
+        description="Quota window in seconds (None disables limit).",
+    )
 
 
 class StackRunConfig(BaseModel):
diff --git a/llama_stack/distribution/server/auth.py b/llama_stack/distribution/server/auth.py
@@ -113,6 +113,8 @@ async def __call__(self, scope, receive, send):
                     "namespaces": [token],
                 }
 
+            scope["authenticated_client_id"] = token
+
             # Store attributes in request scope
             scope["user_attributes"] = user_attributes
             logger.debug(f"Authentication successful: {len(scope['user_attributes'])} attributes")
diff --git a/llama_stack/distribution/server/quota.py b/llama_stack/distribution/server/quota.py
@@ -0,0 +1,95 @@
+# llama_stack/distribution/server/quota.py
+
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+import json
+from datetime import datetime, timezone
+
+from starlette.types import ASGIApp, Receive, Scope, Send
+
+from llama_stack.log import get_logger
+from llama_stack.providers.utils.kvstore.api import KVStore
+from llama_stack.providers.utils.kvstore.config import KVStoreConfig, SqliteKVStoreConfig
+from llama_stack.providers.utils.kvstore.kvstore import kvstore_impl
+
+logger = get_logger(name=__name__, category="quota")
+
+
+class QuotaMiddleware:
+    """
+    ASGI middleware enforcing per client daily request quotas.
+
+    Expects Authorization: Bearer <client_id> header.
+    Tracks counts in a KV store (SQLite by default); returns HTTP 429 when limit is exceeded.
+    """
+
+    def __init__(
+        self,
+        app: ASGIApp,
+        kv_config: KVStoreConfig | None = None,
+        default_requests_per_day: int = 1000,
+        window_seconds: int = 86400,
+    ):
+        self.app = app
+        # if no config passed, default to on disk SQLite
+        self._kv_config = kv_config or SqliteKVStoreConfig(db_path="./quotas.db")
+        self._kv: KVStore | None = None
+        self.default_limit = default_requests_per_day
+        self.window = window_seconds
+
+    async def _get_kv(self) -> KVStore:
+        if self._kv is None:
+            self._kv = await kvstore_impl(self._kv_config)
+        return self._kv
+
+    async def __call__(self, scope: Scope, receive: Receive, send: Send):
+        if scope["type"] == "http":
+            client_id = scope.get("authenticated_client_id")
+            if not client_id:
+                logger.error(
+                    "QuotaMiddleware requires an authenticated client_id but none was found in the scope. "
+                    "This likely means AuthenticationMiddleware is not installed or failed."
+                )
+                return await self._send_error(
+                    send, 500, "Quota system misconfigured: missing authenticated client identity"
+                )
+
+            key = f"quota:{client_id}:{datetime.now(timezone.utc).date().isoformat()}"
+
+            try:
+                kv = await self._get_kv()
+                prev = await kv.get(key) or "0"
+                count = int(prev) + 1
+                await kv.set(key, str(count))
+                # Note: TTL/expire is only supported on backends that implement it;
+                # for SQLite we ignore expire.
+            except Exception:
+                logger.exception("Error accessing KV store for quota")
+                return await self._send_error(send, 500, "Quota service error")
+
+            if count > self.default_limit:
+                logger.warning(
+                    "Quota exceeded for client %s: %d/%d",
+                    client_id,
+                    count,
+                    self.default_limit,
+                )
+                return await self._send_error(send, 429, "Quota exceeded")
+
+        # Pass through to downstream application
+        return await self.app(scope, receive, send)
+
+    async def _send_error(self, send: Send, status: int, message: str):
+        await send(
+            {
+                "type": "http.response.start",
+                "status": status,
+                "headers": [[b"content-type", b"application/json"]],
+            }
+        )
+        body = json.dumps({"error": {"message": message}}).encode()
+        await send({"type": "http.response.body", "body": body})
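The three outcomes of the middleware above (pass-through, 429, and 500) can be exercised without a running server by calling an ASGI app directly. The sketch below is an assumption-laden stand-in: `DictQuotaMiddleware`, `ok_app`, and `call` are hypothetical names, the counter lives in a plain dict rather than a KV store, and the per-day key is omitted for brevity.

```python
import asyncio
import json


class DictQuotaMiddleware:
    """Hypothetical stand-in for QuotaMiddleware that counts in a plain dict."""

    def __init__(self, app, limit: int):
        self.app = app
        self.limit = limit
        self.counts = {}

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            client_id = scope.get("authenticated_client_id")
            if not client_id:
                # Auth middleware did not run or did not set the client id.
                return await self._send_error(send, 500, "missing authenticated client identity")
            self.counts[client_id] = self.counts.get(client_id, 0) + 1
            if self.counts[client_id] > self.limit:
                return await self._send_error(send, 429, "Quota exceeded")
        return await self.app(scope, receive, send)

    async def _send_error(self, send, status, message):
        await send({"type": "http.response.start", "status": status,
                    "headers": [[b"content-type", b"application/json"]]})
        await send({"type": "http.response.body",
                    "body": json.dumps({"error": {"message": message}}).encode()})


async def ok_app(scope, receive, send):
    """Downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call(app, scope) -> int:
    """Drive one request through the ASGI app and return the response status."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app(scope, receive, send)
    return next(m["status"] for m in sent if m["type"] == "http.response.start")


async def demo():
    app = DictQuotaMiddleware(ok_app, limit=2)
    authed = {"type": "http", "authenticated_client_id": "client1"}
    statuses = [await call(app, dict(authed)) for _ in range(3)]
    statuses.append(await call(app, {"type": "http"}))  # no client id in scope
    return statuses


statuses = asyncio.run(demo())
print(statuses)  # [200, 200, 429, 500]
```

The same harness shape should work against the real `QuotaMiddleware` if a KV store config is supplied, though that has not been verified here.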
diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
@@ -58,6 +58,7 @@
 
 from .auth import AuthenticationMiddleware
 from .endpoints import get_all_api_endpoints
+from .quota import QuotaMiddleware
 
 REPO_ROOT = Path(__file__).parent.parent.parent.parent
 
@@ -401,6 +402,13 @@ def main(args: argparse.Namespace | None = None):
         config = replace_env_vars(config_contents)
         config = StackRunConfig(**config)
 
+        if getattr(args, "quota_store", None):
+            config.server.quota_store = args.quota_store
+        if getattr(args, "quota_requests_per_day", None) is not None:
+            config.server.quota_requests_per_day = args.quota_requests_per_day
+        if getattr(args, "quota_window_seconds", None) is not None:
+            config.server.quota_window_seconds = args.quota_window_seconds
+
     # now that the logger is initialized, print the line about which type of config we are using.
     logger.info(log_line)
 
@@ -421,6 +429,24 @@ def main(args: argparse.Namespace | None = None):
     if config.server.auth:
         logger.info(f"Enabling authentication with provider: {config.server.auth.provider_type.value}")
         app.add_middleware(AuthenticationMiddleware, auth_config=config.server.auth)
+    else:
+        # NEW: Ensure quotas can't be enabled without authentication
+        if config.server.quota_store:
+            logger.error(
+                "Quota enforcement requires authentication to be enabled, but no auth config is present. "
+                "Disable quotas or configure authentication."
+            )
+            raise RuntimeError("Quota middleware requires authentication middleware to be active.")
+
+    # Enforce per-client quota (only if configured and require authentication)
+    if config.server.quota_store:
+        logger.info("Enabling per-client quota middleware")
+        app.add_middleware(
+            QuotaMiddleware,
+            kv_config=config.server.quota_store,
+            default_requests_per_day=config.server.quota_requests_per_day,
+            window_seconds=config.server.quota_window_seconds,
+        )
 
     try:
         impls = asyncio.run(construct_stack(config))
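The auth and quota coupling enforced in this hunk can be summarized as a small decision function. This is a sketch under stated assumptions: `ServerSettings` is a hypothetical, trimmed-down stand-in for `ServerConfig`, and `quotas_enabled` is not a function in the codebase.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServerSettings:
    """Hypothetical stand-in for ServerConfig with only the relevant fields."""

    auth: dict | None = None
    quota_store: dict | None = None


def quotas_enabled(settings: ServerSettings) -> bool:
    """Return True if quota middleware should be installed.

    Raises RuntimeError when quotas are configured without authentication,
    mirroring the startup failure described in this commit.
    """
    if settings.quota_store is None:
        return False  # quotas are opt-in
    if settings.auth is None:
        raise RuntimeError("Quota middleware requires authentication middleware to be active.")
    return True


print(quotas_enabled(ServerSettings()))  # False
print(quotas_enabled(ServerSettings(auth={"provider_type": "custom"},
                                    quota_store={"type": "sqlite"})))  # True
```

Passing a `quota_store` without `auth` raises `RuntimeError`, which corresponds to the "quotas fail without auth" step in the integration workflow.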
diff --git a/pyproject.toml b/pyproject.toml
@@ -40,6 +40,7 @@ dependencies = [
     "pillow",
     "h11>=0.16.0",
     "kubernetes",
+    "redis>=4.4.0",
 ]
 
 [project.optional-dependencies]
diff --git a/requirements.txt b/requirements.txt
diff --git a/tests/unit/server/test_quota.py b/tests/unit/server/test_quota.py
diff --git a/uv.lock b/uv.lock
