diff --git a/docs/content/architecture.md b/docs/content/architecture.md
index 3a0b1ba6a..8dde80920 100644
--- a/docs/content/architecture.md
+++ b/docs/content/architecture.md
@@ -4,152 +4,321 @@
The MaaS Platform is designed as a cloud-native, Kubernetes-based solution that provides policy-based access control, rate limiting, and tier-based subscriptions for AI model serving. The architecture follows microservices principles and leverages OpenShift/Kubernetes native components for scalability and reliability.
-## 🏗️ High-Level Architecture
+## Architecture
+
+### 🏗️ High-Level Architecture
+
+The MaaS Platform is an end-to-end solution that leverages Red Hat Connectivity Link (Kuadrant) Application Connectivity Policies and Red Hat OpenShift AI's Model Serving capabilities to provide a fully managed, scalable, and secure self-service platform for AI model serving.
+
+```mermaid
+graph LR
+ subgraph "User Layer"
+ User[Users]
+ AdminUI[Admin/User UI]
+ end
+
+ subgraph "Token Management"
+        MaaSAPI[MaaS API<br/>Token Retrieval]
+ end
+
+ subgraph "Gateway & Auth"
+ GatewayAPI[Gateway API]
+        RHCL[RHCL Components<br/>Kuadrant/Authorino/Limitador<br/>Auth & Rate Limiting]
+ end
+
+ subgraph "Model Serving"
+        RHOAI[RHOAI<br/>LLM Models]
+ end
+
+ User -->|1. Request Token| AdminUI
+ User -->|1a. Direct Token Request| MaaSAPI
+ AdminUI -->|2. Retrieve Token| MaaSAPI
+    User -->|3. Inference Request<br/>with Token| GatewayAPI
+ GatewayAPI -->|4. Auth & Rate Limit| RHCL
+ RHCL -->|5. Forward to Model| RHOAI
+
+ style MaaSAPI fill:#e1f5fe
+ style GatewayAPI fill:#f3e5f5
+ style RHCL fill:#fff3e0
+ style RHOAI fill:#e8f5e8
+```
+
+### Architecture Details
+
+The MaaS Platform architecture is designed to be modular and scalable. It is composed of the following components:
+
+- **MaaS API**: The central component for token generation and management.
+- **Gateway API**: The entry point for all inference requests.
+- **Kuadrant (Red Hat Connectivity Link)**: The policy engine for authentication and authorization.
+- **Open Data Hub (Red Hat OpenShift AI)**: The model serving platform.
+
+### Detailed Component Architecture
+
+#### MaaS API Component Details
+
+The MaaS API provides a self-service platform for users to request tokens for their inference requests. By leveraging Kubernetes native objects like ConfigMaps and ServiceAccounts, it offers model owners a simple way to configure access to their models based on a familiar group-based access control model.
+
+```mermaid
+graph TB
+ subgraph "External Access"
+ User[Users]
+ AdminUI[Admin/User UI]
+ end
+
+ subgraph "MaaS API Service"
+        API[**MaaS API**<br/>Go + Gin Framework]
+ TierMapping[**Tier Mapping Logic**]
+ TokenGen[**Service Account Token Generation**]
+ end
+
+ subgraph "Configuration"
+        ConfigMap[**ConfigMap**<br/>tier-to-group-mapping]
+        K8sGroups[**Kubernetes Groups**<br/>tier-free-users<br/>tier-premium-users<br/>tier-enterprise-users]
+ end
+
+ subgraph "free namespace"
+        FreeSA1[**ServiceAccount**<br/>freeuser1-sa]
+        FreeSA2[**ServiceAccount**<br/>freeuser2-sa]
+ end
+
+ subgraph "premium namespace"
+        PremiumSA1[**ServiceAccount**<br/>prem-user1-sa]
+ end
+
+ subgraph "enterprise namespace"
+        EnterpriseSA1[**ServiceAccount**<br/>ent-user1-sa]
+ end
+
+ User -->|Direct API Call| API
+ AdminUI -->|Token Request| API
+
+ API --> TierMapping
+ API --> TokenGen
+
+ TierMapping --> ConfigMap
+ ConfigMap -->|Maps Groups to Tiers| K8sGroups
+ TokenGen --> FreeSA1
+ TokenGen --> FreeSA2
+ TokenGen --> PremiumSA1
+ TokenGen --> EnterpriseSA1
+
+ K8sGroups -->|Group Membership| TierMapping
+
+ style API fill:#e1f5fe
+ style ConfigMap fill:#ffeb3b
+ style K8sGroups fill:#ffeb3b
+ style FreeSA1 fill:#ffeb3b
+ style FreeSA2 fill:#ffeb3b
+ style PremiumSA1 fill:#ffeb3b
+ style EnterpriseSA1 fill:#ffeb3b
+```
+
+**Key Features:**
+
+- **Tier-to-Group Mapping**: Uses a ConfigMap in the same namespace as the MaaS API to map Kubernetes groups to tiers
+- **Three Configurable Default Tiers**: Out of the box, the MaaS Platform ships with three default tiers (free, premium, and enterprise); these are configurable and can be extended with additional tiers as needed
+- **Service Account Tokens**: Generates tokens for the appropriate tier's service account based on the user's group membership
+- **Future Enhancements**: Planned improvements include more sophisticated token management and integration with external identity providers
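+
+As an illustration, the tier-to-group mapping could be expressed as a ConfigMap like the one below. This is a sketch only: the ConfigMap name matches the diagram above, but the namespace and the exact data format are assumptions, not the authoritative schema.
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: tier-to-group-mapping
+  namespace: maas-api        # assumption: deployed alongside the MaaS API
+data:
+  tiers: |
+    - name: free
+      groups: ["tier-free-users"]
+    - name: premium
+      groups: ["tier-premium-users"]
+    - name: enterprise
+      groups: ["tier-enterprise-users"]
+```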
+
+#### Inference Service Component Details
+
+Once a user has obtained their token through the MaaS API, they can use it to make inference requests to the Gateway API. RHCL's Application Connectivity Policies then validate the token and enforce access control and rate limiting policies:
```mermaid
graph TB
subgraph "Client Layer"
- WebUI[Web UI]
- API[API Clients]
- CLI[CLI Tools]
+        Client[Client Applications<br/>with Service Account Token]
end
subgraph "Gateway Layer"
- Gateway[Gateway API]
- Auth[Authentication]
- RateLimit[Rate Limiting]
- Policy[Policy Engine]
+        GatewayAPI[**maas-default-gateway**<br/>maas.CLUSTER_DOMAIN]
+ Envoy[**Envoy Proxy**]
end
- subgraph "Service Layer"
- MaaSAPI[MaaS API]
- ModelService[Model Service]
- TokenService[Token Service]
- TierService[Tier Service]
+ subgraph "RHCL Policy Engine"
+        Kuadrant[**Kuadrant**<br/>Policy Attachment]
+        Authorino[**Authorino**<br/>Authentication Service]
+        Limitador[**Limitador**<br/>Rate Limiting Service]
end
- subgraph "Model Layer"
- KServe[KServe]
- Model1[Model 1]
- Model2[Model 2]
- ModelN[Model N]
+ subgraph "Policy Components"
+        AuthPolicy[**AuthPolicy**<br/>gateway-auth-policy]
+        RateLimitPolicy[**RateLimitPolicy**<br/>gateway-rate-limits]
+        TokenRateLimitPolicy[**TokenRateLimitPolicy**<br/>gateway-token-rate-limits]
end
- subgraph "Data Layer"
- ConfigMap[ConfigMaps]
- Secret[Secrets]
- PVC[Persistent Volumes]
+ subgraph "Model Access Control"
+        RBAC[**Kubernetes RBAC**<br/>Service Account Permissions]
+        LLMInferenceService[**LLMInferenceService**<br/>Model Access Control]
+ end
+
+ subgraph "Model Serving"
+ RHOAI[**RHOAI Platform**]
+        Models[**LLM Models**<br/>Qwen, Granite, Llama]
end
subgraph "Observability"
- Prometheus[Prometheus]
- Grafana[Grafana]
- Logs[Log Aggregation]
- end
-
- WebUI --> Gateway
- API --> Gateway
- CLI --> Gateway
-
- Gateway --> Auth
- Gateway --> RateLimit
- Gateway --> Policy
-
- Gateway --> MaaSAPI
- MaaSAPI --> ModelService
- MaaSAPI --> TokenService
- MaaSAPI --> TierService
-
- ModelService --> KServe
- KServe --> Model1
- KServe --> Model2
- KServe --> ModelN
-
- MaaSAPI --> ConfigMap
- MaaSAPI --> Secret
- KServe --> PVC
-
- MaaSAPI --> Prometheus
- Gateway --> Prometheus
- KServe --> Prometheus
- Prometheus --> Grafana
- Gateway --> Logs
- MaaSAPI --> Logs
+        Prometheus[**Prometheus**<br/>Metrics Collection]
+        Dashboards[**Observability Stack**<br/>Grafana/Perses Dashboards]
+ end
+
+ Client -->|Inference Request + Service Account Token| GatewayAPI
+ GatewayAPI --> Envoy
+
+ Envoy --> Kuadrant
+ Kuadrant --> Authorino
+ Kuadrant --> Limitador
+
+ Authorino --> AuthPolicy
+ Limitador --> RateLimitPolicy
+ Limitador --> TokenRateLimitPolicy
+
+ Envoy -->|Check Model Access| RBAC
+ RBAC --> LLMInferenceService
+ LLMInferenceService -->|POST Permission Check| RHOAI
+ RHOAI --> Models
+
+ Limitador -->|Usage Metrics| Prometheus
+ Prometheus --> Dashboards
+
+ style GatewayAPI fill:#f3e5f5
+ style Kuadrant fill:#fff3e0
+ style Authorino fill:#fff3e0
+ style Limitador fill:#fff3e0
+ style AuthPolicy fill:#ffeb3b
+ style RateLimitPolicy fill:#ffeb3b
+ style TokenRateLimitPolicy fill:#ffeb3b
+ style RBAC fill:#ffeb3b
+ style LLMInferenceService fill:#ffeb3b
+ style RHOAI fill:#e8f5e8
```
-## 🔄 Request Flow
+**Policy Engine Flow:**
+
+1. **User Request**: A user makes an inference request to the Gateway API with a valid token.
+2. **Service Account Authentication**: Authorino validates the service account token using the gateway-auth-policy AuthPolicy.
+3. **Rate Limiting**: Limitador enforces per-tier and per-user usage quotas using the gateway-rate-limits and gateway-token-rate-limits policies.
+4. **Model Access Control**: Kubernetes RBAC checks whether the service account has POST access to the specific LLMInferenceService.
+5. **Request Forwarding**: Only requests with the proper model access are forwarded to RHOAI.
+6. **Metrics Collection**: Limitador sends usage data to Prometheus for the observability dashboards.
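+
+For concreteness, a gateway-attached AuthPolicy along the lines of the authentication step above might look like this sketch. The policy and gateway names come from the diagrams in this document; the API version, `kubernetesTokenReview` rule, and audience value follow common Kuadrant conventions and are assumptions rather than the deployed manifests.
+
+```yaml
+apiVersion: kuadrant.io/v1
+kind: AuthPolicy
+metadata:
+  name: gateway-auth-policy
+spec:
+  targetRef:
+    group: gateway.networking.k8s.io
+    kind: Gateway
+    name: maas-default-gateway
+  rules:
+    authentication:
+      service-account-tokens:
+        kubernetesTokenReview:
+          audiences:
+            - maas-default-gateway   # assumption: audience expected on service account tokens
+```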
+
+## 🔄 Component Flows
-### 1. Authentication Flow
+### 1. Token Retrieval Flow (MaaS API)
+
+The MaaS API handles token generation and management for different user tiers:
+
+```mermaid
+sequenceDiagram
+ participant User
+    participant AdminUI as Admin/User UI
+    participant MaaSAPI as MaaS API
+    participant TokenDB as Token Database
+    participant TierDB as Tier Database
+
+ User->>AdminUI: Request Token
+ AdminUI->>MaaSAPI: POST /tokens
+ MaaSAPI->>TierDB: Check User Tier
+ TierDB-->>MaaSAPI: Tier Limits & Permissions
+ MaaSAPI->>TokenDB: Generate Token
+ TokenDB-->>MaaSAPI: Token + Metadata
+ MaaSAPI-->>AdminUI: Token Response
+ AdminUI-->>User: Token for Inference
+
+    Note over MaaSAPI: Token includes:<br/>- User ID<br/>- Tier Level<br/>- Rate Limits<br/>- Model Access
+```
+
+### 2. Gateway & Authentication Flow
+
+The Gateway API and RHCL components handle authentication and rate limiting:
```mermaid
sequenceDiagram
participant Client
- participant Gateway
- participant Auth
- participant MaaSAPI
- participant Model
-
- Client->>Gateway: Request with Token
- Gateway->>Auth: Validate Token
- Auth->>MaaSAPI: Check Token Validity
- MaaSAPI-->>Auth: Token Status + Tier Info
- Auth-->>Gateway: Authentication Result
- Gateway->>Model: Forward Request (if valid)
- Model-->>Gateway: Response
- Gateway-->>Client: Response
+    participant GatewayAPI as Gateway API
+    participant Kuadrant
+    participant Authorino
+    participant Limitador
+    participant MaaSAPI as MaaS API
+
+ Client->>GatewayAPI: Inference Request + Token
+ GatewayAPI->>Kuadrant: Apply Auth Policy
+    Kuadrant->>Authorino: Validate Token
+    Authorino->>MaaSAPI: Check Token Validity
+    MaaSAPI-->>Authorino: Token Status + Tier Info
+    Authorino-->>Kuadrant: Auth Result
+ Kuadrant->>Limitador: Check Rate Limits
+ Limitador-->>Kuadrant: Rate Limit Status
+ Kuadrant-->>GatewayAPI: Policy Decision
+ GatewayAPI-->>Client: Forward to Model or Reject
```
-### 2. Model Inference Flow
+### 3. Model Inference Flow
+
+The inference flow routes validated requests to RHOAI models:
```mermaid
sequenceDiagram
participant Client
- participant Gateway
- participant MaaSAPI
- participant KServe
- participant Model
-
- Client->>Gateway: POST /v1/models/{model}/infer
- Gateway->>MaaSAPI: Validate Request
- MaaSAPI-->>Gateway: Tier + Rate Limit Check
- Gateway->>KServe: Forward to Model
- KServe->>Model: Process Inference
- Model-->>KServe: Inference Result
- KServe-->>Gateway: Response
- Gateway-->>Client: Response
+    participant GatewayAPI as Gateway API
+    participant RHCL as RHCL Components
+    participant RHOAI as RHOAI Platform
+    participant Model as LLM Model
+
+ Client->>GatewayAPI: POST /v1/models/{model}/infer
+ GatewayAPI->>RHCL: Validate Request & Token
+ RHCL-->>GatewayAPI: Validation Success
+ GatewayAPI->>RHOAI: Forward Inference Request
+ RHOAI->>Model: Process Inference
+ Model-->>RHOAI: Inference Result
+ RHOAI-->>GatewayAPI: Response
+ GatewayAPI-->>Client: Model Response
+
+    Note over RHCL: Updates metrics:<br/>- Token usage<br/>- Request count<br/>- Tier consumption
```
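+
+The per-model access control described earlier (a Kubernetes RBAC check that the service account has POST access to a specific LLMInferenceService) could be granted with objects like the following sketch. All names and namespaces here are hypothetical, and the non-standard `post` verb mirrors the POST permission check shown in the component diagram:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: qwen-model-access          # hypothetical name
+  namespace: llm                   # hypothetical model namespace
+rules:
+  - apiGroups: ["serving.kserve.io"]
+    resources: ["llminferenceservices"]
+    resourceNames: ["qwen3"]       # hypothetical model
+    verbs: ["post"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: premium-qwen-access        # hypothetical name
+  namespace: llm
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: qwen-model-access
+subjects:
+  - kind: ServiceAccount
+    name: prem-user1-sa            # from the MaaS API diagram above
+    namespace: premium
+```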
## Core Components
-### Gateway Layer
+### MaaS API (Token Management)
-The gateway layer handles all incoming requests and implements security policies:
+The MaaS API is the central component for token generation and management:
-- **Gateway API**: Routes requests to appropriate services
-- **Kuadrant**: Policy Attachment Point for authentication and authorization
-- **Authorino**: Authentication and authorization service
-- **Limitador**: Token and Request Rate limiting service
+- **Token Generation**: Creates secure tokens with embedded metadata
+- **Tier Management**: Enforces subscription tier limits and permissions
+- **User Authentication**: Validates user credentials and permissions
+- **Rate Limit Configuration**: Sets token-specific rate limits based on tier
-### Management Layer
+### Gateway API & RHCL Components
-The management layer contains the core business logic:
+The gateway layer provides policy-based request handling:
-- **MaaS API**: Central service for token and tier management
+- **Gateway API**: Entry point for all inference requests
+- **Kuadrant**: Policy attachment point for authentication and authorization
+- **Authorino**: Authentication and authorization service that validates tokens
+- **Limitador**: Rate limiting service that enforces usage quotas
-### Model Layer
+### RHOAI (Model Serving)
-The model layer provides AI model serving capabilities:
+Red Hat OpenShift AI provides the model serving infrastructure:
-- **KServe**: Model serving platform
-- **Model Instances**: Individual AI models (LLMs, etc.)
+- **Model Hosting**: Runs LLM models (Qwen, Granite, Llama, etc.)
- **Scaling**: Automatic scaling based on demand
+- **Resource Management**: GPU allocation and management
+- **Model Lifecycle**: Model deployment, updates, and retirement
-## Flows
-
-### 1. Token Request Flow
+## Architecture Benefits
-
+### Security
+- **Token-based Authentication**: Secure, stateless authentication
+- **Policy Enforcement**: Centralized security policies via Kuadrant
+- **Rate Limiting**: Prevents abuse and ensures fair resource usage
-### 2. Model Inference Flow
+### Scalability
+- **Microservices Architecture**: Independent scaling of components
+- **Kubernetes Native**: Leverages OpenShift/Kubernetes scaling capabilities
+- **Tier-based Resource Allocation**: Different service levels for different user tiers
-
+### Observability
+- **Comprehensive Metrics**: Token usage, request rates, and tier consumption
+- **Centralized Logging**: All components log to centralized systems
+- **Monitoring**: Real-time monitoring of system health and performance