
refactor: use FastAPI app.state for dependency injection#115

Merged

amito merged 1 commit into redhat-et:main from amito:refactor/fastapi-state on Mar 22, 2026.

Conversation

@amito (Collaborator) commented Mar 17, 2026:

Replace global singleton pattern with FastAPI's app.state and Depends() for cleaner dependency injection. All shared instances are now initialized during app lifespan startup and injected via request.app.state.

Summary by CodeRabbit

  • New Features

    • App startup now performs centralized initialization and wires core services into request-scoped dependencies.
    • Deployment endpoints are namespace-aware and use async-safe execution for cluster operations.
  • Bug Fixes

    • Thread-safe cluster-manager caching with a namespace limit and clearer HTTP error mappings.
    • Safer deployment file matching to avoid unintended results.
  • Refactor

    • Shared singletons moved into app state; routes updated to consume injected services.

amito requested a review from anfredette on March 17, 2026 at 22:01
@coderabbitai bot commented Mar 17, 2026:

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an async FastAPI lifespan that initializes shared services into app.state at startup, migrates module singletons to request-scoped DI reading from request.app.state, moves cluster-manager creation behind an asyncio lock with per-namespace caching, and updates routes to use Depends for these services.
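The caching-and-locking flow described above can be sketched with the standard library alone. In this sketch, types.SimpleNamespace stands in for app.state, and ClusterManager is a hypothetical stand-in for the project's KubernetesClusterManager, whose real constructor shells out to kubectl and therefore blocks:

```python
import asyncio
import time
from types import SimpleNamespace

_MAX_CACHED_NAMESPACES = 32

class ClusterManager:
    """Hypothetical stand-in for a manager whose constructor blocks."""
    instances = 0

    def __init__(self, namespace: str):
        ClusterManager.instances += 1
        self.namespace = namespace
        time.sleep(0.05)  # simulate the blocking cluster probe

def init_app_state(state: SimpleNamespace) -> None:
    # Mirrors the idea of init_app_state(app): shared objects live on one state object.
    state.cluster_managers = {}
    state.cluster_manager_lock = asyncio.Lock()

async def get_cluster_manager(state: SimpleNamespace, namespace: str = "default") -> ClusterManager:
    managers = state.cluster_managers
    if namespace not in managers:
        async with state.cluster_manager_lock:
            if namespace not in managers:  # re-check after waiting on the lock
                if len(managers) >= _MAX_CACHED_NAMESPACES:
                    raise RuntimeError(f"Too many namespaces (limit {_MAX_CACHED_NAMESPACES})")
                # Run the blocking constructor off the event loop.
                managers[namespace] = await asyncio.to_thread(ClusterManager, namespace)
    return managers[namespace]

async def main() -> bool:
    state = SimpleNamespace()
    init_app_state(state)
    # Two concurrent "requests" for the same namespace share one manager.
    a, b = await asyncio.gather(
        get_cluster_manager(state, "default"),
        get_cluster_manager(state, "default"),
    )
    return a is b

SHARED = asyncio.run(main())
```

The double-checked lookup (test, lock, test again) is what keeps concurrent first requests from constructing two managers for the same namespace.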

Changes

  • App lifecycle & wiring (src/neuralnav/api/app.py): Adds an @asynccontextmanager lifespan(app: FastAPI) that calls init_app_state(app) via asyncio.to_thread, creates app.state.cluster_manager_lock on the event loop, wires lifespan into create_app(), and switches route imports to neuralnav.api.routes.
  • State initialization & DI providers (src/neuralnav/api/dependencies.py): Adds init_app_state(app) to populate app.state with ModelCatalog, SLOTemplateRepository, DeploymentGenerator, YAMLValidator, RecommendationWorkflow, and a cluster_managers cache; moves singletons into app.state; converts provider functions to request-scoped (request: Request); adds async get_cluster_manager_or_raise(request, namespace) guarded by app.state.cluster_manager_lock; enforces _MAX_CACHED_NAMESPACES = 32; uses run_in_threadpool for manager creation; maps cluster errors to HTTP 503; removes deployment-mode globals.
  • Configuration routes (src/neuralnav/api/routes/configuration.py): Endpoints now accept Request and inject DeploymentGenerator/YAMLValidator via Depends(...); mode read/write uses request.app.state.deployment_generator.simulator_mode; cluster endpoints call await get_cluster_manager_or_raise(request, namespace) and use run_in_threadpool for blocking manager calls; deployment_id is escaped before globbing; some response typings are tightened.
  • Recommendation & intent routes (src/neuralnav/api/routes/recommendation.py, src/neuralnav/api/routes/intent.py): Handlers updated to accept RecommendationWorkflow (and DeploymentGenerator where needed) via Depends(...) instead of creating them in-function; signatures and imports adjusted for DI.
  • Reference-data routes (src/neuralnav/api/routes/reference_data.py): list_models, list_gpu_types, and list_use_cases now accept ModelCatalog/SLOTemplateRepository via Depends(...), removing internal singleton calls.
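One of the configuration-route fixes is escaping deployment_id before globbing. A minimal stdlib illustration of why this matters (the file names and hostile_id here are hypothetical, not the project's real naming scheme):

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    # Two hypothetical generated deployment files.
    Path(d, "deploy-abc.yaml").write_text("kind: InferenceService\n")
    Path(d, "deploy-abc123.yaml").write_text("kind: InferenceService\n")

    hostile_id = "abc*"  # a wildcard smuggled into the identifier
    # Interpolating the id directly turns user input into a glob pattern.
    unescaped = glob.glob(str(Path(d, f"deploy-{hostile_id}.yaml")))
    # glob.escape neutralizes metacharacters so the id matches only literally.
    escaped = glob.glob(str(Path(d, f"deploy-{glob.escape(hostile_id)}.yaml")))

UNESCAPED_COUNT = len(unescaped)  # the wildcard matches both files
ESCAPED_COUNT = len(escaped)      # only a literal "deploy-abc*.yaml" would match
```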

Sequence Diagram

sequenceDiagram
    participant Startup as App Startup
    participant Init as init_app_state()
    participant AppState as app.state
    participant Request as HTTP Request
    participant Provider as Dependency Provider\n(get_*)
    participant Endpoint as Route Handler
    participant K8s as KubernetesClusterManager

    Startup->>Init: call init_app_state (offloaded via asyncio.to_thread)
    Init->>AppState: populate workflow, model_catalog,\ndeployment_generator, yaml_validator,\ncluster_managers cache
    Request->>Provider: resolve dependency (Depends)
    Provider->>AppState: read service from request.app.state
    Provider->>Endpoint: inject dependency
    Endpoint->>K8s: await get_cluster_manager_or_raise(request, namespace)
    K8s->>AppState: cached per-namespace manager\n(or created under lock via run_in_threadpool)
    Endpoint->>Endpoint: handle request using injected services

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed (skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title accurately and specifically describes the main refactoring objective: migrating from a global singleton pattern to FastAPI's app.state for dependency injection, the primary architectural change across the changeset.
  • Docstring Coverage: ✅ Passed. Coverage is 100.00%, above the required 80.00% threshold.


@coderabbitai bot left a comment:

🧹 Nitpick comments (1)
src/neuralnav/api/dependencies.py (1)

48-50: Empty cleanup function may need implementation.

close_app_state is currently a no-op. If any of the initialized resources (e.g., RecommendationWorkflow, KubernetesClusterManager instances in cluster_managers) hold connections or require explicit cleanup, this should be implemented to prevent resource leaks on shutdown.
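One possible shape for such a cleanup, sketched with stdlib stand-ins. SyncResource and AsyncResource are hypothetical; the real state entries and their cleanup method names may differ:

```python
import asyncio
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

class SyncResource:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class AsyncResource:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

async def close_app_state(state: SimpleNamespace) -> None:
    """Best-effort cleanup: call close() on each tracked entry, awaiting coroutines."""
    resources = [*state.cluster_managers.values(), state.workflow]
    for res in resources:
        closer = getattr(res, "close", None)
        if closer is None:
            continue
        try:
            result = closer()
            if asyncio.iscoroutine(result):
                await result
        except Exception:
            # One failing resource should not block cleanup of the others.
            logger.exception("cleanup failed for %r", res)
    state.cluster_managers.clear()

state = SimpleNamespace(cluster_managers={"default": SyncResource()}, workflow=AsyncResource())
sync_res = state.cluster_managers["default"]
asyncio.run(close_app_state(state))
```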

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/api/dependencies.py` around lines 48 - 50, close_app_state is
currently a no-op; implement shutdown logic to avoid resource leaks by iterating
known state entries (e.g., app.state.cluster_managers and
app.state.recommendation_workflow), detecting and calling their cleanup methods
(common names: close, shutdown, stop, terminate) and awaiting them if they are
coroutines; swallow/log exceptions per-item so one failure doesn't block others,
and finally clear or delete those app.state attributes to release references.
Ensure you reference the symbols close_app_state, RecommendationWorkflow,
KubernetesClusterManager, and cluster_managers when locating and updating the
function.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e1dec2a-00ff-4d77-8ccb-0f48414a523c

📥 Commits

Reviewing files that changed from the base of the PR and between 62d0219 and 89eab63.

📒 Files selected for processing (6)
  • src/neuralnav/api/app.py
  • src/neuralnav/api/dependencies.py
  • src/neuralnav/api/routes/configuration.py
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/routes/recommendation.py
  • src/neuralnav/api/routes/reference_data.py

@anfredette (Member) left a comment:


I left a couple of comments, but looks good to me otherwise.


def close_app_state(app: FastAPI) -> None:
    """Close resources and clear state."""

Member:

This function is a no-op. Should it be doing anything?

Member:

Suggested change
    managers = getattr(app.state, "cluster_managers", {})
    managers.clear()

Collaborator (author):

We can either remove it completely and re-introduce it if we have resources which need cleanup, or use your suggestion. We can still do without cleanup at this point (the cleanup now happens during process shutdown).
WDYT?

Member:

I'm fine with removing it.



-def get_deployment_generator() -> DeploymentGenerator:
+def get_deployment_generator(request: Request) -> DeploymentGenerator:
Member:

Nit: get_deployment_mode and set_deployment_mode take Request but are called directly (e.g. configuration.py:62) rather than injected via Depends(). Was this intentional?

Collaborator (author):

This is a very valid point, and not nit-picky at all. Both get_deployment_mode and set_deployment_mode are defined alongside other dependencies in dependencies.py, but don't follow the same pattern.
I think it makes more sense to inline their logic into the route handlers (get_mode() and set_mode() in configuration.py).

amito force-pushed the refactor/fastapi-state branch from 89eab63 to a3813f6 on March 19, 2026 at 17:20
@coderabbitai bot left a comment:

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/neuralnav/api/dependencies.py`:
- Around line 32-109: Run the code formatter to fix ruff-formatting errors in
src/neuralnav/api/dependencies.py (the CI failure). In practice, run `ruff
format src/neuralnav/api/dependencies.py` (or your repo's pre-commit/format
command) and commit the resulting changes; ensure formatting issues around the
module-level blocks and function definitions like init_app_state,
close_app_state, and get_cluster_manager_or_raise (and constants like
_MAX_CACHED_NAMESPACES) are resolved so `ruff format --check` passes.
- Around line 82-108: get_cluster_manager_or_raise is synchronous and uses
threading.Lock while calling the blocking
KubernetesClusterManager(namespace=...) constructor (which runs subprocess.run),
causing event-loop stalls; convert get_cluster_manager_or_raise to an async
function, replace the module-level threading.Lock (_cluster_manager_lock) with
an asyncio.Lock stored on app.state (or a module-level asyncio.Lock initialized
at import), perform the blocking constructor call inside a threadpool (e.g., via
asyncio.to_thread or loop.run_in_executor) and keep the critical-section cache
logic under the async lock, and then update all call sites in configuration.py
(deploy_to_cluster, get_k8s_deployment_status, delete_deployment,
list_all_deployments) to await get_cluster_manager_or_raise. Ensure exceptions
are preserved (raise HTTPException on KubernetesDeploymentError) and the
namespace cache size check (_MAX_CACHED_NAMESPACES) remains enforced.

ℹ️ Review info: Run ID b9cef2a1-90af-40bb-b551-172a8f014bce

📥 Commits

Reviewing files that changed from the base of the PR and between 89eab63 and a3813f6.

📒 Files selected for processing (6)
  • src/neuralnav/api/app.py
  • src/neuralnav/api/dependencies.py
  • src/neuralnav/api/routes/configuration.py
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/routes/recommendation.py
  • src/neuralnav/api/routes/reference_data.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/app.py
  • src/neuralnav/api/routes/reference_data.py

Comment on lines +82 to +108
 _cluster_manager_lock = threading.Lock()
 _MAX_CACHED_NAMESPACES = 32


-def get_cluster_manager_or_raise(namespace: str = "default") -> KubernetesClusterManager:
+def get_cluster_manager_or_raise(
+    request: Request, namespace: str = "default"
+) -> KubernetesClusterManager:
     """Get or create a cluster manager, raising an exception if not accessible."""
-    manager = get_cluster_manager(namespace)
-    if manager is None:
-        try:
-            return KubernetesClusterManager(namespace=namespace)
-        except KubernetesDeploymentError as e:
-            from fastapi import HTTPException, status
-
-            raise HTTPException(
-                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
-                detail=f"Kubernetes cluster not accessible: {str(e)}",
-            ) from e
-    return manager
+    managers: dict[str, KubernetesClusterManager] = request.app.state.cluster_managers
+    if namespace not in managers:
+        with _cluster_manager_lock:
+            if namespace not in managers:
+                if len(managers) >= _MAX_CACHED_NAMESPACES:
+                    raise HTTPException(
+                        status_code=status.HTTP_400_BAD_REQUEST,
+                        detail=f"Too many namespaces (limit {_MAX_CACHED_NAMESPACES})",
+                    )
+                try:
+                    managers[namespace] = KubernetesClusterManager(namespace=namespace)
+                    logger.info(
+                        "Kubernetes cluster manager initialized for namespace=%s", namespace
+                    )
+                except KubernetesDeploymentError as e:
+                    raise HTTPException(
+                        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
+                        detail=f"Kubernetes cluster not accessible: {e}",
+                    ) from e

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import ast
from pathlib import Path

# 1) Confirm get_cluster_manager_or_raise is sync and uses lock
dep = next(Path(".").rglob("dependencies.py"))
tree = ast.parse(dep.read_text())
for node in tree.body:
    if isinstance(node, ast.FunctionDef) and node.name == "get_cluster_manager_or_raise":
        print(f"{dep}:{node.lineno} -> sync function")
        uses_lock = any(
            isinstance(n, ast.With) and any(
                isinstance(item.context_expr, ast.Name) and item.context_expr.id == "_cluster_manager_lock"
                for item in n.items
            )
            for n in ast.walk(node)
        )
        print(f"uses _cluster_manager_lock via with: {uses_lock}")

# 2) Find async functions calling get_cluster_manager_or_raise
for p in Path(".").rglob("*.py"):
    try:
        t = ast.parse(p.read_text())
    except Exception:
        continue
    for fn in [n for n in ast.walk(t) if isinstance(n, ast.AsyncFunctionDef)]:
        for call in [n for n in ast.walk(fn) if isinstance(n, ast.Call)]:
            if isinstance(call.func, ast.Name) and call.func.id == "get_cluster_manager_or_raise":
                print(f"async caller: {p}:{call.lineno} in async def {fn.name}")
PY

Repository: redhat-et/neuralnav

Length of output: 533


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check how get_cluster_manager_or_raise is called in async handlers
echo "=== Call patterns in async handlers ==="
rg -A 3 -B 3 "get_cluster_manager_or_raise" src/neuralnav/api/routes/configuration.py

echo ""
echo "=== Checking KubernetesClusterManager init signature ==="
find . -name "*.py" -type f | xargs grep -l "class KubernetesClusterManager"

Repository: redhat-et/neuralnav

Length of output: 1287


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== KubernetesClusterManager.__init__ implementation ==="
sed -n '/class KubernetesClusterManager/,/^class /p' src/neuralnav/cluster/manager.py | head -50

echo ""
echo "=== Check if __init__ does blocking I/O (Kubernetes API calls) ==="
rg -A 20 "def __init__" src/neuralnav/cluster/manager.py | head -40

Repository: redhat-et/neuralnav

Length of output: 2926


Blocking subprocess calls in async request handlers degrade concurrency.

get_cluster_manager_or_raise is a sync function called directly from 4 async route handlers (deploy_to_cluster, get_k8s_deployment_status, delete_deployment, list_all_deployments) without awaiting. Its KubernetesClusterManager constructor calls subprocess.run(["kubectl", "cluster-info"], timeout=10), which blocks the event loop. Combined with threading.Lock, cache misses can cause up to 10-second stalls across concurrent requests.

Make the function async using asyncio.Lock and offload blocking constructor to threadpool:

Refactoring steps
  1. Change function to async and use asyncio.Lock:
-import threading
+import asyncio
+from starlette.concurrency import run_in_threadpool
-_cluster_manager_lock = threading.Lock()
+_cluster_manager_lock = asyncio.Lock()
-def get_cluster_manager_or_raise(
+async def get_cluster_manager_or_raise(
     request: Request, namespace: str = "default"
 ) -> KubernetesClusterManager:
-    with _cluster_manager_lock:
+    async with _cluster_manager_lock:
-        managers[namespace] = KubernetesClusterManager(namespace=namespace)
+        managers[namespace] = await run_in_threadpool(
+            KubernetesClusterManager, namespace=namespace
+        )
  2. Update all 4 call sites in configuration.py to await:
    • Line 238: manager = await get_cluster_manager_or_raise(...)
    • Line 336: manager = await get_cluster_manager_or_raise(...)
    • Line 416: manager = await get_cluster_manager_or_raise(...)
    • Line 453: manager = await get_cluster_manager_or_raise(...)
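The impact described above can be demonstrated without FastAPI or kubectl. Here probe_cluster is a hypothetical stand-in for the blocking constructor, and a heartbeat task counts how often the event loop gets to run while the probe executes:

```python
import asyncio
import time

async def heartbeat(stop: asyncio.Event, ticks: list[int]) -> None:
    # Counts event-loop iterations; it stalls whenever the loop is blocked.
    while not stop.is_set():
        ticks[0] += 1
        await asyncio.sleep(0.01)

def probe_cluster() -> str:
    # Stand-in for a blocking call such as a subprocess with a timeout.
    time.sleep(0.3)
    return "ok"

async def main() -> tuple[int, int]:
    blocked, offloaded = [0], [0]

    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(stop, blocked))
    await asyncio.sleep(0)
    probe_cluster()  # called directly: the whole loop freezes for 0.3 s
    stop.set()
    await hb

    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(stop, offloaded))
    await asyncio.to_thread(probe_cluster)  # offloaded: the loop keeps ticking
    stop.set()
    await hb
    return blocked[0], offloaded[0]

BLOCKED_TICKS, OFFLOADED_TICKS = asyncio.run(main())
```

With the direct call the heartbeat fires only before the probe starts; with asyncio.to_thread it keeps ticking for the entire probe.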
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (current code, then proposed replacement):

_cluster_manager_lock = threading.Lock()
_MAX_CACHED_NAMESPACES = 32


def get_cluster_manager_or_raise(
    request: Request, namespace: str = "default"
) -> KubernetesClusterManager:
    """Get or create a cluster manager, raising an exception if not accessible."""
    managers: dict[str, KubernetesClusterManager] = request.app.state.cluster_managers
    if namespace not in managers:
        with _cluster_manager_lock:
            if namespace not in managers:
                if len(managers) >= _MAX_CACHED_NAMESPACES:
                    raise HTTPException(
                        status_code=status.HTTP_400_BAD_REQUEST,
                        detail=f"Too many namespaces (limit {_MAX_CACHED_NAMESPACES})",
                    )
                try:
                    managers[namespace] = KubernetesClusterManager(namespace=namespace)
                    logger.info(
                        "Kubernetes cluster manager initialized for namespace=%s", namespace
                    )
                except KubernetesDeploymentError as e:
                    raise HTTPException(
                        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                        detail=f"Kubernetes cluster not accessible: {e}",
                    ) from e

Proposed replacement:

import asyncio

from starlette.concurrency import run_in_threadpool

_cluster_manager_lock = asyncio.Lock()
_MAX_CACHED_NAMESPACES = 32


async def get_cluster_manager_or_raise(
    request: Request, namespace: str = "default"
) -> KubernetesClusterManager:
    """Get or create a cluster manager, raising an exception if not accessible."""
    managers: dict[str, KubernetesClusterManager] = request.app.state.cluster_managers
    if namespace not in managers:
        async with _cluster_manager_lock:
            if namespace not in managers:
                if len(managers) >= _MAX_CACHED_NAMESPACES:
                    raise HTTPException(
                        status_code=status.HTTP_400_BAD_REQUEST,
                        detail=f"Too many namespaces (limit {_MAX_CACHED_NAMESPACES})",
                    )
                try:
                    managers[namespace] = await run_in_threadpool(
                        KubernetesClusterManager, namespace=namespace
                    )
                    logger.info(
                        "Kubernetes cluster manager initialized for namespace=%s", namespace
                    )
                except KubernetesDeploymentError as e:
                    raise HTTPException(
                        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                        detail=f"Kubernetes cluster not accessible: {e}",
                    ) from e

@anfredette (Member) left a comment:

Just to close the loop, after you make those two changes, I'll be good to go with this one, so I'll approve it, and you can merge it when you're done.

It looks like you also have some lint and formatting issues. At least one of the lint issues looked like a formatting issue, so just running the formatter may take care of both.

amito force-pushed the refactor/fastapi-state branch from a3813f6 to b555548 on March 22, 2026 at 08:05
@coderabbitai bot left a comment:

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/neuralnav/api/routes/configuration.py (1)

296-317: ⚠️ Potential issue | 🟠 Major

Blocking subprocess call bypasses the refactored async pattern.

get_cluster_status directly instantiates KubernetesClusterManager(namespace="default") at line 305, which:

  1. Calls subprocess.run(["kubectl", "cluster-info"], timeout=10) synchronously, blocking the event loop
  2. Bypasses the app.state.cluster_managers cache used by all other cluster endpoints
  3. Creates a new instance per request instead of reusing cached managers

This is inconsistent with the refactored async pattern and can cause up to 10-second stalls under concurrent requests.

Proposed fix: use the async cluster manager helper
 @router.get("/cluster-status")
-async def get_cluster_status():
+async def get_cluster_status(http_request: Request, namespace: str = "default"):
     """
     Get Kubernetes cluster status.

     Returns:
         Cluster accessibility and basic info
     """
     try:
-        temp_manager = KubernetesClusterManager(namespace="default")
-        deployments = temp_manager.list_inferenceservices()
+        manager = await get_cluster_manager_or_raise(http_request, namespace)
+        deployments = manager.list_inferenceservices()

         return {
             "accessible": True,
-            "namespace": temp_manager.namespace,
+            "namespace": manager.namespace,
             "inference_services": deployments,
             "count": len(deployments),
             "message": "Cluster accessible",
         }
     except Exception as e:
         logger.error(f"Failed to query cluster status: {e}")
         return {"accessible": False, "error": str(e)}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/api/routes/configuration.py` around lines 296 - 317,
get_cluster_status currently constructs KubernetesClusterManager() directly
(causing a blocking subprocess call and bypassing the shared cache); change the
endpoint to accept a Request (e.g., async def get_cluster_status(request:
Request)) and obtain the manager from the app-level cache instead of
instantiating KubernetesClusterManager — e.g., lookup
request.app.state.cluster_managers.get("default") and if the stored value is a
coroutine/factory await or call the async helper that initializes managers,
otherwise use the cached instance; ensure you import fastapi.Request, handle a
missing manager by returning accessible: False with an error, and remove the
direct KubernetesClusterManager(...) construction so no synchronous
subprocess.run is executed in the request handler.

ℹ️ Review info: Run ID 73c24a50-b331-4906-8ce8-0d814963c9cf

📥 Commits

Reviewing files that changed from the base of the PR and between a3813f6 and b555548.

📒 Files selected for processing (6)
  • src/neuralnav/api/app.py
  • src/neuralnav/api/dependencies.py
  • src/neuralnav/api/routes/configuration.py
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/routes/recommendation.py
  • src/neuralnav/api/routes/reference_data.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/routes/reference_data.py
  • src/neuralnav/api/routes/recommendation.py

amito force-pushed the refactor/fastapi-state branch from b555548 to 34450bf on March 22, 2026 at 08:13
@coderabbitai bot left a comment:

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/neuralnav/api/dependencies.py`:
- Around line 44-45: The module-level _cluster_manager_lock must be moved onto
app.state to avoid asyncio.Lock being bound to the wrong event loop; remove the
module-scope _cluster_manager_lock and instead initialize and store an
asyncio.Lock instance on app.state (e.g., app.state._cluster_manager_lock)
alongside app.state.cluster_managers, then update any code that
acquires/releases _cluster_manager_lock to reference
app.state._cluster_manager_lock (and create it lazily on startup if needed) so
the lock is owned by the same loop as the FastAPI app state.

ℹ️ Review info: Run ID 14013dde-f0a8-4fca-aa0c-5ada8a2c4ab5

📥 Commits

Reviewing files that changed from the base of the PR and between b555548 and 34450bf.

📒 Files selected for processing (6)
  • src/neuralnav/api/app.py
  • src/neuralnav/api/dependencies.py
  • src/neuralnav/api/routes/configuration.py
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/routes/recommendation.py
  • src/neuralnav/api/routes/reference_data.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/app.py
  • src/neuralnav/api/routes/recommendation.py
  • src/neuralnav/api/routes/configuration.py

Comment on lines +44 to +45
    app.state.cluster_managers = {}  # dict[str, KubernetesClusterManager]
    app.state.workflow = RecommendationWorkflow()

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Python 3.11+ / 3.12, can a module-level asyncio.Lock() be safely reused across multiple event loops, or can it become bound to the first loop that acquires it and later fail in a different loop?

💡 Result:

No, a module-level asyncio.Lock cannot be safely reused across multiple event loops in Python 3.11+ or 3.12. It becomes bound to the first event loop that uses it (specifically, the first time _get_loop is called during an operation like acquire) and will raise a RuntimeError if attempted to be used with a different loop. This behavior comes from asyncio.Lock inheriting from mixins._LoopBoundMixin, which implements _get_loop to lazily bind the instance to the currently running loop (asyncio.get_running_loop) under a global lock for thread-safety. Once bound (self._loop is set), any subsequent use checks that the current running loop matches the bound loop; otherwise, it raises RuntimeError(f'{self!r} is bound to a different event loop'). Official docs note the removal of the explicit loop parameter in 3.10 (sources 1,4,6), but the binding still occurs dynamically via this mixin (sources 2,14,18,25). Stack Overflow discussions confirm that pre-3.10 locks bound at creation, but post-3.10 they bind at first use—still preventing reuse across different loops (sources 3,7). Creating the lock at module level (outside any loop) means it starts unbound, binds to the first loop that acquires it, and fails in subsequent different loops (e.g., multiple asyncio.run calls). To use across multiple loops, create a new Lock instance per loop, ideally inside the async context.


Move the cluster-manager lock to app.state.

cluster_managers is stored on app.state, but _cluster_manager_lock remains at module scope. A module-level asyncio.Lock binds to the first event loop that acquires it and will raise RuntimeError if used by a different loop. This causes failures in multi-loop scenarios (e.g., loop-per-test setups). Storing the lock on app.state alongside the cache ensures consistent ownership and prevents cross-loop binding issues.
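The multi-loop scenario is easy to reproduce with the standard library alone, since each asyncio.run() call creates a fresh event loop. This demo only shows the loop identities; per the query result above, a module-level asyncio.Lock binds lazily to the first loop that uses it under contention and then raises RuntimeError from any other loop, which is why the suggestion creates the lock during startup and stores it on app.state:

```python
import asyncio

observed_loops = []

async def record_loop() -> None:
    observed_loops.append(asyncio.get_running_loop())

# Two separate asyncio.run() calls (think: loop-per-test setups) use two loops.
asyncio.run(record_loop())
asyncio.run(record_loop())

# A module-level primitive created before either call would outlive the first loop.
DISTINCT_LOOPS = observed_loops[0] is not observed_loops[1]
```

Creating the lock inside the lifespan, on the loop that will serve requests, sidesteps the cross-loop binding entirely.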

Suggested change
 def init_app_state(app: FastAPI) -> None:
     """Initialize all singletons on app.state during lifespan startup."""
     app.state.model_catalog = ModelCatalog()
     app.state.slo_repo = SLOTemplateRepository()
     app.state.deployment_generator = DeploymentGenerator(simulator_mode=False)
     app.state.yaml_validator = YAMLValidator()
     app.state.cluster_managers = {}  # dict[str, KubernetesClusterManager]
+    app.state.cluster_manager_lock = asyncio.Lock()
     app.state.workflow = RecommendationWorkflow()
 
 
-_cluster_manager_lock = asyncio.Lock()
 _MAX_CACHED_NAMESPACES = 32
 
 
 async def get_cluster_manager_or_raise(
     request: Request, namespace: str = "default"
 ) -> KubernetesClusterManager:
     """Get or create a cluster manager, raising an exception if not accessible."""
     managers: dict[str, KubernetesClusterManager] = request.app.state.cluster_managers
+    cluster_manager_lock: asyncio.Lock = request.app.state.cluster_manager_lock
     if namespace not in managers:
-        async with _cluster_manager_lock:
+        async with cluster_manager_lock:
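As a self-contained sketch of the suggested shape (stand-in classes and a `SimpleNamespace` in place of `app.state`; names are illustrative, not the project's real code), get-or-create with the lock living on the state object looks like:

```python
import asyncio
from types import SimpleNamespace

MAX_CACHED_NAMESPACES = 32  # mirrors _MAX_CACHED_NAMESPACES above

class FakeClusterManager:
    """Stand-in for KubernetesClusterManager."""
    def __init__(self, namespace: str):
        self.namespace = namespace

def init_app_state() -> SimpleNamespace:
    # Both the cache and its lock live on (a stand-in for) app.state, so the
    # lock is created per app instance rather than at import time.
    return SimpleNamespace(cluster_managers={}, cluster_manager_lock=asyncio.Lock())

async def get_cluster_manager(state: SimpleNamespace,
                              namespace: str = "default") -> FakeClusterManager:
    managers = state.cluster_managers
    if namespace not in managers:
        async with state.cluster_manager_lock:
            # Re-check under the lock: another coroutine may have won the race.
            if namespace not in managers:
                if len(managers) >= MAX_CACHED_NAMESPACES:
                    # Evict the oldest entry; dicts preserve insertion order.
                    managers.pop(next(iter(managers)))
                managers[namespace] = FakeClusterManager(namespace)
    return managers[namespace]

async def main() -> bool:
    state = init_app_state()  # what the lifespan startup would do
    a = await get_cluster_manager(state, "team-a")
    b = await get_cluster_manager(state, "team-a")
    return a is b

cached = asyncio.run(main())
print(cached)  # second call returns the cached instance
```

The double-check inside the lock matters: two coroutines can both see a cache miss before either acquires the lock.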
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/api/dependencies.py` around lines 44 - 45, the module-level
_cluster_manager_lock must be moved onto app.state so the asyncio.Lock cannot
bind to the wrong event loop; remove the module-scope _cluster_manager_lock,
initialize an asyncio.Lock on app.state (e.g., app.state.cluster_manager_lock)
alongside app.state.cluster_managers during lifespan startup, and update any
code that acquires/releases the lock to reference
app.state.cluster_manager_lock so the lock is owned by the same app instance
as the rest of the shared state.

Replace global singleton pattern with FastAPI's app.state and
Depends() for cleaner dependency injection. All shared instances
are now initialized during app lifespan startup and injected via
request.app.state.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Amit Oren <amoren@redhat.com>
@amito amito force-pushed the refactor/fastapi-state branch from 34450bf to e551bdf Compare March 22, 2026 08:34
@amito
Collaborator Author

amito commented Mar 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/neuralnav/api/routes/configuration.py (2)

61-75: Potential race on simulator_mode for concurrent mode changes.

Direct mutation of gen.simulator_mode without synchronization could lead to TOCTOU issues if multiple concurrent PUT requests arrive. However, since this is a simple boolean assignment (atomic under CPython's GIL) and mode changes are typically rare admin operations, the practical impact is low.

If concurrent mode switching becomes a concern, consider adding a lock around the read/write.
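If that lock ever becomes necessary, a minimal shape could look like the following (stand-in class; `mode_lock` is a hypothetical attribute, not part of the real DeploymentGenerator):

```python
import asyncio

class DeploymentGenerator:
    """Minimal stand-in for the real generator."""
    def __init__(self, simulator_mode: bool = False):
        self.simulator_mode = simulator_mode
        self.mode_lock = asyncio.Lock()  # hypothetical, created with the instance

async def set_mode(gen: DeploymentGenerator, simulator: bool) -> bool:
    async with gen.mode_lock:           # serialize writers
        changed = gen.simulator_mode != simulator
        gen.simulator_mode = simulator
        return changed

async def get_mode(gen: DeploymentGenerator) -> bool:
    async with gen.mode_lock:           # and readers
        return gen.simulator_mode

async def main():
    gen = DeploymentGenerator()
    return await set_mode(gen, True), await get_mode(gen)

changed, mode = asyncio.run(main())
print(changed, mode)
```

Since the handlers are async, an `asyncio.Lock` (not `threading.Lock`) is the right primitive: it serializes coroutines without blocking the event loop.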

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/api/routes/configuration.py` around lines 61 - 75, The
get_mode/set_mode handlers directly read/write gen.simulator_mode causing a
potential race for concurrent PUTs; wrap accesses with an async lock on the
deployment generator (e.g., add or use an asyncio.Lock stored on
http_request.app.state.deployment_generator, reference symbols gen, get_mode,
set_mode, gen.simulator_mode, and DeploymentModeRequest) so set_mode acquires
the lock before mutating simulator_mode and get_mode acquires the lock when
reading; ensure the lock is created when the deployment_generator is initialized
and use await lock in the handlers to serialize read/write operations.

459-474: Consider parallelizing status fetches for better performance.

The loop makes sequential await run_in_threadpool(...) calls for each deployment (2 calls per deployment). For clusters with many InferenceServices, this could become slow. Consider using asyncio.gather to parallelize these fetches.

♻️ Optional optimization
-        deployments = []
-        for deployment_id in deployment_ids:
-            svc_status = await run_in_threadpool(manager.get_inferenceservice_status, deployment_id)
-            pods = await run_in_threadpool(manager.get_deployment_pods, deployment_id)
-
-            deployments.append({"deployment_id": deployment_id, "status": svc_status, "pods": pods})
+        async def fetch_deployment_info(deployment_id):
+            svc_status = await run_in_threadpool(manager.get_inferenceservice_status, deployment_id)
+            pods = await run_in_threadpool(manager.get_deployment_pods, deployment_id)
+            return {"deployment_id": deployment_id, "status": svc_status, "pods": pods}
+
+        deployments = await asyncio.gather(*[fetch_deployment_info(d) for d in deployment_ids])
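A runnable approximation of the gathered version, using `asyncio.to_thread` as a stand-in for Starlette's `run_in_threadpool` and fake blocking calls in place of the Kubernetes client:

```python
import asyncio
import time

def get_status(deployment_id: str) -> str:
    time.sleep(0.05)  # stand-in for a blocking Kubernetes API call
    return f"{deployment_id}: Ready"

def get_pods(deployment_id: str) -> list:
    time.sleep(0.05)
    return [f"{deployment_id}-predictor-0"]

async def fetch(deployment_id: str) -> dict:
    # The two blocking calls for one deployment also run concurrently.
    status, pods = await asyncio.gather(
        asyncio.to_thread(get_status, deployment_id),
        asyncio.to_thread(get_pods, deployment_id),
    )
    return {"deployment_id": deployment_id, "status": status, "pods": pods}

async def main():
    ids = [f"svc-{i}" for i in range(4)]
    start = time.perf_counter()
    deployments = await asyncio.gather(*(fetch(d) for d in ids))
    return deployments, time.perf_counter() - start

deployments, elapsed = asyncio.run(main())
# 8 blocking calls of 50 ms each would take ~0.4 s sequentially;
# overlapped in the thread pool they finish in a fraction of that.
print(len(deployments), elapsed < 0.4)
```

`asyncio.gather` also preserves input order, so the resulting list lines up with `deployment_ids` just as the sequential loop did.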
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/api/routes/configuration.py` around lines 459 - 474, The loop
is doing sequential await run_in_threadpool calls for
manager.get_inferenceservice_status and manager.get_deployment_pods per
deployment, causing slowness; refactor to run these I/O calls concurrently by
creating tasks (using asyncio.gather) for each deployment: first await
run_in_threadpool(manager.list_inferenceservices) to get deployment_ids, then
for each deployment_id create coroutines that call
run_in_threadpool(manager.get_inferenceservice_status, deployment_id) and
run_in_threadpool(manager.get_deployment_pods, deployment_id) (you can gather
the two per-id coroutines or create a single per-id task that gathers both),
await asyncio.gather over all per-id tasks, and then build the deployments list
with {"deployment_id": ..., "status": ..., "pods": ...} from the gathered
results.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 40628212-cdb5-4bb7-a0fc-af30d4d48d55

📥 Commits

Reviewing files that changed from the base of the PR and between 34450bf and e551bdf.

📒 Files selected for processing (6)
  • src/neuralnav/api/app.py
  • src/neuralnav/api/dependencies.py
  • src/neuralnav/api/routes/configuration.py
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/routes/recommendation.py
  • src/neuralnav/api/routes/reference_data.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/neuralnav/api/routes/intent.py
  • src/neuralnav/api/app.py
  • src/neuralnav/api/routes/recommendation.py

@amito amito merged commit 3bff2da into redhat-et:main Mar 22, 2026
7 checks passed
