| Area | Enhancement | Goal |
|---|---|---|
| 🧭 Discovery | Add health status + TTL expiry in registry | Detect stale or crashed agents |
| 🧩 Metrics Service | Agents expose /metrics endpoint (Prometheus style) | Continuous health monitoring |
| ⚙️ Routing Intelligence | Weighted routing: combine CPU%, latency, queue depth | Smarter delegation decisions |
| 🔄 Async Queues | Each agent holds an async queue for incoming jobs | Avoid blocking and add fairness |
| 📊 Dashboard | Live FastAPI+WebSocket UI (auto-refresh metrics) | Visualize mesh health in real time |
| 🔐 Security | Token-based auth for registration + RPC calls | Prevent rogue nodes |
| 🧰 Resiliency | Retry & circuit breaker (tenacity or custom) | Handle flaky peers gracefully |
Add TTL expiry and an optional authentication key:

```python
# TTL expiry: clean up stale agents every 30s
import asyncio
from datetime import datetime, timedelta

TTL = timedelta(seconds=40)

async def cleanup_task():
    while True:
        now = datetime.utcnow()
        dead = [n for n, v in registry.items() if now - v["last_heartbeat"] > TTL]
        for d in dead:
            registry.pop(d, None)
            print(f"🗑️ Removed stale agent {d}")
        await asyncio.sleep(30)

@app.on_event("startup")
async def startup_event():
    asyncio.create_task(cleanup_task())
```

Optional: require a shared token via header for registration:
```python
AUTH_TOKEN = "mesh-secret"

@app.post("/register")
async def register(request: Request):
    if request.headers.get("X-Mesh-Token") != AUTH_TOKEN:
        return {"error": "unauthorized"}
    ...
```

Add a lightweight /metrics route for Prometheus-style scraping:
```python
import psutil

@app.get("/metrics")
async def metrics():
    cpu = psutil.cpu_percent(interval=0.1)
    mem = psutil.virtual_memory().percent
    qlen = job_queue.qsize()
    return {"agent": AGENT_NAME, "cpu": cpu, "mem": mem, "queue_len": qlen}
```

Add a job queue to handle parallel tasks safely:
```python
import random
from asyncio import Queue

job_queue = Queue(maxsize=20)

async def worker_loop():
    while True:
        job = await job_queue.get()
        print(f"🔧 {AGENT_NAME} processing: {job}")
        await asyncio.sleep(random.uniform(0.3, 1.0))
        job_queue.task_done()

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker_loop())
```

Handle incoming jobs:
```python
async def handle_enqueue(params):
    text = params.get("text")
    if job_queue.full():
        return {"status": "busy"}
    await job_queue.put(text)
    return {"status": "queued", "text": text}
```

Register the new RPC:
```python
METHODS["enqueue"] = handle_enqueue
```

Modify the delegation algorithm to consider CPU, latency, and queue length:
```python
def compute_score(agent):
    # Lower score = better target
    return 0.6 * agent["cpu"] + 0.3 * agent["latency_ms"] + 0.1 * agent.get("queue_len", 0)

target = min(stats, key=compute_score)
```

Use tenacity for robust remote RPCs:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.5, min=0.5, max=3))
async def safe_rpc_call(url, payload):
    async with httpx.AsyncClient() as client:
        r = await client.post(f"http://{url}/rpc", json=payload, timeout=5)
        return r.json()
```

Replace all direct client.post(...) calls with safe_rpc_call(...).
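The roadmap table also allows for a custom implementation instead of tenacity. If you'd rather not add the dependency, a minimal hand-rolled retry with exponential backoff plus a simple consecutive-failure circuit breaker might look like the sketch below; the class and function names, failure threshold, and cooldown values are all illustrative, not part of the code above.

```python
import asyncio
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a probe after `cooldown` seconds."""
    def __init__(self, threshold=3, cooldown=10.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: reset and let one attempt through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

async def with_retry(fn, attempts=3, base_delay=0.5, breaker=None):
    """Retry an async callable with exponential backoff, honoring an optional breaker."""
    for i in range(attempts):
        if breaker and not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = await fn()
            if breaker:
                breaker.record(True)
            return result
        except Exception:
            if breaker:
                breaker.record(False)
            if i == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** i))
```

An RPC call would then be wrapped as `await with_retry(lambda: safe_rpc_call(url, payload))`, with one shared breaker per peer so a persistently dead agent stops consuming retry budget.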
Add a simple monitoring dashboard showing CPU, memory, queue, and latency across agents:

```python
from fastapi import FastAPI, WebSocket
import httpx, asyncio

app = FastAPI(title="Mesh Dashboard")

@app.websocket("/ws")
async def ws_metrics(ws: WebSocket):
    await ws.accept()
    async with httpx.AsyncClient() as client:
        while True:
            agents = (await client.get("http://localhost:8500/services")).json()["active_agents"]
            data = []
            for name, meta in agents.items():
                try:
                    r = await client.get(f"http://{meta['url']}/metrics", timeout=1)
                    data.append(r.json())
                except Exception:
                    pass  # skip agents that fail to respond in time
            await ws.send_json(data)
            await asyncio.sleep(3)
```

Then build a simple dashboard.html that connects to /ws and renders a table of live metrics using vanilla JavaScript or Chart.js.
| Feature | Description |
|---|---|
| 🧩 Agent Roles | Planner, Executor, Evaluator roles with specialized JSON-RPC endpoints |
| 🔄 Job Replay Log | Each agent writes completed jobs to SQLite for traceability |
| 🌐 Distributed Tracing | Add OpenTelemetry spans around every RPC call |
| 🧠 Knowledge Sharing | Agents can gossip or broadcast cached insights periodically |
| 🔒 TLS Mesh | Run agents with mutual TLS certificates using uvicorn + SSL |
| ⚡ Persistent State | Redis or PostgreSQL backend for registry + job tracking |
| 🪶 Lightweight SDK | Build a Python client: from agentmesh import MeshClient to hide RPC complexity |
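As one example of how small these future features can start, the job replay log row could be sketched with the stdlib sqlite3 module. The table name, columns, and function names below are illustrative assumptions, not part of the current codebase:

```python
import sqlite3
import time

def open_replay_log(path="replay.db"):
    # One table of completed jobs; WAL mode keeps concurrent writes cheap
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            agent TEXT NOT NULL,
            payload TEXT NOT NULL,
            status TEXT NOT NULL,
            finished_at REAL NOT NULL
        )
    """)
    return conn

def log_job(conn, agent, payload, status="done"):
    # Called from the worker loop right after job_queue.task_done()
    conn.execute(
        "INSERT INTO jobs (agent, payload, status, finished_at) VALUES (?, ?, ?, ?)",
        (agent, payload, status, time.time()),
    )
    conn.commit()
```

Replaying or auditing a run is then a plain SQL query over the `jobs` table, ordered by `finished_at`.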
```
agent_mesh/
├── discovery_server.py
├── agent_node.py
├── orchestrator_client.py
├── dashboard_server.py
├── core/
│   ├── rpc_utils.py
│   ├── metrics.py
│   └── routing.py
├── static/
│   └── dashboard.html
├── tests/
│   └── test_rpc_endpoints.py
├── requirements.txt
├── LICENSE
├── README.md
└── CONTRIBUTING.md
```
| Improvement | Impact |
|---|---|
| Dynamic scoring (CPU + latency + queue) | Smarter, adaptive load balancing |
| Async job queues | Prevents overload; smoother concurrency |
| Live dashboard | Real-time observability |
| Heartbeat + TTL | Removes dead agents automatically |
| Secure registration | Safer for multi-network deployments |
| Retry + circuit breaker | Fault-tolerant RPC communication |
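As a closing sanity check, the scoring and queueing pieces above can be exercised locally without any network. The agent stats below are made-up sample values, not real mesh output:

```python
import asyncio

def compute_score(agent):
    # Lower score = better target (same weights as the routing section)
    return 0.6 * agent["cpu"] + 0.3 * agent["latency_ms"] + 0.1 * agent.get("queue_len", 0)

async def demo():
    stats = [
        {"name": "alpha", "cpu": 85.0, "latency_ms": 12.0, "queue_len": 5},
        {"name": "beta",  "cpu": 20.0, "latency_ms": 30.0, "queue_len": 1},
        {"name": "gamma", "cpu": 50.0, "latency_ms": 8.0,  "queue_len": 19},
    ]
    # beta wins: its low CPU dominates the weighted score
    target = min(stats, key=compute_score)
    queue = asyncio.Queue(maxsize=20)
    await queue.put(f"job for {target['name']}")
    return target["name"], queue.qsize()
```

Running `asyncio.run(demo())` routes the job to the least-loaded agent and leaves exactly one entry in its queue.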