LlamaNet is a decentralized inference swarm for large language models built on llama.cpp. It uses a Kademlia DHT for truly distributed node discovery without any central registry and supports both real-time streaming and traditional inference modes.
- Hardware-based node identity - Nodes maintain consistent IDs across restarts based on hardware fingerprinting
- Decentralized DHT-based node discovery using Kademlia protocol
- High-performance inference powered by llama.cpp
- Real-time streaming inference with Server-Sent Events (SSE)
- OpenAI-compatible API with streaming support
- Interactive web interface with live streaming responses
- Async Client Library for easy integration with async/await support
- Automatic node selection based on load and performance
- No single point of failure - fully distributed architecture
- Docker support for easy deployment
- Live streaming responses - see text appear as it's generated
- OpenAI-compatible streaming - works with existing OpenAI clients
- Web UI streaming - interactive chat interface with real-time updates
- Functional programming approach - event-driven architecture with no blocking loops
- OpenAI Compatible: /v1/chat/completions and /v1/completions with stream: true
- Web Interface: Toggle streaming on/off in the browser UI
- Hardware Fingerprinting: Node IDs are generated based on CPU, memory, MAC addresses, and system identifiers
- Persistent Identity: Nodes maintain the same ID across restarts and reboots
- Duplicate Prevention: Eliminates duplicate node registrations in the DHT network
- Multi-Node Support: Multiple nodes on the same hardware get unique IDs based on port numbers
- CPU Information: Core count and architecture
- Memory Configuration: Total RAM size
- Network Interfaces: MAC addresses from physical interfaces
- System Identifiers: Platform UUID and hostname
- Port Differentiation: Allows multiple nodes per machine
- Consistency Checks: Validates node ID matches current hardware on startup
- Hardware Change Detection: Automatically updates node ID when hardware changes
- Fallback Mechanisms: Uses legacy random IDs if hardware fingerprinting fails
- Debug Endpoints: /hardware and /debug/node-id for troubleshooting (a minimal fingerprinting sketch follows below)
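As a rough illustration of how a hardware-derived ID can stay stable across restarts (a minimal sketch only, not LlamaNet's actual implementation; the component list and SHA-256 hashing below are assumptions):

```python
import hashlib
import platform
import uuid

import psutil  # assumed here for CPU/memory introspection


def hardware_fingerprint(port: int = 8000) -> str:
    """Sketch: hash stable hardware traits plus the HTTP port into a node ID."""
    components = [
        str(psutil.cpu_count(logical=True)),                # CPU core count
        platform.machine(),                                  # CPU architecture
        str(psutil.virtual_memory().total // (1024 ** 3)),   # total RAM in GB
        f"{uuid.getnode():012x}",                            # a MAC address
        platform.platform(),                                 # platform identifier
        platform.node(),                                     # hostname
        str(port),                                           # differentiates nodes on one machine
    ]
    return hashlib.sha256("|".join(components).encode()).hexdigest()


print(hardware_fingerprint(8000))  # same output on every restart of the same machine/port
```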
- Python 3.8+
- LLM models in GGUF format (compatible with llama.cpp)
- Docker (optional, for containerized deployment)
LlamaNet's decentralized architecture makes it ideal for various scenarios where traditional centralized AI services fall short. Here are key use cases where LlamaNet provides significant advantages:
A global company with offices in New York, London, Tokyo, and São Paulo wants to provide AI assistance to employees while maintaining data sovereignty and reducing latency.
LlamaNet Solution:
- Deploy inference nodes in each office location
- Employees automatically connect to the nearest/fastest node
- No data leaves regional boundaries (GDPR/compliance friendly)
- Automatic failover if one office's node goes down
- Cost-effective scaling without vendor lock-in
# New York Office
python -m inference_node.server --model-path ./models/company-model.gguf --port 8000
# London Office
python -m inference_node.server --model-path ./models/company-model.gguf --port 8000 --bootstrap-nodes ny-office.company.com:8001
# Employees use OpenAI-compatible endpoint
openai.api_base = "http://local-llamanet.company.com/v1"

A manufacturing company needs AI for both cloud analytics and edge device monitoring, with seamless integration between environments.
LlamaNet Solution:
- Cloud nodes for heavy analytics workloads
- Edge nodes for real-time device monitoring
- Automatic load balancing based on request type
- Unified API across all environments
A university research department wants to share AI resources across multiple labs while allowing each lab to contribute their own compute resources.
LlamaNet Solution:
- Each lab contributes nodes with their available hardware
- Researchers access a unified AI service regardless of which lab's hardware is used
- Fair resource sharing with automatic load balancing
- Easy addition of new labs/nodes without central coordination
# Research Lab A contributes GPU node
python -m inference_node.server --model-path ./models/research-model.gguf --n-gpu-layers 35
# Research Lab B contributes CPU node
python -m inference_node.server --model-path ./models/research-model.gguf --bootstrap-nodes lab-a.university.edu:8001
# Researchers use unified client
client = Client(bootstrap_nodes="lab-a.university.edu:8001,lab-b.university.edu:8001")

An open-source community wants to create a shared AI inference network where members contribute compute resources and everyone benefits.
LlamaNet Solution:
- Community members run nodes with their spare compute
- Automatic discovery and load balancing
- No central authority or single point of failure
- Contributors can prioritize their own requests
Local businesses in a region want to share AI infrastructure costs while maintaining independence.
LlamaNet Solution:
- Each business runs nodes during their off-hours
- Shared access to AI capabilities without individual infrastructure costs
- Data stays within the cooperative network
- Easy scaling as more businesses join
A hospital network needs AI for medical imaging analysis while ensuring patient data never leaves their secure network.
LlamaNet Solution:
- Deploy nodes within each hospital's secure network
- AI processing happens locally with no external data transfer
- Automatic failover between hospitals in the network
- Compliance with HIPAA and other healthcare regulations
# Hospital A - Primary node
python -m inference_node.server --model-path ./models/medical-imaging.gguf
# Hospital B - Backup node
python -m inference_node.server --model-path ./models/medical-imaging.gguf --bootstrap-nodes hospital-a.network:8001
# Medical staff use secure internal endpoint
curl -X POST http://internal-ai.hospital.network/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "Analyze this X-ray image"}]}'

A startup needs AI capabilities but cannot afford expensive cloud AI services or dedicated infrastructure.
LlamaNet Solution:
- Start with a single node on existing hardware
- Scale by adding nodes as the business grows
- No vendor lock-in or expensive API costs
- OpenAI-compatible API for easy integration with existing tools
A distributed development team needs shared AI assistance for coding, documentation, and brainstorming.
LlamaNet Solution:
- Team members contribute nodes from their development machines
- Shared AI assistant available to all team members
- No external dependencies or API costs
- Works offline or in restricted network environments
Research stations, ships, or remote facilities need AI capabilities but have limited or unreliable internet connectivity.
LlamaNet Solution:
- Local nodes provide AI services without internet dependency
- Mesh network topology for redundancy
- Automatic synchronization when connectivity is available
- Works in completely offline environments
Organizations in regions where major AI services are blocked or restricted need local AI capabilities.
LlamaNet Solution:
- Completely self-hosted with no external dependencies
- Local language models and cultural customization
- No data sent to foreign servers
- Full control over AI capabilities and policies
Research institutions need AI integrated with their existing HPC clusters for scientific workloads.
LlamaNet Solution:
- Deploy nodes on HPC cluster nodes during idle time
- Integrate with existing job schedulers
- Specialized models for scientific domains
- Seamless scaling with cluster resources
Game developers want to provide AI-powered NPCs and content generation without relying on external services.
LlamaNet Solution:
- Deploy nodes in game server infrastructure
- Low-latency AI for real-time game interactions
- No external API dependencies or costs
- Custom models trained on game-specific content
- Clone the repository:
git clone https://github.com/machaao/llama-net.git
cd llama-net
- Install the requirements:
pip3 install -r requirements.txt
python -m inference_node.server --model-path ./models/your-model.gguf

This starts:
- HTTP API on port 8000 (inference endpoints)
- DHT node on port 8001 (peer discovery)
- Web UI at http://localhost:8000
- Hardware-based node ID automatically generated and stored
python -m inference_node.server \
--model-path ./models/your-model.gguf \
--port 8002 \
--dht-port 8003 \
--bootstrap-nodes localhost:8001

Note: Each additional node will automatically generate a unique hardware-based node ID that includes the port number, ensuring no conflicts when running multiple nodes on the same machine.
Open http://localhost:8000 in your browser for an interactive chat interface with:
- Real-time streaming responses
- Network status monitoring
- Streaming toggle for instant vs. complete responses
- Hardware fingerprint information in node details
LlamaNet automatically generates consistent node IDs based on your hardware:
# First run - generates and stores hardware-based node ID
python -m inference_node.server --model-path ./models/model.gguf
# Output: Generated hardware-based node ID: 5f3d6263b7009e54... from 6 hardware components
# Subsequent runs - uses the same stored node ID
python -m inference_node.server --model-path ./models/model.gguf
# Output: Using consistent stored hardware-based node ID: 5f3d6263b7009e54...

You can still specify custom node IDs if needed:
# Override with custom node ID
python -m inference_node.server \
--model-path ./models/model.gguf \
--node-id my-custom-node-id-12345678901234567890
# Or via environment variable
export NODE_ID=my-custom-node-id-12345678901234567890
python -m inference_node.server --model-path ./models/model.gguf

When hardware changes are detected:
# Hardware change detected - automatic update
# Output: Hardware fingerprint changed, generating new node ID
# Output: Updated stored node ID to: a1b2c3d4e5f6789a...

Run multiple nodes on the same machine with automatic port-based differentiation:
# Node 1 - gets hardware-based ID with port 8000
python -m inference_node.server --model-path ./models/model.gguf --port 8000
# Node 2 - gets different hardware-based ID with port 8002
python -m inference_node.server --model-path ./models/model.gguf --port 8002 --dht-port 8003 --bootstrap-nodes localhost:8001
# Node 3 - gets another unique hardware-based ID with port 8004
python -m inference_node.server --model-path ./models/model.gguf --port 8004 --dht-port 8005 --bootstrap-nodes localhost:8001

Each node gets a unique ID like:
- Node 1: 5f3d6263b7009e54... (hardware + port:8000)
- Node 2: 7a948cc229cb9c9d... (hardware + port:8002)
- Node 3: b8e1f4a5c6d7e8f9... (hardware + port:8004)
# View hardware fingerprint details
curl http://localhost:8000/hardware
# Validate hardware consistency
curl http://localhost:8000/hardware/validate
# Debug node ID across all components
curl http://localhost:8000/debug/node-id

The hardware fingerprint includes:
{
"mac_count": 2,
"has_system_uuid": true,
"cpu_count": 8,
"memory_gb": 16,
"platform": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
"hostname": "my-server",
"is_fallback": false
}

Hardware-based node IDs are automatically stored in:
- Linux/macOS: ~/.llamanet_node_id
- Windows: %USERPROFILE%\.llamanet_node_id
This ensures the same node ID is used across restarts.
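A minimal sketch of that persistence logic (the plain-text file format is an assumption; LlamaNet's actual storage code may differ):

```python
from pathlib import Path

NODE_ID_FILE = Path.home() / ".llamanet_node_id"  # location documented above


def load_or_store_node_id(generated_id: str) -> str:
    """Reuse the stored node ID if the file exists; otherwise persist the new one."""
    if NODE_ID_FILE.exists():
        return NODE_ID_FILE.read_text().strip()
    NODE_ID_FILE.write_text(generated_id)
    return generated_id
```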
LlamaNet requires models in GGUF format (GGML Universal Format). GGUF is the modern format used by llama.cpp for efficient inference.
The largest collection of GGUF models is available on Hugging Face:
Current Active Publishers
- bartowski - Most active, high-quality quantizations (DeepSeek-R1, Llama 3.1/3.2, Qwen 2.5/3)
- unsloth - Extremely popular, performance-optimized models (Qwen3, GPT-OSS series)
- MaziyarPanahi - Very active with comprehensive coverage (Gemma 3, Llama, Qwen)
- mradermacher - Comprehensive model coverage with excellent documentation
- DavidAU - Specialized in abliterated/uncensored variants
- Microsoft - Official Microsoft models (Phi-3.5, etc.)
- Meta - Official Meta Llama models
- Google - Official Gemma models
- Mistral AI - Official Mistral models
- Qwen - Official Qwen models from Alibaba
Some models are available for direct download from official sources.
# Install Hugging Face CLI
pip install huggingface_hub
# Download latest Llama 3.1 8B model (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
# Download Qwen 2.5 7B (excellent performance)
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF Qwen2.5-7B-Instruct-Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
# Download multiple quantizations of a model
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --include="*.gguf" --local-dir ./models/llama-3.1-8b

# Create models directory
mkdir -p models
# Download Llama 3.1 8B directly
wget -O models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
# Download Qwen 2.5 7B using curl
curl -L -o models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
"https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf"

# Clone entire model repository
git clone https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF models/llama-3.1-8b
# Clone specific files only
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF models/llama-3.1-8b
cd models/llama-3.1-8b
git lfs pull --include="*.Q4_K_M.gguf"

# Llama 3.2 3B (3B parameters) - Latest Meta model, excellent for testing
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir ./models
# Gemma 3 4B (4B parameters) - Latest Google model, great performance
huggingface-cli download MaziyarPanahi/gemma-3-4b-it-GGUF gemma-3-4b-it-Q4_K_M.gguf --local-dir ./models
# Phi-3.5 Mini (3.8B parameters) - Latest Microsoft model, great for testing
huggingface-cli download bartowski/Phi-3.5-mini-instruct-GGUF Phi-3.5-mini-instruct-Q4_K_M.gguf --local-dir ./models

# Llama 3.1 8B Instruct - Latest Meta model, excellent general purpose (RECOMMENDED)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
# Qwen3 7B Instruct - Latest Qwen model, outstanding performance
huggingface-cli download unsloth/Qwen3-7B-Instruct-GGUF Qwen3-7B-Instruct-Q4_K_M.gguf --local-dir ./models
# DeepSeek-R1 Distill 7B - Latest breakthrough model, excellent reasoning
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf --local-dir ./models
# Mistral 7B v0.3 - Latest Mistral model, high quality
huggingface-cli download bartowski/Mistral-7B-Instruct-v0.3-GGUF Mistral-7B-Instruct-v0.3-Q4_K_M.gguf --local-dir ./models
# CodeQwen 1.5 7B - Specialized for coding tasks
huggingface-cli download bartowski/CodeQwen1.5-7B-Chat-GGUF CodeQwen1.5-7B-Chat-Q4_K_M.gguf --local-dir ./models

# Qwen3 30B Instruct - Most popular large model, exceptional performance
huggingface-cli download unsloth/Qwen3-30B-A3B-GGUF Qwen3-30B-A3B-Q4_K_M.gguf --local-dir ./models
# DeepSeek-R1 Distill 32B - Latest breakthrough model, top reasoning
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-32B-abliterated-GGUF DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_K_M.gguf --local-dir ./models
# Llama 3.1 70B Instruct - Top-tier reasoning and knowledge
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf --local-dir ./models
# GPT-OSS 20B - Popular open-source alternative
huggingface-cli download unsloth/gpt-oss-20b-GGUF gpt-oss-20b-Q4_K_M.gguf --local-dir ./models

# Llama 3.1 405B Instruct - Frontier model capability (requires massive resources)
huggingface-cli download bartowski/Meta-Llama-3.1-405B-Instruct-GGUF Meta-Llama-3.1-405B-Instruct-Q4_K_M.gguf --local-dir ./models
# Qwen3 72B Instruct - Latest large Qwen model, exceptional capability
huggingface-cli download unsloth/Qwen3-72B-Instruct-GGUF Qwen3-72B-Instruct-Q4_K_M.gguf --local-dir ./models
# DeepSeek-R1 Distill 70B - Largest breakthrough model variant
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-70B-GGUF DeepSeek-R1-Distill-Qwen-70B-Q4_K_M.gguf --local-dir ./models
# Qwen 2.5 72B Instruct - Proven high performance for complex tasks
huggingface-cli download bartowski/Qwen2.5-72B-Instruct-GGUF Qwen2.5-72B-Instruct-Q4_K_M.gguf --local-dir ./models

GGUF models come in different quantization levels that trade off quality vs. size/speed:
| Quantization | Quality | Size | Speed | Use Case |
|---|---|---|---|---|
| Q2_K | Lower | Smallest | Fastest | Testing, very limited resources |
| Q3_K_M | Good | Small | Fast | Mobile, edge devices |
| Q4_K_M | Recommended | Medium | Balanced | Most use cases |
| Q5_K_M | High | Large | Slower | Quality-focused applications |
| Q6_K | Very High | Larger | Slower | Maximum quality needs |
| Q8_0 | Highest | Largest | Slowest | Research, benchmarking |
Recommendation: Start with Q4_K_M quantization for the best balance of quality, size, and speed.
Organize your models for easy management:
models/
├── llama-3.1-8b/
│ ├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
│ ├── Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
│ └── README.md
├── qwen-2.5-7b/
│ ├── Qwen2.5-7B-Instruct-Q4_K_M.gguf
│ └── README.md
├── phi-3.5-mini/
│ ├── Phi-3.5-mini-instruct-Q4_K_M.gguf
│ └── README.md
└── codeqwen-1.5-7b/
├── CodeQwen1.5-7B-Chat-Q4_K_M.gguf
└── README.md
| Model Size | RAM Required | VRAM (GPU) | CPU Cores | Use Case |
|---|---|---|---|---|
| 1B-3B | 4-8 GB | 2-4 GB | 2+ | Testing, development |
| 7B-8B | 8-16 GB | 6-10 GB | 4+ | General purpose (RECOMMENDED) |
| 14B-15B | 16-32 GB | 12-20 GB | 8+ | High quality responses |
| 32B-34B | 32-64 GB | 24-40 GB | 16+ | Professional use |
| 70B-72B | 64-128 GB | 48-80 GB | 32+ | Maximum capability |
| 405B | 256+ GB | 200+ GB | 64+ | Frontier model research |
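If you are unsure whether a downloaded GGUF file will fit on a machine, a rough sanity check is to compare the file size (plus some working overhead) against total RAM. The sketch below is a heuristic only; the 1.2× overhead factor is an assumption, and real memory use also depends on context size and GPU offload:

```python
import os

import psutil  # used here to read total system memory


def fits_in_ram(model_path: str, overhead: float = 1.2) -> bool:
    """Heuristic: the memory-mapped model plus ~20% working memory should fit in RAM."""
    model_bytes = os.path.getsize(model_path)
    return model_bytes * overhead < psutil.virtual_memory().total


path = "./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
print(f"{path}: {'likely fits' if fits_in_ram(path) else 'probably too large for this machine'}")
```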
Once you have a model downloaded:
# Start LlamaNet with Llama 3.1 8B (recommended)
python -m inference_node.server --model-path ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Or with environment variable
export MODEL_PATH=./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
python -m inference_node.server

Model not loading:
# Check if file exists and is readable
ls -la ./models/your-model.gguf
# Verify it's a valid GGUF file
file ./models/your-model.gguf

Out of memory errors:
# Try a smaller quantization
# Q4_K_M → Q3_K_M → Q2_K
# Or reduce context size
python -m inference_node.server --model-path ./model.gguf --n-ctx 1024

Slow inference:
# Enable GPU acceleration (if available)
python -m inference_node.server --model-path ./model.gguf --n-gpu-layers 35
# Increase batch size for throughput
python -m inference_node.server --model-path ./model.gguf --n-batch 512

import openai
# Configure to use LlamaNet
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "dummy-key" # Not used but required
# Streaming chat completion
response = openai.ChatCompletion.create(
model="llamanet",
messages=[{"role": "user", "content": "Hello!"}],
stream=True
)
for chunk in response:
if chunk.choices[0].delta.get("content"):
        print(chunk.choices[0].delta.content, end="", flush=True)

import requests
import json
response = requests.post("http://localhost:8000/v1/chat/completions",
json={
"model": "llamanet",
"messages": [{"role": "user", "content": "Explain machine learning"}],
"stream": True,
"max_tokens": 150
},
stream=True
)
for line in response.iter_lines():
if line.startswith(b'data: '):
data_str = line[6:].decode()
if data_str.strip() == '[DONE]':
break
data = json.loads(data_str)
if data["choices"][0]["delta"].get("content"):
            print(data["choices"][0]["delta"]["content"], end="", flush=True)

import asyncio
from client.api import Client
async def main():
client = Client(bootstrap_nodes="localhost:8001")
try:
# Discover available nodes
nodes = await client.dht_discovery.get_nodes()
print(f"Found {len(nodes)} nodes")
# Generate text
response = await client.generate(
prompt="What is LlamaNet?",
max_tokens=150,
temperature=0.7
)
if response:
print(f"Response: {response.text}")
print(f"Node: {response.node_id}")
print(f"Tokens: {response.tokens_generated}")
finally:
await client.close()
asyncio.run(main())

The built-in web interface (http://localhost:8000) provides:
- Real-time streaming - watch responses appear live
- Parameter controls - adjust max tokens, temperature
- Streaming toggle - enable/disable real-time responses
- Live node discovery - see all connected nodes
- Performance metrics - load, tokens/second, uptime
- Health status - monitor node availability
- DHT network status - peer connections and routing table
- Max Tokens: Control response length (1-2048)
- Temperature: Adjust creativity (0.0-2.0)
- Streaming Mode: Toggle real-time vs. complete responses
LlamaNet provides comprehensive Docker support with automatic GPU/CPU detection, multi-node orchestration, and production-ready configurations.
- Docker and Docker Compose installed
- NVIDIA Container Toolkit (for GPU support, optional)
- GGUF model file in the ./models/ directory
# Clone and start a 3-node network
git clone https://github.com/machaao/llama-net.git
cd llama-net
mkdir -p models
# Place your GGUF model in models/model.gguf
docker-compose -f docker/docker-compose.yml up -d

- Web UI: http://localhost:8000
- API: http://localhost:8000/v1/chat/completions
- Additional Nodes: http://localhost:8002, http://localhost:8004
- GPU Auto-Detection: Automatically detects NVIDIA GPUs and installs CUDA support
- CPU Fallback: Falls back to optimized CPU mode if no GPU available
- Smart Configuration: Optimizes settings based on detected hardware
# Start the complete network
docker-compose -f docker/docker-compose.yml up -d
# Scale to more nodes
docker-compose -f docker/docker-compose.yml up -d --scale inference1=3
# Monitor all nodes
docker-compose -f docker/docker-compose.yml logs -f

- Health checks and monitoring
- Automatic restarts and failover
- Resource optimization
- Security best practices
# Bootstrap node with GPU support
docker run -d \
--name llamanet-gpu \
--gpus all \
-p 8000:8000 \
-p 8001:8001/udp \
-v $(pwd)/models:/models:ro \
-e MODEL_PATH=/models/your-model.gguf \
-e HARDWARE_MODE=auto \
llamanet/inference:latest

# Additional CPU node
docker run -d \
--name llamanet-cpu \
-p 8002:8000 \
-p 8003:8001/udp \
-v $(pwd)/models:/models:ro \
-e MODEL_PATH=/models/your-model.gguf \
-e HARDWARE_MODE=cpu \
-e BOOTSTRAP_NODES=localhost:8001 \
llamanet/inference:latest

| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | required | Path to GGUF model file |
| HARDWARE_MODE | auto | auto, gpu, or cpu |
| N_GPU_LAYERS | auto | GPU layers (auto-optimized) |
| BOOTSTRAP_NODES | "" | Comma-separated bootstrap nodes |
| PORT | 8000 | HTTP API port |
| DHT_PORT | 8001 | DHT protocol port |
All standard OpenAI endpoints plus LlamaNet extensions:
# Health check
curl http://localhost:8000/health
# List network models
curl http://localhost:8000/v1/models/network
# Chat completion with load balancing
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llamanet",
"messages": [{"role": "user", "content": "Hello!"}],
"strategy": "load_balanced"
}'

# View all container logs
docker-compose -f docker/docker-compose.yml logs -f
# Check network status
curl http://localhost:8000/dht/status
# Monitor resources
docker stats
# Debug hardware detection
docker logs llamanet-gpu | grep -E "(GPU|Hardware)"

For detailed Docker documentation including:
- Hardware optimization strategies
- Production deployment patterns
- Scaling and load balancing
- Troubleshooting guides
- Security configurations
- Performance tuning
See the comprehensive Docker Documentation 📖
# Show network status
python -m tools.network_status localhost:8001
# Monitor network in real-time
python -m tools.monitor localhost:8001
# Quick health check
python -m tools.quick_check

Visit http://localhost:8000 for:
- Real-time network status
- Node performance metrics
- Interactive chat interface
- Streaming response testing
- GET /status - Node metrics
- GET /info - Node information
- GET /health - Health check
- GET /dht/status - DHT network status
- GET /v1/models - List models
- POST /v1/completions - Text completion (streaming supported; see the example below)
- POST /v1/chat/completions - Chat completion (streaming supported)
- GET / - Web UI dashboard
- GET /static/* - Static assets
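For example, the text-completion endpoint accepts an OpenAI-style request body (a quick sketch using requests; the prompt and parameter values are illustrative):

```python
import requests

# Send a non-streaming text completion to a LlamaNet node
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "llamanet",
        "prompt": "Write a haiku about distributed systems.",
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["text"])
```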
MODEL_PATH=/path/to/model.gguf # Required: Path to GGUF model
HOST=0.0.0.0 # Bind address
PORT=8000 # HTTP API port
DHT_PORT=8001 # DHT protocol port
NODE_ID=unique-node-id # Node identifier
BOOTSTRAP_NODES=ip:port,ip:port # Bootstrap nodes
HEARTBEAT_INTERVAL=10 # DHT publish interval
N_CTX=2048 # Context size
N_BATCH=8 # Batch size
N_GPU_LAYERS=0                   # GPU layers (0 = CPU only)

python -m inference_node.server \
--model-path ./models/model.gguf \
--host 0.0.0.0 \
--port 8000 \
--dht-port 8001 \
--node-id my-node \
--bootstrap-nodes localhost:8001

- Server-Sent Events (SSE) for real-time communication
- Functional programming approach with async generators
- Event-driven UI with real-time DOM updates
- Non-blocking streaming using async/await patterns
- Kademlia protocol for distributed hash table
- Automatic node discovery without central registry
- Load balancing based on node performance
- Fault tolerance with automatic failover
- Drop-in replacement for OpenAI API
- Streaming support with identical format
- Chat and completion endpoints
- Compatible with existing tools (curl, Postman, OpenAI libraries)
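The examples in this README use the legacy openai 0.x interface; with the newer openai>=1.0 Python client the same idea looks like this (the base_url and dummy key mirror the earlier examples):

```python
from openai import OpenAI

# Point the standard OpenAI client at a LlamaNet node
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

stream = client.chat.completions.create(
    model="llamanet",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```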
- Immediate feedback - users see responses instantly
- Better UX - no waiting for complete generation
- Lower perceived latency - streaming feels faster
- Cancellable requests - stop generation early
- Distributed load across multiple nodes
- Automatic scaling as nodes join/leave
- Smart routing to least loaded nodes
- Fault tolerance with automatic retry
The web UI uses a custom StreamUI class that:
- Handles Server-Sent Events from both LlamaNet and OpenAI endpoints
- Updates the chat interface in real-time
- Manages streaming state and error handling
- Provides visual feedback with animated cursors
The server implements streaming via:
- Async generators for token-by-token generation
- FastAPI StreamingResponse for HTTP streaming
- OpenAI-compatible format for existing client compatibility
- Functional programming patterns avoiding blocking loops
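A minimal sketch of that pattern (not the actual LlamaNet server code; the token generator below is a stand-in for the llama.cpp call, and the chunk schema is simplified):

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_tokens(prompt: str):
    """Stand-in for token-by-token output from llama.cpp."""
    for token in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)  # simulate generation latency
        yield token


@app.post("/v1/completions")
async def completions(body: dict):
    async def sse():
        async for token in generate_tokens(body.get("prompt", "")):
            chunk = {"choices": [{"text": token}]}
            yield f"data: {json.dumps(chunk)}\n\n"  # Server-Sent Events framing
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")
```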
Developers building AI-powered applications need reliable, cost-effective inference for development and testing.
LlamaNet Solution:
- Local development environment with OpenAI-compatible API
- No API rate limits or costs during development
- Easy transition from development to production
- Test with different models and configurations
# Development setup
python -m inference_node.server --model-path ./models/dev-model.gguf
# Application code (works with both LlamaNet and OpenAI)
import openai
openai.api_base = "http://localhost:8000/v1" # LlamaNet for dev
# openai.api_base = "https://api.openai.com/v1"  # OpenAI for production

Organizations handling sensitive data (legal, financial, personal) need AI capabilities without exposing data to third parties.
LlamaNet Solution:
- All processing happens within organization's infrastructure
- No data sent to external AI services
- Full audit trail and control over AI operations
- Compliance with data protection regulations
Companies need AI assistance for strategic planning without revealing sensitive information to competitors or AI service providers.
LlamaNet Solution:
- Private AI network within company infrastructure
- Custom models trained on proprietary data
- No external data leakage or vendor dependencies
- Complete control over AI capabilities and access
| Scenario Type | Key Benefits |
|---|---|
| Enterprise | Cost reduction, data sovereignty, compliance, scalability |
| Research | Resource sharing, collaboration, specialized models |
| Community | Shared costs, democratic access, no central authority |
| Healthcare | Privacy compliance, local processing, secure networks |
| Startups | Low cost, no vendor lock-in, gradual scaling |
| Remote/Restricted | Offline capability, no external dependencies |
| Development | No API costs, unlimited testing, easy deployment |
| Privacy-Focused | Data control, compliance, competitive advantage |
- Identify Your Scenario: Match your needs to the scenarios above
- Plan Your Network: Decide on node locations and bootstrap strategy
- Choose Your Model: Select appropriate GGUF models for your use case
- Deploy Incrementally: Start with one node, add more as needed
- Integrate Applications: Use OpenAI-compatible API for easy integration
LlamaNet's flexibility allows it to adapt to virtually any scenario where distributed, private, or cost-effective AI inference is needed.
LlamaNet Network Formation
Step 1: Bootstrap Node Starts
┌─────────────────┐
│ Bootstrap Node │ ──► Starts DHT Network
│ (Node A) │ Creates initial routing table
└─────────────────┘
│
▼
┌─────────────────┐
│ DHT Network │
│ Storage Keys: │
│ • model:llama │
│ • node:abc123 │
│ • all_nodes │
└─────────────────┘
│
▼
Step 2: Additional Nodes Join
Node B ──► Connects to Bootstrap ──► Joins DHT
Node C ──► Connects to Bootstrap ──► Joins DHT
Node D ──► Connects to Node B ──► Joins DHT
Client Discovery Sequence:
1. Client Query:
Client ──► DHT Network: "Find model:llama-7b"
2. DHT Response:
DHT Network ──► Client: [Node1, Node2, Node3]
3. Health Checks:
Client ──► Node1: /status ──► Response: Load=0.3, TPS=15.2
Client ──► Node2: /status ──► Response: Load=0.7, TPS=12.1
Client ──► Node3: /status ──► Response: Load=0.1, TPS=18.5
4. Node Selection:
Client selects Node3 (lowest load)
5. Inference Request:
Client ──► Node3: /v1/chat/completions ──► Generated Text Response
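A compact sketch of steps 3–4 (health check plus lowest-load selection); the /status response fields mirror the example values above, but the exact schema is an assumption:

```python
import requests


def pick_node(nodes):
    """Query each candidate's /status and pick the one reporting the lowest load."""
    healthy = []
    for node in nodes:
        try:
            status = requests.get(
                f"http://{node['ip']}:{node['port']}/status", timeout=2
            ).json()
            healthy.append((status.get("load", 1.0), node))
        except requests.RequestException:
            continue  # skip unreachable nodes (failover)
    if not healthy:
        raise RuntimeError("no healthy nodes found")
    return min(healthy, key=lambda pair: pair[0])[1]


best = pick_node([
    {"ip": "192.168.1.10", "port": 8000},
    {"ip": "192.168.1.11", "port": 8000},
])
print("selected node:", best)
```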
DHT Storage Structure:
┌─────────────────────────────────────────────────────────────┐
│ Key: "model:llama-7b" │
│ Value: [ │
│ {node_id: "abc123", ip: "192.168.1.10", port: 8000}, │
│ {node_id: "def456", ip: "192.168.1.11", port: 8000}, │
│ {node_id: "ghi789", ip: "192.168.1.12", port: 8000} │
│ ] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Key: "node:abc123" │
│ Value: { │
│ node_id: "abc123", ip: "192.168.1.10", port: 8000, │
│ model: "llama-7b", load: 0.3, tps: 15.2, uptime: 3600 │
│ } │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Key: "all_nodes" │
│ Value: [All active nodes regardless of model] │
└─────────────────────────────────────────────────────────────┘
Client Request Processing Flow
┌─────────────┐
│ Client │
│ Request │
└──────┬──────┘
│
▼
┌─────────────┐
│ Select API │
│ Mode │ ──► OpenAI API (/v1/chat/completions)
└──────┬──────┘
│
▼
┌─────────────┐
│ DHT Node │
│ Discovery │ ──► Query: "model:llama" or "all_nodes"
└──────┬──────┘
│
▼
┌─────────────┐
│ Node │
│ Selection │ ──► Load Balancing (lowest load)
└──────┬──────┘ ──► Health Check (/status)
│
▼
┌─────────────┐
│ HTTP │
│ Request │ ──► POST /v1/chat/completions
└──────┬──────┘
│
▼
┌─────────────┐
│ LLM │
│ Inference │ ──► llama.cpp processing
└──────┬──────┘
│
▼
┌─────────────┐
│ Response │
│ Formatting │ ──► LlamaNet or OpenAI format
└─────────────┘
LlamaNet Network Topology
Internet/Local Network
│
┌──────────────────┼──────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Node A │◄──────►│ Node B │◄──────►│ Node C │
│ HTTP: │ DHT │ HTTP: │ DHT │ HTTP: │
│ :8000 │ Gossip │ :8002 │ Gossip │ :8004 │
│ DHT: │ │ DHT: │ │ DHT: │
│ :8001 │ │ :8003 │ │ :8005 │
└─────────┘ └─────────┘ └─────────┘
▲ ▲ ▲
│ HTTP API │ HTTP API │ HTTP API
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Client 1 │ │Client 2 │ │Web UI │
│Python │ │OpenAI │ │Browser │
│API │ │Library │ │ │
└─────────┘ └─────────┘ └─────────┘
Legend:
━━━ HTTP API Connections (Inference)
◄─► DHT Protocol Connections (Discovery)
OpenAI Compatibility Architecture
┌─────────────────┐
│ OpenAI │
│ Client │ ──► Uses standard OpenAI library
│ Application │
└─────────┬───────┘
│
▼
┌─────────────────┐
│ LlamaNet │
│ Compatibility │ ──► /v1/models
│ Endpoints │ ──► /v1/completions
└─────────┬───────┘ ──► /v1/chat/completions
│
▼
┌─────────────────┐
│ Request │
│ Translation │ ──► OpenAI format → LlamaNet format
└─────────┬───────┘
│
▼
┌─────────────────┐
│ LlamaNet │
│ Core Engine │ ──► DHT Discovery
└─────────┬───────┘ ──► Node Selection
│ ──► Load Balancing
▼
┌─────────────────┐
│ llama.cpp │
│ Inference │ ──► Model Processing
└─────────┬───────┘ ──► Text Generation
│
▼
┌─────────────────┐
│ Response │
│ Translation │ ──► LlamaNet format → OpenAI format
└─────────────────┘
Web UI Component Architecture
┌─────────────────┐
│ Web Browser │
└─────────┬───────┘
│ HTTP Request
▼
┌─────────────────┐
│ Static Files │
│ Server │ ──► Bootstrap CSS
└─────────┬───────┘ ──► Font Awesome Icons
│ ──► Custom CSS
▼ ──► JavaScript App
┌─────────────────┐
│ JavaScript │
│ Application │ ──► Network Monitor
└─────────┬───────┘ ──► Chat Interface
│ ──► API Mode Selector
▼
┌─────────────────┐
│ Backend API │
│ Endpoints │ ──► /dht/status (Network Info)
└─────────┬───────┘ ──► /v1/chat/completions (OpenAI)
│
▼
┌─────────────────┐
│ Response │
│ Processing │ ──► Markdown Rendering
└─────────┬───────┘ ──► Syntax Highlighting
│ ──► Chat Display
▼
┌─────────────────┐
│ User Interface │
│ Updates │ ──► Real-time Chat
└─────────────────┘ ──► Network Status
──► Performance Metrics
End-to-End Data Flow
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │───►│ DHT Network │───►│ Node Select │───►│ Inference │
│ │ │ │ │ │ │ │
│ • Web UI │ │ • Discovery │ │ • Load Bal. │ │ • llama.cpp │
│ • API Call │ │ • Routing │ │ • Failover │ │ • Generate │
│ • OpenAI │ │ • Storage │ │ • Health │ │ • Response │
│ • Python │ │ • Gossip │ │ • Metrics │ │ • Tokens │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
▲ │
│ Response Flow │
└─────────────────────────────────────────────────────────┘
Request Types:
• Text Generation • Chat Completion • Model Listing
• Node Discovery • Health Checks • Status Updates
Node Startup Sequence:
┌─────────────────┐
│ 1. Load Config │ ──► Parse CLI args & environment
└─────────┬───────┘
▼
┌─────────────────┐
│ 2. Init LLM │ ──► Load GGUF model with llama.cpp
└─────────┬───────┘
▼
┌─────────────────┐
│ 3. Start DHT │ ──► Create Kademlia node
└─────────┬───────┘
▼
┌─────────────────┐
│ 4. Join Network │ ──► Connect to bootstrap nodes
└─────────┬───────┘
▼
┌─────────────────┐
│ 5. HTTP Server │ ──► Serve API & Web UI
└─────────┬───────┘
▼
┌─────────────────┐
│ 6. Publish Info │ ──► Announce to DHT every 10s
└─────────────────┘
Client Discovery Process:
┌─────────────────┐
│ 1. DHT Client │ ──► Initialize Kademlia client
└─────────┬───────┘
▼
┌─────────────────┐
│ 2. Query Net │ ──► Search by model or all nodes
└─────────┬───────┘
▼
┌─────────────────┐
│ 3. Health Check │ ──► Verify availability & performance
└─────────┬───────┘
▼
┌─────────────────┐
│ 4. Load Balance │ ──► Select optimal node
└─────────┬───────┘
▼
┌─────────────────┐
│ 5. Send Request │ ──► HTTP call to selected node
└─────────┬───────┘
▼
┌─────────────────┐
│ 6. Handle Resp │ ──► Process result or failover
└─────────────────┘
Bootstrap Node (8001) ← Node 1 (8003) ← Node 2 (8005)
↑ ↑ ↑
Client connects Joins DHT Joins DHT
- model:{model_name} - Find nodes serving specific models
- node:{node_id} - Find specific nodes by ID
- all_nodes - Discover any available nodes
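A sketch of querying those keys with the kademlia Python library (ports and key values are illustrative; LlamaNet's own discovery layer wraps this differently):

```python
import asyncio
import json

from kademlia.network import Server


async def find_nodes(model_name: str):
    dht = Server()
    await dht.listen(8470)                            # local DHT port for this client
    await dht.bootstrap([("127.0.0.1", 8001)])        # any known bootstrap node
    raw = await dht.get(f"model:{model_name}")        # e.g. "model:llama-7b"
    dht.stop()
    return json.loads(raw) if raw else []


print(asyncio.run(find_nodes("llama-7b")))
```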
- Load Configuration - Parse CLI args and environment variables
- Initialize LLM - Load GGUF model with llama.cpp
- Start DHT Node - Create Kademlia node on available port
- Join Network - Connect to bootstrap nodes if specified
- Start HTTP Server - Serve inference API and web UI
- Begin Publishing - Announce availability to DHT every 10 seconds
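The publishing step can be pictured as a small periodic task (a sketch only; the IP, port, and load values are placeholders, and the real node also refreshes the model and all_nodes keys):

```python
import asyncio
import json
import time

from kademlia.network import Server


async def publish_loop(dht: Server, node_id: str, model: str, interval: int = 10):
    """Re-announce this node's info to the DHT every `interval` seconds."""
    start = time.time()
    while True:
        info = {"node_id": node_id, "ip": "192.168.1.10", "port": 8000,
                "model": model, "load": 0.3, "uptime": int(time.time() - start)}
        await dht.set(f"node:{node_id}", json.dumps(info))
        await asyncio.sleep(interval)
```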
- Create DHT Client - Initialize Kademlia client
- Query Network - Search for nodes by model or all nodes
- Health Check - Verify node availability and performance
- Load Balancing - Select optimal node based on load/TPS
- Send Request - Make HTTP call to selected node
- Handle Response - Process result or failover to backup node
- Receive Request - HTTP endpoint receives generation request
- Validate Input - Check prompt, parameters, and format
- Queue Processing - Add to inference queue if needed
- LLM Generation - Call llama.cpp with specified parameters
- Format Response - Convert to LlamaNet or OpenAI format
- Update Metrics - Track tokens, timing, and load statistics
- Return Result - Send formatted response to client
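Condensed into code, the flow looks roughly like this (a sketch; llm.generate, the metrics object, and the response fields are placeholders, not LlamaNet's real internals):

```python
import time


def handle_completion(llm, metrics, prompt: str, max_tokens: int = 128):
    """Sketch of the request flow: validate, generate, format, and record metrics."""
    if not prompt or max_tokens <= 0:
        raise ValueError("invalid request")                 # validate input
    started = time.time()
    text, n_tokens = llm.generate(prompt, max_tokens)       # llama.cpp generation (placeholder)
    metrics.record(tokens=n_tokens, seconds=time.time() - started)  # update metrics
    return {                                                # OpenAI-style response
        "choices": [{"text": text, "finish_reason": "stop"}],
        "usage": {"completion_tokens": n_tokens},
    }
```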
This architecture ensures high availability, automatic scaling, and fault tolerance while maintaining compatibility with existing OpenAI-based applications.
No nodes discovered:
# Check DHT status
curl http://localhost:8000/dht/status
# Verify bootstrap nodes
python -m tools.network_status localhost:8001

Web UI not loading:
# Check if static files are served
curl http://localhost:8000/static/style.css

Streaming responses cut off:
- Ensure your HTTP client supports streaming
- Check for proxy/firewall interference
- Verify Content-Type headers are correct
# Check hardware fingerprint
curl http://localhost:8000/hardware
# Validate node ID consistency
curl http://localhost:8000/hardware/validate
# Debug node ID mismatches
curl http://localhost:8000/debug/node-id
# Force hardware revalidation
curl -X POST http://localhost:8000/hardware/update
# Fix node ID mismatches (emergency)
curl -X POST http://localhost:8000/debug/fix-node-id

# Enable debug logging
export LOG_LEVEL=DEBUG
python -m inference_node.server --model-path ./model.gguf

LlamaNet's decentralized architecture with streaming support makes it ideal for:
- Real-time AI assistance with immediate feedback
- Multi-office deployment with local streaming nodes
- Compliance-friendly - data never leaves your infrastructure
- Interactive research tools with streaming responses
- Collaborative AI across multiple departments
- Resource sharing with real-time load balancing
- Community-driven AI with shared streaming infrastructure
- Real-time collaboration tools and assistants
- Cost-effective scaling with streaming efficiency
- Private streaming AI within secure networks
- No external dependencies for sensitive data processing
- Real-time processing without cloud latency
- Choose Your Deployment: Single node for testing, multi-node for production
- Enable Streaming: Use the web UI or API endpoints with stream: true
- Configure Parameters: Adjust max tokens, temperature for your use case
- Monitor Performance: Use the web dashboard to track streaming performance
- Scale as Needed: Add more nodes for increased capacity
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure streaming works in both modes
- Submit a pull request
llama.cpp - The foundation of LlamaNet's inference capabilities
- Created by Georgi Gerganov and the llama.cpp community
- Provides efficient CPU and GPU inference for LLM models
- Enables GGUF format support and quantization
- Powers the core text generation in every LlamaNet node
llama-cpp-python - Python bindings for llama.cpp
- Created by Andrei Betlen
- Provides the Python interface used by LlamaNet's LLM wrapper
- Enables seamless integration between Python and llama.cpp
- Kademlia - Distributed hash table implementation
- FastAPI - Modern web framework for the API layer
- Uvicorn - ASGI server for high-performance serving
- The llama.cpp community for continuous improvements and optimizations
- Meta AI for releasing the Llama model family
- All model publishers on Hugging Face providing GGUF quantizations
- The open-source AI community for making decentralized AI possible
Apache License 2.0 - see LICENSE file for details.
This project was built with love using MACH-AI
