Use this template when creating specifications for integrating Shimmy into applications.
- Application Name: [Your application name]
- Integration Type: [REST API, Library, CLI, WebSocket, etc.]
- Expected Traffic: [Requests per second/minute/hour]
- Model Requirements: [Specific models or model types needed]
- 5MB Constraint: Integration preserves Shimmy's lightweight nature
- Startup Speed: Integration doesn't impact sub-2-second startup
- Zero Python Dependencies: No Python runtime requirements
- API Compatibility: Uses standard OpenAI API endpoints
- CLI Access: Programmatic access via command-line interface
[Your Application] -> [Integration Layer] -> [Shimmy Instance] -> [Model Backend]
shimmy_config:
  bind: "127.0.0.1:11435" # Or auto-allocated port
  model_dirs: "path/to/your/models"
  features: ["huggingface", "llama"] # Optional: specify required features

SHIMMY_PORT=11435 # Optional: override default port
SHIMMY_MODEL_DIRS="/path/to/models" # Optional: additional model directories
SHIMMY_LOG_LEVEL=info # Optional: logging level

# List available models
curl http://localhost:11435/v1/models
# Generate completion
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# List models programmatically
shimmy list --short | grep "model-pattern"
# Start a temporary instance (for example, to run a health check against it)
shimmy serve --bind 127.0.0.1:0 & # Auto-allocate port
SHIMMY_PID=$!
# Cleanup
kill "$SHIMMY_PID"

- Memory: Base 5MB + model size + context buffer
- CPU: [Your CPU requirements]
- GPU: [Optional GPU requirements for acceleration]
- Network: [Bandwidth requirements for your use case]
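
The memory line above is simple arithmetic, and a small helper can make the estimate explicit when filling in this section. The following is a minimal sketch: the function name `estimate_memory_mb` and the example figures are hypothetical, and only the 5MB base comes from this template.

```shell
# Estimate total memory in MB: base footprint + model size + context buffer.
# The 5 MB base is taken from this template; the other inputs are your own estimates.
estimate_memory_mb() {
  local model_mb="$1" context_mb="$2" base_mb=5
  echo $(( base_mb + model_mb + context_mb ))
}

# Example with hypothetical figures: a 4096 MB model and a 512 MB context buffer
estimate_memory_mb 4096 512   # prints 4613
```

Treat the result as a starting point for the resource limits you record here, and confirm it against measured usage.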
- Single Instance: Direct API calls for low-traffic applications
- Load Balanced: Multiple Shimmy instances behind load balancer
- Containerized: Docker deployment with resource constraints
- Serverless: Lambda/Function-as-a-Service integration
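
For the load-balanced pattern, one possible starting point is to run several instances on consecutive local ports and point your load balancer at them. The sketch below only generates the bind addresses. The helper name `instance_binds` and the port range are assumptions for illustration, and the `shimmy serve --bind` call is left commented out so the snippet is safe to source.

```shell
# Print one bind address per instance, starting at a base port.
# instance_binds is a hypothetical helper, not part of Shimmy.
instance_binds() {
  local base_port="$1" count="$2" i=0
  while [ "$i" -lt "$count" ]; do
    echo "127.0.0.1:$(( base_port + i ))"
    i=$(( i + 1 ))
  done
}

# Start one instance per address (uncomment to run):
# for bind in $(instance_binds 11435 3); do
#   shimmy serve --bind "$bind" &
# done
```

The load balancer itself is out of scope here; any reverse proxy that can target a list of upstream addresses should work with the output.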
# Basic connectivity test
curl -f http://localhost:11435/health || exit 1
# Model availability test
MODEL_COUNT=$(curl -s http://localhost:11435/v1/models | jq '.data | length')
[ "$MODEL_COUNT" -gt 0 ] || exit 1
# Generation test
RESPONSE=$(curl -s -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test-model", "messages": [{"role": "user", "content": "test"}], "max_tokens": 10}')
echo "$RESPONSE" | jq -e '.choices[0].message.content' || exit 1

# Startup time test (uses a fixed spare port so the health endpoint can be polled)
shimmy serve --bind 127.0.0.1:11436 &
SHIMMY_PID=$!
# Time until the health endpoint responds; should complete in <2 seconds
time (until curl -sf http://localhost:11436/health > /dev/null; do sleep 0.1; done)
# Memory usage test
ps -o pid,vsz,rss -p "$SHIMMY_PID"
# VSZ and RSS should be reasonable for your constraints

- Model Not Found: Verify model is in discovery path
- Port Conflicts: Use auto-allocation or check port availability
- Memory Limits: Monitor resource usage, especially with large models
- GPU Issues: Check GPU detection and driver compatibility
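
When diagnosing a "Model Not Found" issue, a quick first step is to confirm that a matching file exists in the directory you configured. The sketch below assumes models are stored as files directly inside that directory and uses a `*.gguf` pattern purely as an example. The helper name `model_present` is hypothetical, and the check says nothing about whether Shimmy's own discovery will accept the file.

```shell
# Succeed if at least one regular file matching the pattern exists
# directly inside the given directory (no recursion).
model_present() {
  local dir="$1" pattern="$2"
  [ -n "$(find "$dir" -maxdepth 1 -type f -name "$pattern" 2>/dev/null | head -n 1)" ]
}

# Example (path and pattern are placeholders):
# model_present "/path/to/models" "*.gguf" || echo "no matching model file found"
```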
# Fallback to CPU if GPU fails
shimmy serve --bind 127.0.0.1:11435
# Shimmy automatically handles GPU fallback
# Model fallback hierarchy
# 1. Requested specific model
# 2. Similar model in same family
# 3. Default available model
# 4. Error with available alternatives

- Models Available: Required models discoverable in configured paths
- Ports Configured: Network ports available and firewall configured
- Resource Limits: Memory and CPU limits appropriate for model size
- Dependencies Met: Rust runtime and required libraries available
- Health Check: /health endpoint responding correctly
- Model Discovery: Models appearing in /v1/models endpoint
- Generation Test: Successful completion generation
- Performance Metrics: Startup time and memory usage within limits
- Startup Time: Should be <2 seconds
- Memory Usage: Base 5MB + model overhead
- Request Latency: Time to first token and total generation time
- Error Rate: Failed requests as percentage of total
- Model Load Time: Time to load models on demand
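
Two of the metrics above, error rate and request latency, can be derived from raw counts and timings you collect in your own application. The following is a minimal sketch: the function names are hypothetical, the latency percentile uses the nearest-rank definition (one of several in common use), and the input format of one latency value per line is an assumption.

```shell
# Error rate as a percentage of total requests, to one decimal place.
error_rate_pct() {
  local failed="$1" total="$2"
  awk -v f="$failed" -v t="$total" \
    'BEGIN { if (t == 0) print "0.0"; else printf "%.1f\n", 100 * f / t }'
}

# 95th-percentile latency (nearest rank) from one numeric value per line on stdin.
latency_p95() {
  sort -n | awk '{ v[NR] = $1 }
    END { if (NR == 0) exit 1; idx = int((95 * NR + 99) / 100); print v[idx] }'
}

# Example with hypothetical numbers: 3 failures out of 200 requests
error_rate_pct 3 200   # prints 1.5
```

Feed `latency_p95` whatever unit you record (seconds or milliseconds) and compare the result against your own SLA.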
alerts:
  startup_time: ">2s"
  memory_usage: ">expected_model_size + 100MB"
  error_rate: ">5%"
  request_latency_p95: ">5s" # Adjust based on your SLA

- Bind Address: Use 127.0.0.1 for local-only access
- Firewall Rules: Restrict access to required ports only
- TLS/HTTPS: Consider reverse proxy for HTTPS termination
- Authentication: Implement application-level auth as needed
- Model Validation: Verify model integrity before loading
- Access Controls: Limit model directory access permissions
- Input Sanitization: Validate prompts in your application layer
- Output Filtering: Review generated content as appropriate
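
Input sanitization in the application layer can begin with basic checks before a prompt is sent to the API. The sketch below strips control characters and rejects empty or overly long prompts. The function name `sanitize_prompt` and the 4000-character limit are hypothetical, and real validation rules depend on your application.

```shell
MAX_PROMPT_CHARS=4000  # hypothetical limit; adjust for your application

# Print a cleaned prompt, or return non-zero if it is empty or too long.
sanitize_prompt() {
  local prompt="$1" cleaned
  # Drop ASCII control characters except tab and newline
  cleaned=$(printf '%s' "$prompt" | tr -d '\000-\010\013-\037\177')
  [ -n "$cleaned" ] || return 1
  [ "${#cleaned}" -le "$MAX_PROMPT_CHARS" ] || return 1
  printf '%s\n' "$cleaned"
}

# Example:
# PROMPT=$(sanitize_prompt "$USER_INPUT") || { echo "prompt rejected" >&2; exit 1; }
```

This covers only the shape of the input; content-level policies still belong in your application logic.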
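This template is intended to keep integrations within Shimmy's constitutional constraints (binary size, startup time, no Python dependencies, API compatibility, and CLI access) while providing practical integration guidance.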
Customize sections based on your specific integration requirements.
Last Updated: September 17, 2025
Version: 1.0