A flexible and efficient Rust-based inference service for sentence transformer models using the Candle ML framework. Built with clean architecture principles and designed for concurrent access with Arc<RwLock<T>>.
- Fast Inference: Powered by the Candle ML framework for efficient CPU/GPU inference
- Model Switching: Runtime model switching without service restart
- Clean Architecture: Domain-driven design with clear separation of concerns
- Thread Safe: Concurrent model access using Arc<RwLock<T>>
- Multiple Formats: Support for JSON, CSV, and human-readable output
- Multiple Models: Pre-configured popular sentence transformer models
- Configurable: TOML-based configuration with environment variable support
- Rust 1.70+ (2021 edition)
- Optional: CUDA toolkit for GPU acceleration
# Clone and build
git clone <your-repo>
cd inference
cargo build --release
# Start the API server
cargo run

# Using docker-compose (recommended)
docker-compose up --build
# Or using Docker directly
docker build -t inference-api .
docker run -p 8080:8080 inference-api

The server will start on http://127.0.0.1:8080 by default.
curl http://localhost:8080/health

curl -X POST http://localhost:8080/encode \
-H "Content-Type: application/json" \
-d '{"text": "Hello, world!", "normalize": true}'curl -X POST http://localhost:8080/encode/batch \
-H "Content-Type: application/json" \
-d '{
"texts": ["Text 1", "Text 2", "Text 3"],
"normalize": true
}'

# Get current model info
curl http://localhost:8080/model/info
# Switch model
curl -X POST http://localhost:8080/model/switch \
-H "Content-Type: application/json" \
-d '{
"model_id": "sentence-transformers/all-mpnet-base-v2",
"tokenizer_repo": "sentence-transformers/all-mpnet-base-v2",
"max_sequence_length": 512,
"device": "cpu"
}'

See examples/api_usage.md for detailed API documentation and client examples.
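Beyond curl, the same endpoints can be called from Rust. Below is a minimal sketch assuming the `reqwest` crate (with the `blocking` and `json` features) and `serde_json`; the shape of the response body is an assumption here, so check examples/api_usage.md for the actual schema.

```rust
// Assumed Cargo.toml entries:
// reqwest = { version = "0.11", features = ["blocking", "json"] }
// serde_json = "1"
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // Single-text encode request, mirroring the curl example above.
    let response: Value = client
        .post("http://localhost:8080/encode")
        .json(&json!({ "text": "Hello, world!", "normalize": true }))
        .send()?
        .json()?;

    // The "embedding" field name is an assumption; consult examples/api_usage.md
    // for the real response schema.
    println!("{}", response["embedding"]);
    Ok(())
}
```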
The project follows clean architecture principles:
src/
├── domain/                      # Business logic and entities
│   ├── entities.rs              # Core data structures
│   └── traits.rs                # Domain interfaces
├── infrastructure/              # External dependencies
│   ├── config.rs                # Configuration management
│   ├── model_loader.rs          # Candle model loading
│   └── sentence_transformer.rs  # ML inference
├── application/                 # Use cases and orchestration
│   ├── services.rs              # Service composition
│   └── use_cases.rs             # Business use cases
└── presentation/                # User interfaces
    ├── cli.rs                   # Command-line interface
    └── handlers.rs              # Command handlers
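As a rough, hypothetical illustration of the dependency direction (these names are not the actual contents of traits.rs), the domain layer defines an interface, the infrastructure layer implements it with Candle, and the application layer depends only on the interface:

```rust
// Hypothetical sketch of the layering; the real interfaces live in src/domain/traits.rs.

// Domain layer: a pure interface with no Candle, tokenizer, or HTTP types.
pub trait Encoder {
    fn encode(&self, text: &str) -> anyhow::Result<Vec<f32>>;
}

// Infrastructure layer: a Candle-backed implementation would live here.
pub struct CandleEncoder;

impl Encoder for CandleEncoder {
    fn encode(&self, _text: &str) -> anyhow::Result<Vec<f32>> {
        // Real code would tokenize the text, run the model, and pool token embeddings.
        Ok(vec![0.0; 384])
    }
}

// Application layer: use cases depend only on the domain trait, not on Candle.
pub fn encode_text(encoder: &dyn Encoder, text: &str) -> anyhow::Result<Vec<f32>> {
    encoder.encode(text)
}

fn main() -> anyhow::Result<()> {
    let embedding = encode_text(&CandleEncoder, "hello")?;
    println!("dimensions: {}", embedding.len());
    Ok(())
}
```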
Edit config/default.toml:
[model]
model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer_repo = "sentence-transformers/all-MiniLM-L6-v2"
max_sequence_length = 512
device = "cpu"
[server]
host = "127.0.0.1"
port = 8080
workers = 4

Override configuration with environment variables:
export INFERENCE_MODEL__MODEL_ID="sentence-transformers/all-mpnet-base-v2"
export INFERENCE_MODEL__DEVICE="cuda"
export INFERENCE_SERVER__PORT="3000"
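For reference, here is a minimal sketch of how this TOML-plus-environment layering could be loaded, assuming the `config` and `serde` crates and the `INFERENCE` prefix with `__` separators shown above; the actual loader in src/infrastructure/config.rs may differ.

```rust
// Assumed Cargo.toml entries: config = "0.13", serde = { version = "1", features = ["derive"] }
use config::{Config, Environment, File};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ModelConfig {
    model_id: String,
    tokenizer_repo: String,
    max_sequence_length: usize,
    device: String,
}

#[derive(Debug, Deserialize)]
struct ServerConfig {
    host: String,
    port: u16,
    workers: usize,
}

#[derive(Debug, Deserialize)]
struct Settings {
    model: ModelConfig,
    server: ServerConfig,
}

fn load_settings() -> Result<Settings, config::ConfigError> {
    Config::builder()
        // Base values from config/default.toml.
        .add_source(File::with_name("config/default"))
        // INFERENCE_MODEL__MODEL_ID overrides [model].model_id, and so on.
        .add_source(Environment::with_prefix("INFERENCE").separator("__"))
        .build()?
        .try_deserialize()
}

fn main() {
    println!("{:?}", load_settings());
}
```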
See config/models.toml for ready-to-use model configurations:

- all-MiniLM-L6-v2: Fast, general-purpose (default)
- all-mpnet-base-v2: Higher quality, slower
- multilingual-e5-base: 100+ languages support
- bge-small-en-v1.5: Compact English model
- gte-large: Superior performance, larger size
# Encode query and documents
cargo run -- encode -t "machine learning algorithms" --output-format json > query.json
cargo run -- encode-batch \
-t "supervised learning with neural networks" \
-t "unsupervised clustering techniques" \
-t "reinforcement learning applications" \
--output-format json > docs.json

# Compare embeddings from different models
cargo run -- encode -t "artificial intelligence" --output-format json
cargo run -- switch-model -m "sentence-transformers/all-mpnet-base-v2"
cargo run -- encode -t "artificial intelligence" --output-format jsonEnable CUDA support:
Enable CUDA support:

# Build with CUDA
cargo build --release --features cuda
# Use GPU
cargo run -- switch-model -m "your-model" --device cuda

The service uses Arc<RwLock<T>> for thread-safe model access:
- Multiple readers can access the model simultaneously
- Model switching requires exclusive write access
- Zero-copy model sharing between threads
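A minimal sketch of this pattern with simplified stand-in types (the real service guards the loaded Candle model, not a plain struct): concurrent encode requests only take a read guard, while switching models takes the write lock.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Stand-in for the loaded model.
struct Model {
    name: String,
}

fn main() {
    let model = Arc::new(RwLock::new(Model { name: "all-MiniLM-L6-v2".into() }));

    // Multiple readers: concurrent encode requests share the lock.
    let readers: Vec<_> = (0..4)
        .map(|i| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                let guard = model.read().unwrap();
                println!("request {i} encoding with {}", guard.name);
            })
        })
        .collect();
    for r in readers {
        r.join().unwrap();
    }

    // Model switch: exclusive write access swaps the model in place.
    model.write().unwrap().name = "all-mpnet-base-v2".into();
    println!("switched to {}", model.read().unwrap().name);
}
```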
# Debug build
cargo build
# Release build
cargo build --release
# With specific features
cargo build --features cuda
cargo build --features metal # macOS
cargo build --features mkl   # Intel MKL

# Run tests
cargo test
# Run with logging
RUST_LOG=debug cargo test

cargo clippy
cargo fmt

- Candle: ML framework for Rust
- Tokenizers: HuggingFace tokenizers
- Tokio: Async runtime
- Clap: Command-line parsing
- Serde: Serialization
- Anyhow: Error handling
- Tracing: Structured logging
[Your License Here]
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- HTTP REST API server
- WebSocket streaming
- Model quantization support
- Batch processing optimization
- Metrics and monitoring
- Docker containerization
- Kubernetes deployment