Version: 0.1.0
Last Updated: 2025-12-28
This document provides comprehensive API documentation for all public modules in the Loci engine.
- Core Engine API
- Backend API
- GGUF Loader API
- Sampling API
- Paged Attention API
- Constraint Sampling API
- Suspend/Resume API
- Radix Tree Caching API
- Plugin System API
- Model Registry API
- LoRA API
- Model Encryption API
- Multi-Tenancy API
- Multimodal API
- Quantization API
- Kernel Fusion API
- Configuration API
## Core Engine API

### `LociEngine`

Main inference engine for running AI models, built on llama.cpp FFI bindings for maximum performance.
```rust
pub fn new(config: EngineConfig) -> anyhow::Result<Self>
```

Creates a new Loci engine instance with the specified configuration.
Example:

```rust
use loci::{LociEngine, EngineConfig};

let config = EngineConfig {
    model_path: "path/to/model.gguf".to_string(),
    n_gpu_layers: -1, // Use all GPU layers
    temperature: 0.7, // Sampling temperature
    top_k: 40,        // Top-K sampling
    top_p: 0.9,       // Top-P sampling
    ..Default::default()
};
let engine = LociEngine::new(config)?;
```

```rust
pub fn generate(&self, prompt: &str, max_tokens: usize) -> anyhow::Result<String>
```

Generates text based on the given prompt using the sampler configured in `EngineConfig`.
Parameters:

- `prompt`: Input text prompt
- `max_tokens`: Maximum number of tokens to generate

Returns: Generated text as a `String`.
Example:

```rust
let response = engine.generate("Once upon a time", 100)?;
println!("{}", response);
```

Note: Sampling parameters (`temperature`, `top_k`, `top_p`) are configured in `EngineConfig` during engine creation. The engine uses llama.cpp's internal sampler for optimal performance.
```rust
pub fn stats(&self) -> PerformanceStats
```

Returns current performance statistics.

Example:

```rust
let stats = engine.stats();
println!("Prompt eval: {:.2} t/s", stats.prompt_tokens_per_second());
println!("Generation: {:.2} t/s", stats.eval_tokens_per_second());
```

```rust
pub fn backend_name(&self) -> String
```

Returns the name of the active compute backend.

Example:

```rust
let backend = engine.backend_name();
println!("Using backend: {}", backend);
```

### `EngineConfig`

Configuration for the Loci engine.
```rust
pub struct EngineConfig {
    pub model_path: String,
    pub n_ctx: u32,          // Default: 2048
    pub n_gpu_layers: i32,   // Default: 0 (-1 = all layers)
    pub n_batch: u32,        // Default: 512
    pub n_threads: u32,      // Default: num_cpus
    pub temperature: f32,    // Default: 0.8
    pub top_k: i32,          // Default: 40
    pub top_p: f32,          // Default: 0.9
    pub repeat_penalty: f32, // Default: 1.1
}
```

Field Descriptions:
| Field | Type | Default | Description |
|---|---|---|---|
| `model_path` | `String` | `""` | Path to the GGUF model file |
| `n_ctx` | `u32` | `2048` | Context length (max tokens in context) |
| `n_gpu_layers` | `i32` | `0` | Number of layers to offload to GPU (-1 = all, 0 = CPU only) |
| `n_batch` | `u32` | `512` | Batch size for prompt processing |
| `n_threads` | `u32` | auto | Number of CPU threads (0 = auto-detect) |
| `temperature` | `f32` | `0.8` | Sampling temperature (0.0 = deterministic, >1.0 = more random) |
| `top_k` | `i32` | `40` | Top-K sampling (0 = disabled) |
| `top_p` | `f32` | `0.9` | Top-P/nucleus sampling (1.0 = disabled) |
| `repeat_penalty` | `f32` | `1.1` | Repetition penalty (1.0 = no penalty) |
Example:

```rust
use loci::EngineConfig;

let config = EngineConfig {
    model_path: "llama-2-7b.gguf".to_string(),
    n_ctx: 4096,          // Longer context
    n_gpu_layers: -1,     // Use all GPU layers
    temperature: 0.7,     // Less random
    top_k: 40,
    top_p: 0.9,
    repeat_penalty: 1.15, // Stronger penalty
    ..Default::default()
};
```

### `PerformanceStats`

Performance metrics for the engine.
```rust
pub struct PerformanceStats {
    pub prompt_eval_count: u64,
    pub prompt_eval_time_ms: u64,
    pub eval_count: u64,
    pub eval_time_ms: u64,
}

impl PerformanceStats {
    pub fn prompt_tokens_per_second(&self) -> f64;
    pub fn eval_tokens_per_second(&self) -> f64;
}
```

Example:
```rust
let stats = engine.stats();
println!("Prompt processing:");
println!("  Tokens: {}", stats.prompt_eval_count);
println!("  Time: {}ms", stats.prompt_eval_time_ms);
println!("  Speed: {:.2} t/s", stats.prompt_tokens_per_second());
println!("Generation:");
println!("  Tokens: {}", stats.eval_count);
println!("  Time: {}ms", stats.eval_time_ms);
println!("  Speed: {:.2} t/s", stats.eval_tokens_per_second());
```

## Backend API

### `ComputeBackend`

Trait for compute backend abstraction.
```rust
pub trait ComputeBackend: Send + Sync {
    fn name(&self) -> &str;
    fn device_info(&self) -> DeviceInfo;
    fn is_available(&self) -> bool;
    fn forward(&self, input: &[f32]) -> Result<Vec<f32>>;
}
```

### `BackendType`

Enumeration of supported backend types.
```rust
pub enum BackendType {
    CPU,
    CUDA,
    Metal,
    ROCm,
    Vulkan,
}
```

```rust
pub fn detect_backend() -> Box<dyn ComputeBackend>
```

Automatically detects and returns the best available backend.
Example:

```rust
use loci::detect_backend;

let backend = detect_backend();
println!("Using backend: {}", backend.name());
```

### `DeviceInfo`

Information about the compute device.
```rust
pub struct DeviceInfo {
    pub name: String,
    pub memory_mb: u64,
    pub compute_units: usize,
}
```

## GGUF Loader API

### `GGUFModel`

Zero-copy GGUF model loader using memory mapping.

```rust
pub fn from_file(path: impl AsRef<Path>) -> anyhow::Result<Self>
```

Loads a GGUF model from a file using zero-copy memory mapping.
Example:

```rust
use loci::GGUFModel;

let model = GGUFModel::from_file("path/to/model.gguf")?;
println!("Model: {}", model.metadata().name);
```

```rust
pub fn metadata(&self) -> &GGUFMetadata
```

Returns model metadata.

```rust
pub fn tensor_info(&self, name: &str) -> Option<&TensorInfo>
```

Gets information about a specific tensor.

### `GGUFMetadata`

Model metadata from the GGUF file.
```rust
pub struct GGUFMetadata {
    pub name: String,
    pub architecture: String,
    pub vocab_size: usize,
    pub context_length: usize,
    pub embedding_dim: usize,
}
```

### `TensorInfo`

Information about a tensor in the model.
```rust
pub struct TensorInfo {
    pub name: String,
    pub shape: Vec<usize>,
    pub dtype: DataType,
    pub offset: u64,
    pub size: u64,
}
```

## Sampling API

Loci uses llama.cpp's high-performance integrated sampler. Sampling parameters are configured directly in `EngineConfig`.

Supported sampling methods:

- Temperature scaling: controls randomness (configured via `temperature`)
- Top-K sampling: samples from the top K tokens (configured via `top_k`)
- Top-P (nucleus) sampling: samples from the cumulative probability mass (configured via `top_p`)
- Repetition penalty: penalizes repeated tokens (configured via `repeat_penalty`)
Example:

```rust
use loci::{LociEngine, EngineConfig};

let config = EngineConfig {
    model_path: "model.gguf".to_string(),
    temperature: 0.7,    // Moderate randomness
    top_k: 40,           // Consider top 40 tokens
    top_p: 0.9,          // Nucleus sampling threshold
    repeat_penalty: 1.1, // Slight repetition penalty
    ..Default::default()
};
let engine = LociEngine::new(config)?;
let output = engine.generate("Once upon a time", 100)?;
```

Sampling Parameter Guidelines:
| Use Case | Temperature | Top-K | Top-P | Repeat Penalty |
|---|---|---|---|---|
| Deterministic | 0.0-0.3 | 1-10 | 0.5-0.7 | 1.0-1.1 |
| Balanced | 0.7-0.8 | 40-50 | 0.9-0.95 | 1.1-1.15 |
| Creative | 0.9-1.2 | 100+ | 0.95-1.0 | 1.0-1.05 |
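To make these parameters concrete, here is a self-contained sketch of how `temperature` and `top_k` act on a raw logits vector. This is illustrative only: Loci delegates actual sampling to llama.cpp's internal sampler, and this is not the engine's code.

```rust
/// Scale logits by temperature, then keep only the top-k candidates.
/// Returns (token_id, scaled_logit) pairs, highest first.
fn apply_temperature_top_k(logits: &[f32], temperature: f32, top_k: usize) -> Vec<(usize, f32)> {
    let mut scaled: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, l / temperature)) // lower temperature sharpens the distribution
        .collect();
    // Sort descending by scaled logit and truncate to the top-k tokens.
    scaled.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scaled.truncate(top_k);
    scaled
}

fn main() {
    let logits = [2.0, 0.5, 1.0, -1.0];
    // With top_k = 2, only tokens 0 and 2 remain candidates.
    let kept = apply_temperature_top_k(&logits, 0.5, 2);
    println!("{:?}", kept);
}
```

Top-P filtering works analogously, but on the cumulative softmax probability of the sorted candidates rather than a fixed count.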
Note: The integrated sampler provides optimal performance by operating directly on llama.cpp's internal state. For advanced use cases requiring custom sampling logic, consider using the Plugin System with the on_sample hook.
## Paged Attention API

### `SessionManager`

Manages paged attention sessions with memory budgeting.

```rust
pub fn new(vram_mb: u64, ram_mb: u64, block_size_kb: usize) -> Self
```

Creates a new session manager with memory budgets.
Example:

```rust
use loci::SessionManager;

let manager = SessionManager::new(
    4096, // 4 GB VRAM
    8192, // 8 GB RAM
    256,  // 256 KB block size
);
```

```rust
pub fn create_session(&mut self) -> Result<SessionId>
```

Creates a new paged attention session.
```rust
pub fn allocate_blocks(
    &mut self,
    session_id: SessionId,
    num_blocks: usize,
) -> Result<Vec<PhysicalBlockId>>
```

Allocates physical blocks for a session.

```rust
pub fn free_session(&mut self, session_id: SessionId) -> Result<()>
```

Frees all resources associated with a session.
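Because the KV cache is paged in fixed-size blocks (see the `BLOCK_SIZE` constant below, 256 tokens per block), the number of blocks a session needs is a ceiling division. A stand-alone sketch of that arithmetic, not library code:

```rust
// Tokens per physical block, matching the documented BLOCK_SIZE constant.
const BLOCK_SIZE: usize = 256;

/// Number of physical blocks needed to hold `num_tokens` of KV cache.
fn blocks_needed(num_tokens: usize) -> usize {
    num_tokens.div_ceil(BLOCK_SIZE)
}

fn main() {
    // A 2048-token context fits exactly in 8 blocks; one more token
    // spills into a 9th block.
    println!("{}", blocks_needed(2048));
    println!("{}", blocks_needed(2049));
}
```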
### `BlockTable`

Maps logical blocks to physical blocks.

```rust
pub struct BlockTable {
    pub session_id: SessionId,
    pub mapping: HashMap<LogicalBlockId, PhysicalBlockId>,
}

pub const BLOCK_SIZE: usize = 256; // Tokens per block
```

## Constraint Sampling API

### `Constraint`

Trait for implementing sampling constraints.
```rust
pub trait Constraint: Send + Sync {
    fn apply(&self, ctx: &ConstraintContext) -> Result<TokenMask>;
    fn reset(&mut self);
}
```

### `RegexConstraint`

Constrains output to match a regular expression.

```rust
pub fn new(pattern: &str) -> Result<Self>
```

Example:

```rust
use loci::RegexConstraint;

let constraint = RegexConstraint::new(r"^\d{3}-\d{4}$")?;
```

### `JsonSchemaConstraint`

Constrains output to valid JSON matching a schema.
```rust
pub fn new(schema: JsonType) -> Self
```

Example:

```rust
use loci::{JsonSchemaConstraint, JsonType};

let schema = JsonType::Object(vec![
    ("name".to_string(), JsonType::String),
    ("age".to_string(), JsonType::Number),
]);
let constraint = JsonSchemaConstraint::new(schema);
```

### `JsonType`

JSON schema type enumeration.
```rust
pub enum JsonType {
    String,
    Number,
    Boolean,
    Null,
    Array(Box<JsonType>),
    Object(Vec<(String, JsonType)>),
}
```

### `TokenMask`

Efficient token filtering mask.

```rust
pub struct TokenMask {
    pub allowed_tokens: Vec<usize>,
}
```

## Suspend/Resume API

### `SuspendableSession`

Session with suspend/resume capabilities.
```rust
pub fn new(session_id: SessionId) -> Self
```

```rust
pub fn suspend(&mut self, reason: SuspendReason) -> Result<ResumeContext>
```

Suspends the session and returns a resume context.

Example:

```rust
use loci::{SuspendableSession, SuspendReason};

let mut session = SuspendableSession::new(session_id);
let resume_ctx = session.suspend(SuspendReason::ToolCall {
    tool_name: "web_search".to_string(),
    arguments: args,
})?;
```

```rust
pub fn resume(&mut self, ctx: ResumeContext) -> Result<()>
```

Resumes the session from a previous suspension.
### `SuspendReason`

Reason for suspending a session.

```rust
pub enum SuspendReason {
    ToolCall { tool_name: String, arguments: HashMap<String, String> },
    UserInput,
    Approval,
    Custom(String),
}
```

### `ResumeContext`

Context for resuming a suspended session.

```rust
pub struct ResumeContext {
    pub session_id: SessionId,
    pub state: SessionState,
    pub injection: Option<String>,
    pub injection_type: InjectionType,
}
```

## Radix Tree Caching API

### Radix Tree

Prefix-sharing data structure for the KV cache.
```rust
pub fn new() -> Self
```

```rust
pub fn insert(&mut self, tokens: &[TokenId]) -> NodeId
```

Inserts a token sequence and returns the leaf node ID.

```rust
pub fn search(&self, tokens: &[TokenId]) -> Option<NodeId>
```

Searches for a token sequence; returns the node ID if found.

```rust
pub fn get_cache_blocks(&self, node_id: NodeId) -> Vec<CacheBlockId>
```

Gets all cache blocks associated with a node.

```rust
pub fn stats(&self) -> RadixTreeStats
```

Returns statistics about the tree.
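The savings reported by the tree come from prefix sharing: requests that begin with the same tokens (for example, a common system prompt) reuse the same cache blocks instead of recomputing them. A stand-alone sketch of the shared span two sequences can reuse, not the tree's implementation:

```rust
/// Length of the longest common prefix between two token sequences.
/// Tokens in this range can share KV-cache blocks across requests.
fn shared_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Two prompts that share a 3-token prefix (e.g. a system prompt).
    let req1 = [101, 7592, 2088, 999];
    let req2 = [101, 7592, 2088, 1012, 2023];
    println!("shared tokens: {}", shared_prefix_len(&req1, &req2));
}
```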
### `RadixTreeStats`

Statistics for the radix tree.

```rust
pub struct RadixTreeStats {
    pub total_nodes: usize,
    pub total_tokens: usize,
    pub shared_tokens: usize,
    pub memory_saved_bytes: usize,
}
```

### Cache Manager

Manages the KV cache with radix tree integration.

```rust
pub fn new() -> Self
```

```rust
pub fn get_or_allocate(
    &mut self,
    tokens: &[TokenId],
) -> Result<(NodeId, Vec<CacheBlockId>)>
```

Gets or allocates cache blocks for a token sequence.
## Plugin System API

### `Plugin`

Trait for implementing plugins.

```rust
pub trait Plugin: Send + Sync {
    fn metadata(&self) -> &PluginMetadata;
    fn on_load(&mut self) -> Result<()> { Ok(()) }
    fn on_sample(&self, ctx: &mut PluginContext) -> Result<PluginControlFlow>;
    fn on_unload(&mut self) -> Result<()> { Ok(()) }
}
```

### `PluginRegistry`

Dual-track plugin registry (Native + WASM).
```rust
pub fn new() -> Self
```

```rust
pub fn register_native(&mut self, plugin: Box<dyn Plugin>) -> Result<String>
```

Registers a native plugin.
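A minimal plugin might look like the following sketch. The trait and context here are simplified stand-ins for the crate's `Plugin`, `PluginMetadata`, `PluginContext`, and `PluginControlFlow` types (the real `PluginContext` exposes a `LogitsView<'a>`, not a plain `Vec<f32>`), so treat this as an illustration of the hook pattern rather than working plugin code:

```rust
// Simplified stand-in types; the crate's real definitions differ.
struct PluginMetadata { name: String, version: String }
struct PluginContext { logits: Vec<f32>, tokens: Vec<usize> }
enum PluginControlFlow { Continue }

trait Plugin: Send + Sync {
    fn metadata(&self) -> &PluginMetadata;
    fn on_sample(&self, ctx: &mut PluginContext) -> Result<PluginControlFlow, String>;
}

/// A plugin that suppresses one banned token by forcing its logit to -inf.
struct BanTokenPlugin { meta: PluginMetadata, banned: usize }

impl Plugin for BanTokenPlugin {
    fn metadata(&self) -> &PluginMetadata { &self.meta }
    fn on_sample(&self, ctx: &mut PluginContext) -> Result<PluginControlFlow, String> {
        if self.banned < ctx.logits.len() {
            ctx.logits[self.banned] = f32::NEG_INFINITY; // never sampled
        }
        Ok(PluginControlFlow::Continue)
    }
}

fn main() {
    let plugin = BanTokenPlugin {
        meta: PluginMetadata { name: "ban-token".into(), version: "0.1.0".into() },
        banned: 1,
    };
    let mut ctx = PluginContext { logits: vec![1.0, 5.0, 0.2], tokens: vec![] };
    plugin.on_sample(&mut ctx).unwrap();
    println!("{:?}", ctx.logits); // token 1 is now -inf
}
```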
Example:

```rust
use loci::PluginRegistry;

let mut registry = PluginRegistry::new();
let plugin_id = registry.register_native(Box::new(MyPlugin))?;
```

```rust
pub fn register_wasm(
    &mut self,
    path: impl AsRef<Path>,
    signature: Option<&[u8]>,
) -> Result<String>
```

Registers a WASM plugin with optional signature verification.

```rust
pub fn invoke_all(&self, ctx: &mut PluginContext) -> Result<PluginControlFlow>
```

Invokes all registered plugins on the given context.
### `PluginMetadata`

Plugin metadata.

```rust
pub struct PluginMetadata {
    pub name: String,
    pub version: String,
    pub author: String,
    pub description: String,
}
```

### `PluginContext`

Context passed to plugin hooks.

```rust
pub struct PluginContext<'a> {
    pub logits: LogitsView<'a>,
    pub tokens: &'a [usize],
    pub metadata: HashMap<String, String>,
}
```

## Model Registry API

Global registry for managing multiple models and LoRA adapters.
```rust
pub fn load_model(&mut self, model_id: ModelID, path: impl AsRef<Path>) -> Result<()>
```

Loads a model into the registry.

Example:

```rust
use loci::MODEL_REGISTRY;

MODEL_REGISTRY.load_model("llama-7b".to_string(), "path/to/llama-7b.gguf")?;
```

```rust
pub fn switch_model(&mut self, model_id: &ModelID) -> Result<()>
```

Switches the active model at runtime.
```rust
pub fn load_lora(&mut self, lora_id: LoRAID, path: impl AsRef<Path>) -> Result<()>
```

Loads a LoRA adapter.

```rust
pub fn apply_lora(&mut self, model_id: &ModelID, lora_id: &LoRAID, scale: f32) -> Result<()>
```

Applies a LoRA adapter to a model.
## LoRA API

### `LoRAConfig`

Configuration for a LoRA adapter.

```rust
pub struct LoRAConfig {
    pub rank: usize,
    pub alpha: f32,
    pub target_modules: Vec<String>,
}
```

### `LoRAManager`

Manages LoRA weight merging.

```rust
pub fn new() -> Self
```

```rust
pub fn load_lora(&mut self, path: impl AsRef<Path>) -> Result<LoRAModel>
```

Loads a LoRA model from GGUF format.
```rust
pub fn merge_lora(
    &self,
    base_weight: &[f32],
    lora: &LoRALayer,
    scale: f32,
) -> Result<Vec<f32>>
```

Merges LoRA weights: `W' = W + scale * (A @ B)`.
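The formula can be checked with plain slices. The following is an illustrative stand-alone computation of `W' = W + scale * (A @ B)` on row-major matrices, not the library's implementation; note that the crate's `lora_a`/`lora_b` tensor layouts differ from the generic `[out, r]` x `[r, in]` convention used here:

```rust
/// Merge a rank-r low-rank update into a base weight matrix (row-major).
/// `a` is [out, r], `b` is [r, in], `w` is [out, in].
fn merge_lora(w: &[f32], a: &[f32], b: &[f32], out: usize, inp: usize, r: usize, scale: f32) -> Vec<f32> {
    let mut merged = w.to_vec();
    for i in 0..out {
        for j in 0..inp {
            let mut delta = 0.0;
            for k in 0..r {
                delta += a[i * r + k] * b[k * inp + j]; // (A @ B)[i][j]
            }
            merged[i * inp + j] += scale * delta;
        }
    }
    merged
}

fn main() {
    // 2x2 identity base weight; rank-1 update A = [[1], [2]], B = [[0.5, 0.5]].
    let w = [1.0, 0.0, 0.0, 1.0];
    let merged = merge_lora(&w, &[1.0, 2.0], &[0.5, 0.5], 2, 2, 1, 1.0);
    println!("{:?}", merged);
}
```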
Example:

```rust
use loci::LoRAManager;

let mut manager = LoRAManager::new();
let lora = manager.load_lora("path/to/lora.gguf")?;
let merged = manager.merge_lora(&base_weights, &lora.layers[0], 1.0)?;
```

### `LoRAModel`

LoRA model structure.
```rust
pub struct LoRAModel {
    pub config: LoRAConfig,
    pub layers: Vec<LoRALayer>,
}
```

### `LoRALayer`

Individual LoRA layer.

```rust
pub struct LoRALayer {
    pub name: String,
    pub lora_a: LoRATensor, // [rank, in_features]
    pub lora_b: LoRATensor, // [out_features, rank]
}
```

## Model Encryption API

### `EncryptedModelLoader`

Loads and decrypts AES-256-GCM encrypted models.
```rust
pub fn new(key_source: KeySource) -> Result<Self>
```

Example:

```rust
use loci::{EncryptedModelLoader, KeySource};

let loader = EncryptedModelLoader::new(
    KeySource::Environment("LOCI_ENCRYPTION_KEY".to_string()),
)?;
```

```rust
pub fn load_encrypted_model(&self, path: impl AsRef<Path>) -> Result<Vec<u8>>
```

Loads and decrypts a model file.

```rust
pub fn encrypt_model(
    &self,
    input_path: impl AsRef<Path>,
    output_path: impl AsRef<Path>,
) -> Result<()>
```

Encrypts a model file with AES-256-GCM.
### `KeySource`

Source for encryption keys.

```rust
pub enum KeySource {
    Environment(String),
    File(PathBuf),
    KMS { endpoint: String, key_id: String },
    Hardware,
    Direct(Zeroizing<Vec<u8>>),
}
```

```rust
pub fn generate_key() -> Result<Zeroizing<Vec<u8>>>
```

Generates a cryptographically secure 256-bit encryption key.

## Multi-Tenancy API

### `TenantManager`

Manages multi-tenant resource isolation.
```rust
pub fn new() -> Self
```

```rust
pub fn create_tenant(&mut self, quota: TenantQuota) -> Result<TenantID>
```

Creates a new tenant with the specified quota.
Example:

```rust
use loci::{TenantManager, TenantQuota};

let mut manager = TenantManager::new();
let tenant_id = manager.create_tenant(TenantQuota::Enterprise {
    max_sessions: 100,
    max_memory_mb: 16384,
    max_tokens_per_session: 128000,
})?;
```

```rust
pub fn get_context(&self, tenant_id: &TenantID) -> Result<Arc<TenantContext>>
```

Gets the isolated context for a tenant.

```rust
pub fn delete_tenant(&mut self, tenant_id: &TenantID) -> Result<()>
```

Deletes a tenant and frees all of its resources.
### `TenantQuota`

Quota tiers for tenants.

```rust
pub enum TenantQuota {
    Free {
        max_sessions: usize,
        max_memory_mb: u64,
        max_tokens_per_session: usize,
    },
    Default {
        max_sessions: usize,
        max_memory_mb: u64,
        max_tokens_per_session: usize,
    },
    Enterprise {
        max_sessions: usize,
        max_memory_mb: u64,
        max_tokens_per_session: usize,
    },
}
```

## Multimodal API

### `VisionEncoder`

Trait for vision encoders.
```rust
pub trait VisionEncoder: Send + Sync {
    fn encode_image(&self, image: &ImageBuffer) -> Result<Vec<f32>>;
    fn embedding_dim(&self) -> usize;
    fn supported_sizes(&self) -> Vec<(u32, u32)>;
}
```

### `CLIPVisionEncoder`

CLIP ViT-L/14@336 implementation.

```rust
pub fn new() -> Self
```

```rust
pub fn encode_image(&self, image: &ImageBuffer) -> Result<Vec<f32>>
```

Encodes an image into embeddings.
Example:

```rust
use loci::{CLIPVisionEncoder, ImageBuffer};

let encoder = CLIPVisionEncoder::new();
let image = ImageBuffer::from_file("image.jpg")?;
let embeddings = encoder.encode_image(&image)?;
```

### `MultimodalKVCache`

Unified KV cache for text and image tokens.

```rust
pub struct MultimodalKVCache {
    pub text_cache: Vec<(Vec<f32>, Vec<f32>)>,
    pub image_cache: Vec<(Vec<f32>, Vec<f32>)>,
    pub token_types: Vec<TokenType>,
}
```

### `TokenType`

Token type enumeration.
```rust
pub enum TokenType {
    Text,
    Image,
}
```

## Quantization API

### `QuantizationType`

Supported quantization formats.

```rust
pub enum QuantizationType {
    None,
    FP16,
    Q8_0,
    Q4_0,
    Q4_K_M,
    IQ2_XXS,   // 2-bit importance weighted
    BitNet158, // Ternary {-1, 0, +1}
}
```

### `QuantizationManager`

Manages quantization schemes.
```rust
pub fn quantize(&self, data: &[f32], qtype: QuantizationType) -> Result<QuantizedTensor>
```

Quantizes float32 data to the specified format.

Example:

```rust
use loci::{QuantizationManager, QuantizationType};

let manager = QuantizationManager::new();
let quantized = manager.quantize(&data, QuantizationType::IQ2_XXS)?;
```

```rust
pub fn dequantize(&self, tensor: &QuantizedTensor) -> Result<Vec<f32>>
```

Dequantizes a tensor back to float32.
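As an illustration of the cheapest scheme in the list, here is a stand-alone sketch of BitNet-style ternary quantization: each weight is mapped to -1, 0, or +1 using a zero threshold, matching the `zero_threshold` field documented below. This is not the crate's `BitNet158` implementation.

```rust
/// Quantize weights to {-1, 0, +1}: values within `zero_threshold`
/// of zero become 0, the rest keep only their sign.
fn quantize_ternary(data: &[f32], zero_threshold: f32) -> Vec<i8> {
    data.iter()
        .map(|&x| {
            if x.abs() <= zero_threshold { 0 }
            else if x > 0.0 { 1 }
            else { -1 }
        })
        .collect()
}

fn main() {
    let weights = [0.8, -0.02, 0.04, -1.3];
    println!("{:?}", quantize_ternary(&weights, 0.05)); // [1, 0, 0, -1]
}
```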
### `IQ2_XXS`

2-bit importance-weighted quantization.

```rust
pub struct IQ2_XXS {
    pub block_size: usize, // 32 elements
    pub importance_threshold: f32,
}
```

### `BitNet158`

Ternary quantization {-1, 0, +1}.

```rust
pub struct BitNet158 {
    pub zero_threshold: f32,
}
```

## Kernel Fusion API

### `KernelFusionManager`

Manages fused kernels for performance optimization.
```rust
pub fn create_rmsnorm_rope(
    rmsnorm_params: RMSNormParams,
    rope_params: RoPEParams,
) -> Result<RMSNormRoPEFusion>
```

Creates a fused RMSNorm+RoPE kernel.
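For reference, the RMSNorm half of the fused kernel computes `y_i = x_i / sqrt(mean(x^2) + eps) * w_i`. A stand-alone, unfused sketch of that math (illustrative only, not the optimized kernel):

```rust
/// Unfused reference RMSNorm: normalize by root-mean-square, then scale by weights.
fn rmsnorm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

fn main() {
    let x = [3.0, 4.0]; // mean square = 12.5
    let w = [1.0, 1.0];
    println!("{:?}", rmsnorm(&x, &w, 1e-5));
}
```

Fusing this normalization with the subsequent RoPE rotation avoids one full read/write pass over the activations, which is where the latency saving comes from.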
Example:

```rust
use loci::{KernelFusionManager, RMSNormParams, RoPEParams};

let fusion = KernelFusionManager::create_rmsnorm_rope(
    RMSNormParams { dim: 4096, eps: 1e-5 },
    RoPEParams { dim: 128, base: 10000.0, max_seq_len: 2048 },
)?;
let output = fusion.forward(&input, position)?;
```

### `RMSNormRoPEFusion`

Fused RMSNorm + RoPE kernel (-30% latency).

```rust
pub struct RMSNormRoPEFusion {
    pub rmsnorm_params: RMSNormParams,
    pub rope_params: RoPEParams,
}
```

### `MatMulAddFusion`

Fused MatMul + bias add (+18% throughput).
```rust
pub struct MatMulAddFusion {
    pub weight: Vec<f32>,
    pub bias: Vec<f32>,
    pub in_features: usize,
    pub out_features: usize,
}
```

## Configuration API

### `ConfigLoader`

Loads and validates configuration from files and environment variables.

```rust
pub fn new() -> Self
```

```rust
pub fn from_file(path: impl AsRef<Path>) -> Result<Self>
```

Loads configuration from a TOML or JSON file.
Example:

```rust
use loci::ConfigLoader;

let config = ConfigLoader::from_file("loci.toml")?
    .with_env_overrides()
    .build()?;
```

```rust
pub fn with_env_overrides(self) -> Self
```

Applies environment variable overrides.

Environment Variables:

- `LOCI_MODEL_PATH`: Model file path
- `LOCI_BACKEND`: Backend type (cpu/cuda/metal/rocm)
- `LOCI_N_GPU_LAYERS`: Number of GPU layers
- `LOCI_LOG_LEVEL`: Logging level
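The override mechanism can be pictured with `std::env` (an illustrative sketch of the pattern, not `ConfigLoader`'s implementation):

```rust
use std::env;

/// Read an i32 override from the environment, falling back to a default
/// when the variable is unset or does not parse.
fn env_override_i32(var: &str, default: i32) -> i32 {
    env::var(var)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // With LOCI_N_GPU_LAYERS unset (or unparsable), the default wins.
    let n_gpu_layers = env_override_i32("LOCI_N_GPU_LAYERS", 0);
    println!("n_gpu_layers = {}", n_gpu_layers);
}
```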
```rust
pub fn validate(&self) -> Result<()>
```

Validates the configuration.

```rust
pub fn build(self) -> Result<LociConfig>
```

Builds the final configuration after validation.

### `LociConfig`

Global configuration structure.

```rust
pub struct LociConfig {
    pub engine: EngineSettings,
    pub backend: BackendSettings,
    pub memory: MemorySettings,
    pub plugins: PluginSettings,
    pub logging: LoggingSettings,
    pub server: ServerSettings,
}
```

## Error Handling

All API functions that can fail return `anyhow::Result<T>` for flexible error handling.
Example:

```rust
use anyhow::{Result, Context};

fn my_function() -> Result<()> {
    let engine = LociEngine::new(config)
        .context("Failed to initialize engine")?;
    let output = engine.generate(prompt, 100)
        .context("Generation failed")?;
    Ok(())
}
```

## Type Aliases

Common type aliases used throughout the API:
```rust
pub type SessionId = u64;
pub type TokenId = u32;
pub type NodeId = usize;
pub type ModelID = String;
pub type LoRAID = String;
pub type TenantID = uuid::Uuid;
```

## Performance Tips

- Use mmap for large models: set `use_mmap: true` in `EngineConfig`
- Optimize batch size: adjust `batch_size` based on available memory
- Enable kernel fusion: set `enable_fusion: true` in the backend config
- Use GPU layers: set `n_gpu_layers: -1` for maximum GPU utilization
- Enable radix tree caching: automatically enabled for prefix sharing
- Use appropriate quantization: Q4_K_M for balance, IQ2_XXS for maximum compression
## Thread Safety

- All public APIs are thread-safe unless otherwise noted
- `LociEngine` uses internal locking for concurrent access
- `PluginRegistry` is thread-safe
- `SessionManager` is thread-safe
- `TenantManager` is thread-safe
## Versioning

This API reference is for Loci version 0.1.0.

Minimum Rust Version: 1.85+

Last Updated: 2026-01-01
License: MIT