Cross-platform & energy-efficient kernels, runtime and AI inference engine for mobile devices.
┌─────────────────┐
│ Cactus FFI │ ←── OpenAI compatible C API for integration (tools, RAG, cloud handoff)
└─────────────────┘
│
┌─────────────────┐
│ Cactus Engine │ ←── High-level transformer engine (NPU support, INT4/INT8/FP16/MIXED)
└─────────────────┘
│
┌─────────────────┐
│ Cactus Models │ ←── Implements SOTA models using Cactus Graphs
└─────────────────┘
│
┌─────────────────┐
│ Cactus Graph │ ←── Unified zero-copy computation graph (think NumPy for mobile)
└─────────────────┘
│
┌─────────────────┐
│ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations (think CUDA for mobile)
└─────────────────┘
Cactus Graph & Kernel
#include "cactus.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
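The shapes flowing through the graph above can be sanity-checked without Cactus at all. The sketch below (plain C++, no Cactus dependency; the shape rules are standard matmul/transpose semantics, not taken from the Cactus source) traces matmul → transpose → matmul:

```cpp
#include <array>
#include <cassert>

// Stand-in shape algebra for the graph example above.
using Shape = std::array<int, 2>;

Shape matmul_shape(Shape a, Shape b) {
    assert(a[1] == b[0]);        // inner dimensions must agree
    return {a[0], b[1]};
}

Shape transpose_shape(Shape s) { return {s[1], s[0]}; }
```

Applying these rules to the example: `x1 = a{2,3} × b{3,4} → {2,4}`, `x2 = transpose(x1) → {4,2}`, and the final `b{3,4} × x2{4,2} → {3,2}`, so `get_output(result)` holds 6 values.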
Cactus Engine & FFI
#include "cactus.h"
cactus_set_pro_key(""); // email founders@cactuscompute.com for an optional key

cactus_model_t model = cactus_init(
    "path/to/weight/folder",          // see the weight-generation section below
    "txt/or/md/file/or/dir/with/many" // nullptr if none; Cactus does automatic fast RAG
);
const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";
const char* options = R"({"max_tokens": 50, "stop_sequences": ["<|im_end|>"]})";
char response[4096];
int result = cactus_complete(
    model,             // model handle from cactus_init
    messages,          // JSON array of chat messages
    response,          // buffer to store response JSON
    sizeof(response),  // size of response buffer
    options,           // optional: generation options (nullptr for defaults)
    nullptr,           // optional: tools JSON for function calling
    nullptr,           // optional: streaming callback fn(token, id, user_data)
    nullptr            // optional: user data passed to callback
);
Example response from Gemma3-270m
{
    "success": true,                  // true when generated successfully on-device
    "error": null,                    // specific error when success = false
    "cloud_handoff": false,           // true when the model is unconfident; simply route to cloud
    "response": "Hi there!",          // null when error is non-null or cloud_handoff = true
    "function_calls": [],             // parsed, e.g. [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
    "confidence": 0.8193,             // how confident the model is in its response
    "time_to_first_token_ms": 45.23,  // latency (time to first token)
    "total_time_ms": 163.67,          // total execution time
    "prefill_tps": 1621.89,           // prefill tokens per second
    "decode_tps": 168.42,             // decode tokens per second
    "ram_usage_mb": 245.67,           // current process RAM usage in MB
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
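A caller typically branches on `success`, `error`, and `cloud_handoff` after parsing the response JSON. The routing sketch below is illustrative (field names mirror the example response; the struct and policy are not part of the Cactus API):

```cpp
#include <string>

// Hedged sketch of host-side handling of a cactus_complete result;
// assume the JSON has already been parsed into this struct.
struct CompletionResult {
    bool success;
    bool cloud_handoff;
    std::string error;
    std::string response;
    double confidence;
};

std::string route(const CompletionResult& r) {
    if (!r.success)      return "error: " + r.error;
    if (r.cloud_handoff) return "route-to-cloud";  // model was unconfident locally
    return r.response;                             // use the on-device answer
}
```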
Performance
Models: LFM2-VL-450m & Whisper-Small
Precision: Cactus smartly blends INT4, INT8 and FP16 across all weights.
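To illustrate the mixed-precision idea, here is a minimal symmetric per-tensor INT8 quantizer (a generic textbook scheme, not Cactus's actual one): tolerant tensors can drop to INT8/INT4 with bounded round-trip error, while sensitive tensors stay FP16.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One scale per tensor, mapping [-max_abs, max_abs] onto [-127, 127].
float tensor_scale(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs / 127.0f;
}

// Worst-case |w - dequantize(quantize(w))| over the tensor.
float roundtrip_error(const std::vector<float>& w) {
    const float scale = tensor_scale(w);
    float worst = 0.0f;
    for (float v : w) {
        auto q = static_cast<int8_t>(std::lround(v / scale));  // quantize
        worst = std::max(worst, std::fabs(v - static_cast<float>(q) * scale));
    }
    return worst;  // with round-to-nearest, bounded by scale / 2
}
```

The bound `roundtrip_error(w) <= tensor_scale(w) / 2` is what makes INT8 safe for weight tensors whose values are well spread; outlier-heavy tensors inflate the scale, which is one reason a blend of precisions beats a single global choice.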