Compute Engine for Building Data Platforms
Indexify is a compute engine for building data platforms in Python. Create large-scale data processing workflows and agentic applications with durable execution—functions automatically retry on failure, and workflows seamlessly scale across machines. Upon deployment, each application gets a unique URL that can be called from any system.
Note: Indexify is the open-source core that powers Tensorlake Cloud—a serverless platform for document processing, media pipelines, and agentic applications.
| Feature | Description |
|---|---|
| 🐍 Python Native | Define workflows as Python functions with type hints—no DSLs, YAML, or config files |
| 🔄 Durable Execution | Functions automatically retry on failure with persistent state across restarts |
| 📊 Distributed Map/Reduce | Parallelize functions over sequences across machines with automatic data shuffling (see the sketch below) |
| ⚡ Request Queuing | Automatically queue and batch invocations to maximize GPU utilization |
| 🌐 Multi-Cloud | Run across multiple clouds, datacenters, or regions with minimal configuration |
| 📈 Autoscaling | Server automatically redistributes work when machines come and go |
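To make the map/reduce row concrete, here is a minimal sketch written in the style of the quickstart below. The `square` and `total` functions are ours, and the assumption that fanning a `@function` out over a sequence is what gets distributed is based on the feature table, not documented SDK behavior:

```python
from typing import List

from tensorlake.applications import application, function

# Hypothetical map step; only the decorator comes from the quickstart below.
@function()
def square(n: int) -> int:
    return n * n

# Hypothetical reduce step: aggregates the mapped results.
@function()
def total(values: List[int]) -> int:
    return sum(values)

@application()
@function(description="Map/reduce sketch")
def sum_of_squares(nums: List[int]) -> int:
    # Assumption: per-item calls to a @function are the pattern the
    # Distributed Map/Reduce feature parallelizes across machines.
    return total([square(n) for n in nums])
```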
Build production-grade data pipelines entirely in Python with automatic parallelization, fault tolerance, and distributed execution:
- Document Processing — Extract tables, images, and text from PDFs at scale; build knowledge graphs; implement RAG pipelines
- Media Pipelines — Transcribe and summarize video/audio content; detect and describe objects in images
- ETL & Data Transformation — Process millions of records with distributed map/reduce operations
Build durable AI agents that reliably execute multi-step workflows:
- Tool-Calling Agents — Orchestrate LLM tool calls with automatic state management and retry logic (see the sketch after this list)
- Multi-Agent Systems — Coordinate multiple agents with durable message passing
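As a sketch of the tool-calling pattern, each tool can itself be a `@function`, so retries and state persistence apply per step. `search_web` and `call_llm` below are hypothetical stand-ins, not SDK or provider APIs:

```python
from tensorlake.applications import application, function

# Hypothetical tool: as a @function it runs in its own sandbox and is
# retried on failure like any other step.
@function()
def search_web(query: str) -> str:
    return f"top results for: {query}"  # stand-in for a real search API call

# Hypothetical model call: substitute your LLM client of choice here.
@function()
def call_llm(prompt: str) -> str:
    return f"draft answer for: {prompt}"  # stand-in for a real LLM request

@application()
@function(description="Tool-calling agent sketch")
def research_agent(question: str) -> str:
    # Each step is a separate durable function call, so a crash between
    # steps resumes the workflow rather than re-running finished tool calls
    # (per the durable-execution guarantees described above).
    evidence = search_web(question)
    return call_llm(f"Question: {question}\nEvidence: {evidence}")
```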
📖 Explore the Cookbooks → for complete examples and tutorials.
Using pip:

```bash
pip install indexify tensorlake
```

Create applications using the `@application()` and `@function()` decorators. Each function runs in its own isolated sandbox with durable execution—if a function crashes, it automatically restarts from where it left off.
```python
from typing import List

from pydantic import BaseModel
from tensorlake.applications import application, function, Image, run_local_application

# Define container image with dependencies
embedding_image = Image(base_image="python:3.11-slim", name="embedding_image").run(
    "pip install sentence-transformers langchain-text-splitters chromadb"
)

class TextChunk(BaseModel):
    chunk: str
    page_number: int

class ChunkEmbedding(BaseModel):
    text: str
    embedding: List[float]

@function(image=embedding_image)
def chunk_text(text: str) -> List[TextChunk]:
    """Split text into chunks for embedding."""
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
    texts = splitter.create_documents([text])
    return [
        TextChunk(chunk=chunk.page_content, page_number=i)
        for i, chunk in enumerate(texts)
    ]

@function(image=embedding_image)
def embed_chunks(chunks: List[TextChunk]) -> List[ChunkEmbedding]:
    """Embed text chunks using sentence transformers."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return [
        ChunkEmbedding(text=chunk.chunk, embedding=model.encode(chunk.chunk).tolist())
        for chunk in chunks
    ]

@function(image=embedding_image)
def write_to_vectordb(embeddings: List[ChunkEmbedding]) -> str:
    """Write embeddings to ChromaDB."""
    import chromadb
    import uuid

    client = chromadb.PersistentClient("./chromadb_data")
    collection = client.get_or_create_collection("documents")
    for emb in embeddings:
        collection.upsert(
            ids=[str(uuid.uuid4())],
            embeddings=[emb.embedding],
            documents=[emb.text],
        )
    return f"Indexed {len(embeddings)} chunks"

@application()
@function(description="Text embedding pipeline")
def text_embedder(text: str) -> str:
    """Main application: chunks text, embeds it, and stores in vector DB."""
    chunks = chunk_text(text)
    embeddings = embed_chunks(chunks)
    result = write_to_vectordb(embeddings)
    return result
```

Tensorlake Cloud is the fastest way to test and deploy your applications—no infrastructure setup required. Get an API key and deploy in seconds:
```bash
# Set your API key
export TENSORLAKE_API_KEY="your-api-key"

# Deploy the application
tensorlake deploy workflow.py
# => Deployed! URL: https://api.tensorlake.ai/namespaces/default/applications/text_embedder
```

Invoke your application using the SDK or call the URL directly:
```python
from tensorlake.applications import run_remote_application

request = run_remote_application(text_embedder, "Your document text here...")
result = request.output()
print(result)
```
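Because every deployed application gets a URL, it can also be invoked over plain HTTP from any system. The exact request shape is not documented in this README, so treat the following as a hypothetical sketch (the JSON payload format and Bearer auth header are assumptions; the URL matches the deploy output above):

```python
import json
import os
import urllib.request

# Hypothetical direct-HTTP invocation; payload shape and auth are assumptions.
url = "https://api.tensorlake.ai/namespaces/default/applications/text_embedder"
request = urllib.request.Request(
    url,
    data=json.dumps("Your document text here...").encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ['TENSORLAKE_API_KEY']}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```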
If you prefer to self-host or need on-premise deployment, you can run the Indexify server locally:

```bash
# Terminal 1: Start the server
docker run -p 8900:8900 tensorlake/indexify-server

# Terminal 2: Start an executor (repeat for more parallelism)
indexify-cli executor
```

Set the API URL and deploy:
```bash
export TENSORLAKE_API_URL=http://localhost:8900
tensorlake deploy workflow.py
# => Deployed! URL: http://localhost:8900/namespaces/default/applications/text_embedder
```

Run your application:
```python
from tensorlake.applications import run_remote_application

request = run_remote_application(text_embedder, "Your document text here...")
result = request.output()
print(result)
```

For quick iteration during development, run applications locally without any infrastructure:
```python
if __name__ == "__main__":
    request = run_local_application(text_embedder, "Your document text here...")
    result = request.output()
    print(result)
```

For production self-hosted deployments, see operations/k8s for Kubernetes deployment manifests and Helm charts.
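Since local and remote runs accept the same application object, one entry point can serve both. The environment-variable toggle below is our convention, not an SDK feature, and assumes this file defines `text_embedder` as in the quickstart:

```python
import os

from tensorlake.applications import run_local_application, run_remote_application

if __name__ == "__main__":
    text = "Your document text here..."
    # Our convention: set TENSORLAKE_RUN_REMOTE=1 to hit the deployed app,
    # otherwise run the same application in-process for quick iteration.
    if os.environ.get("TENSORLAKE_RUN_REMOTE"):
        request = run_remote_application(text_embedder, text)
    else:
        request = run_local_application(text_embedder, text)
    print(request.output())
```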
Start with Tensorlake Cloud to build and test your applications without infrastructure overhead. When you're ready for self-hosting or need on-premise deployment, Indexify provides the same runtime you can run anywhere.
| Feature | Tensorlake Cloud | Indexify (Self-Hosted) |
|---|---|---|
| Setup Time | Instant—just get an API key | Deploy server + executors |
| Image Building | Automatic image builds when you deploy | Build and manage container images yourself |
| Auto-Scaling | Dynamic container scaling with scale-to-zero | Manual executor management |
| Security | Secure sandboxes (gVisor, Linux containers, virtualization) | Standard container isolation |
| Secrets | Built-in secret management for applications | Manage secrets externally |
| Observability | Logging, tracing, and observability built-in | Bring your own logging/tracing |
| Testing | Interactive playground to invoke applications | Local development only |
Get started with Tensorlake Cloud →
We welcome contributions! See CONTRIBUTING.md for guidelines.
Indexify is licensed under the Apache 2.0 License.