We're building a unified multimodal data storage and orchestration solution powered by Pixeltable. It enables incremental storage, transformation, indexing, and orchestration of your multimodal data—providing a single, seamless way to store and search across text, images, audio, and video.
To demonstrate the usage, we have built MCP servers on top of Pixeltable's infrastructure for each modality and connected them to agents powered by local LLMs. You can also use these servers as part of your own solution.
How It Works:
- Query Submission: A user submits a query of any modality (text, image, video, or audio).
- Smart Routing: A Router Agent classifies the query and directs it to the appropriate specialist.
- Specialist Execution: The designated Specialist Agent (Document, Image, Video, or Audio) uses its dedicated Pixeltable MCP server to execute the task—be it indexing, insertion, or searching.
- Response Synthesis: The output is then passed to a Synthesis Agent.
- Final Output: This final agent refines the retrieved information into a polished, user-friendly response.
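The flow above can be sketched in plain Python. This is a minimal illustration, not the actual implementation: the real Router Agent is LLM-powered, while here modality detection is a stand-in based on file extensions, and the specialist/synthesis steps are represented by placeholder strings.

```python
# Toy sketch of the query pipeline: route -> specialist -> synthesis.
# All function names and the dispatch table are illustrative.

def classify_modality(query: str) -> str:
    """Stand-in router: infer modality from a file extension, default to document."""
    ext_map = {
        ".png": "image", ".jpg": "image",
        ".mp4": "video", ".mov": "video",
        ".mp3": "audio", ".wav": "audio",
        ".pdf": "document", ".txt": "document",
    }
    for ext, modality in ext_map.items():
        if query.lower().endswith(ext):
            return modality
    return "document"  # plain-text queries go to the document specialist

def run_pipeline(query: str) -> str:
    modality = classify_modality(query)                     # Smart Routing
    raw = f"[{modality} specialist result for {query!r}]"   # Specialist Execution
    return f"Synthesized answer based on {raw}"             # Response Synthesis

print(run_pipeline("find the chorus in song.mp3"))
```

In the actual system, each branch of the dispatch corresponds to a specialist crew backed by its own Pixeltable MCP server.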
We use:
- Pixeltable for multimodal AI data infrastructure
- CrewAI for multi-agent orchestration
- Ollama for running large language models locally
Follow these steps one by one:
Create a .env file in the root directory of your project with the following content:
```
OPENAI_API_KEY=<your_openai_api_key>
```

Download and install Ollama for your operating system. Ollama is used to run large language models locally.

For example, on Linux, you can use the following command:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Pull the required model:

```bash
ollama pull gemma3
```

Install the project dependencies:

```bash
uv sync
```

To run all 4 (audio, video, image, and doc) Pixeltable MCP servers, execute the following Docker Compose command:

```bash
docker compose --env-file .env up --build
```

Each service runs on its designated port (8080 for audio, 8081 for video, 8082 for image, 8083 for doc).
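A quick way to confirm the services came up is to probe their mapped ports. This is a hedged sketch: it assumes the servers are reachable on `localhost` and only checks that something is listening on each port, nothing about the MCP protocol itself.

```python
# Readiness check for the four MCP services on their Compose-mapped ports.
import socket

PORTS = {"audio": 8080, "video": 8081, "image": 8082, "doc": 8083}

def is_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in PORTS.items():
    status = "up" if is_listening(port) else "not reachable"
    print(f"{name:5s} (port {port}): {status}")
```

A refused connection usually just means the container is still starting; re-run the check after `docker compose` finishes building.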
Our Pixeltable servers are ready, so now it's time to integrate the MCP servers as tools within CrewAI!
We will create crews of agents linked to their respective Pixeltable MCP servers for tool discovery and execution. Next, we will use the CrewAI flow to orchestrate a multimodal, multi-agent system capable of performing complex tasks such as audio and video indexing, semantic image search, and more.
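As a sketch of the wiring, each crew needs connection parameters for its MCP server before tool discovery can happen. The SSE endpoint path and the parameter shape below are assumptions (they follow the common pattern used by MCP adapters such as crewai-tools' `MCPServerAdapter`); the notebook contains the exact configuration.

```python
# Illustrative server parameters for connecting each specialist crew to its
# Pixeltable MCP server. The "/sse" path and "transport" key are assumptions.

def mcp_params(port: int, host: str = "localhost") -> dict:
    return {"url": f"http://{host}:{port}/sse", "transport": "sse"}

SPECIALIST_SERVERS = {
    "audio": mcp_params(8080),
    "video": mcp_params(8081),
    "image": mcp_params(8082),
    "doc": mcp_params(8083),
}

# Each specialist agent's crew would be handed its entry from
# SPECIALIST_SERVERS for tool discovery and execution.
print(SPECIALIST_SERVERS["image"])
```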
Please refer to the crewai_mcp.ipynb notebook for detailed instructions and the complete code to build the CrewAI flow described above.
Get a FREE Data Science eBook 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. Subscribe now!
Contributions are welcome! Feel free to fork this repository and submit pull requests with your improvements.
