A unified technical platform enabling a conversational agent to search, explore, and retrieve scientific metadata and resources using a Knowledge Graph (Oxigraph), Vector Search (Elasticsearch), and an LLM (Ollama).
- GraphRAG: Retrieval-Augmented Generation combining vector search + knowledge graph + LLM
- SKG-IF Compliant: Follows the Scientific Knowledge Graphs Interoperability Framework
- Multi-Source Ingestion: arXiv, EuropePMC, PubChem, CopolDB crawlers with resume support
- MCP Server: Model Context Protocol endpoint for LLM tool integration
- OpenWebUI Integration: Ready-to-use functions and pipelines
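To make the GraphRAG idea concrete, here is a minimal, self-contained sketch of the retrieve, enrich, and prompt steps. It is not the project's implementation: the toy documents, 3-dimensional vectors, graph facts, and the function names `retrieve` and `build_prompt` are all made up for illustration, and the final LLM call is left out.

```python
import math

# Toy corpus: id -> (title, embedding). Real embeddings would come from
# SentenceTransformers; these 3-d vectors are invented for illustration.
DOCS = {
    "p1": ("Polymer synthesis with RAFT", [0.9, 0.1, 0.0]),
    "p2": ("Transformer models for chemistry", [0.1, 0.9, 0.2]),
    "p3": ("Copolymer reactivity ratios", [0.8, 0.2, 0.1]),
}

# Toy knowledge graph: id -> list of (predicate, object) facts (invented).
GRAPH = {
    "p1": [("author", "Author A"), ("venue", "Venue X")],
    "p3": [("author", "Author B"), ("topic", "copolymerization")],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, k=2):
    """Vector-search step: return the ids of the k most similar documents."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d][1]), reverse=True)
    return ranked[:k]


def build_prompt(question, query_vec, k=2):
    """Graph step + prompt assembly: attach graph facts to each retrieved hit."""
    lines = []
    for doc_id in retrieve(query_vec, k):
        title = DOCS[doc_id][0]
        facts = "; ".join(f"{p}: {o}" for p, o in GRAPH.get(doc_id, []))
        lines.append(f"- {title} ({facts})" if facts else f"- {title}")
    context = "\n".join(lines)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("What is known about copolymers?", [1.0, 0.0, 0.0]))
```

In the real stack, the similarity ranking would be delegated to Elasticsearch, the facts would come from SPARQL queries against Oxigraph, and the assembled prompt would be sent to Ollama.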
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    OpenWebUI    │────▶│   MCP Server    │────▶│   Backend API   │
│   (Frontend)    │     │    (FastMCP)    │     │    (FastAPI)    │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
         ┌───────────────────────┬───────────────────────┤
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Elasticsearch  │     │    Oxigraph     │     │     Ollama      │
│ (Vector Search) │     │(Knowledge Graph)│     │      (LLM)      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
The system is built on four main pillars:
- Data Layer:
  - Oxigraph: RDF triplestore with SKG-IF ontology (Research Products, Agents, Topics, Venues)
  - Elasticsearch: Vector store for semantic similarity search on abstracts/titles
- Ingestion Layer:
  - Python ETL pipeline (`ingestion/`) with crawlers for arXiv, EuropePMC, PubChem
  - RDF transformation following SKG-IF data model
  - Vector embeddings with SentenceTransformers
- Service Layer:
  - FastAPI (`api/`) providing GraphRAG, SPARQL, and Vector search endpoints
  - Ollama integration for LLM generation
- Agentic Layer:
  - FastMCP (`mcp/`) exposing API functions as tools to LLMs
  - OpenWebUI: Chat interface with GraphRAG tools
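As an illustration of the semantic search performed in the data layer, the sketch below builds an Elasticsearch 8 approximate kNN search body. The field names `embedding`, `title`, and `abstract` are assumptions for this example; the actual index mapping used by the project may differ.

```python
import json


def build_knn_search(query_vector, k=5, field="embedding", num_candidates=50):
    """Build the JSON body for an Elasticsearch 8 kNN `_search` request.

    `field` is a hypothetical dense_vector field name; adapt it to the
    mapping actually created by the ingestion loaders.
    """
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            # Elasticsearch requires num_candidates >= k.
            "num_candidates": max(num_candidates, k),
        },
        # Only return the fields needed to display a hit (assumed names).
        "_source": ["title", "abstract"],
    }


if __name__ == "__main__":
    # A real query vector would be produced by the same SentenceTransformers
    # model that was used at ingestion time.
    body = build_knn_search([0.12, -0.03, 0.88], k=3)
    print(json.dumps(body, indent=2))
```

The body would be sent with `POST /<index>/_search`; the query vector must have the same dimensionality as the indexed embeddings.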
- Docker & Docker Compose installed.
- Ollama running locally on the host machine (default port 11434) for LLM inference (e.g., Llama 3).
  - Note: OpenWebUI is configured to talk to `host.docker.internal:11434`.
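Before starting the stack, it can help to confirm that Ollama is reachable on the host. The following sketch queries Ollama's `/api/tags` endpoint, which lists locally available models; the helper names are illustrative and not part of this repository.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(payload):
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the list of local model names, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return None


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable on port 11434")
    else:
        print("Available models:", ", ".join(models) or "(none pulled yet)")
```

If the list is empty, pull a model with the Ollama CLI before issuing GraphRAG queries.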
- Setup Directories: `./setup.sh`
- Start Services: `./start.sh`
- Access:
  - OpenWebUI: http://localhost:3000
  - API Docs: http://localhost:8000/docs
  - Oxigraph UI: http://localhost:7878
The ingestion service runs automatically on startup (defined in docker-compose.yml). It loads mock data located in ingestion/scripts/run_ingestion.py. To ingest real data, modify this script to connect to your OAI-PMH or FTP sources.
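When replacing the mock data with a real source, each harvested record has to be turned into RDF before loading. The sketch below shows one possible shape for that step using only the standard library. The `EX` and `BASE` namespaces are placeholders, not the official SKG-IF IRIs, and the record keys (`id`, `title`, `doi`) are assumptions; the project's own transformers in `ingestion/src/transformers/` remain the reference.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"
EX = "https://example.org/skg-if#"       # placeholder, NOT the real SKG-IF namespace
BASE = "https://example.org/product/"    # placeholder base IRI for products


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record):
    """Convert one harvested record (a dict) into N-Triples lines."""
    subject = f"<{BASE}{quote(str(record['id']), safe='')}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{EX}ResearchProduct> .",
        f'{subject} <{DCT_TITLE}> "{escape_literal(record["title"])}" .',
    ]
    if record.get("doi"):
        triples.append(
            f'{subject} <{DCT_IDENTIFIER}> "{escape_literal(record["doi"])}" .'
        )
    return triples


if __name__ == "__main__":
    mock = {"id": "mock-1", "title": "A mock paper", "doi": "10.0000/mock"}
    print("\n".join(record_to_ntriples(mock)))
```

In practice rdflib (already part of the stack) would handle serialization and escaping; the manual version above only makes the record-to-triple mapping explicit.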
The MCP Server runs in a separate container but is not automatically "connected" to the LLM inside OpenWebUI unless configured.
- In a production setup, you would configure the LLM runner to attach to the MCP server.
- Currently, the MCP server acts as a standalone tool provider that can be queried.
.
├── api/                    # FastAPI Service (GraphRAG + SPARQL + Vector)
│   └── src/
│       ├── routers/        # API endpoints (rag.py, search.py)
│       └── services/       # Business logic (rag_service, ollama_service, etc.)
├── ingestion/              # Python ETL Pipeline
│   ├── scripts/            # Crawlers and ingestion scripts
│   └── src/
│       ├── extractors/     # Source-specific crawlers (arXiv, EuropePMC, PubChem)
│       ├── transformers/   # RDF transformation (SKG-IF)
│       └── loaders/        # AllegroGraph + Elasticsearch loaders
├── mcp/                    # FastMCP Server (MCP tools for LLMs)
├── openwebui/              # OpenWebUI integration
│   ├── functions/          # GraphRAG tools for OpenWebUI
│   ├── pipelines/          # Auto-augmentation pipeline
│   └── models/             # Custom model configurations
├── data/                   # Persistent storage (created by setup.sh)
├── docker-compose.yml      # Orchestration
├── RUNBOOK.md              # Operations guide
├── setup.sh                # Init script
└── start.sh                # Launch script
| Endpoint | Method | Description |
|---|---|---|
| `/rag/query` | POST | GraphRAG query (vector search + LLM generation) |
| `/rag/health` | GET | Health check for all components |
| `/search/semantic` | POST | Vector similarity search |
| `/search/sparql` | POST | Direct SPARQL query |
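As a rough illustration of calling the GraphRAG endpoint from Python, the sketch below posts a question to `/rag/query`. The JSON field names `question` and `top_k` are assumptions made for this example; check the generated API docs at http://localhost:8000/docs for the actual request schema.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def build_rag_request(question, top_k=5, base_url=API_BASE):
    """Build a POST request for /rag/query.

    The body fields `question` and `top_k` are hypothetical; the real
    schema is defined by the FastAPI service.
    """
    payload = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rag/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, top_k=5, timeout=60.0):
    """Send the question and return the decoded JSON response."""
    request = build_rag_request(question, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(ask("Which papers discuss copolymer reactivity ratios?"))
```

The generous timeout reflects that LLM generation through Ollama can take much longer than the vector search itself.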
The system uses the Scientific Knowledge Graphs Interoperability Framework (SKG-IF) for RDF metadata:
| Entity | Description |
|---|---|
| `ResearchProduct` | Publications, datasets, software |
| `Agent` (Person/Organisation) | Authors, institutions |
| `Topic` | Keywords, categories |
| `Venue` | Journals, conferences |
| `DataSource` | Provenance (arXiv, EuropePMC, PubChem) |
| `Contribution` | Author roles with ranking |
| `Manifestation` | Access URLs (HTML, PDF) |
| `Identifier` | DOI, PMID, arXiv ID |
See: https://skg-if.github.io/
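To see which entity types have actually been loaded, a generic SPARQL query can be sent to the triplestore. The sketch below assumes the Oxigraph container exposes its standard SPARQL endpoint at `/query` on port 7878; the query deliberately avoids SKG-IF class IRIs, so it makes no assumption about the exact ontology namespace.

```python
import json
import urllib.request

# Count instances per rdf:type, most frequent first.
COUNT_BY_TYPE = """
SELECT ?type (COUNT(?s) AS ?n)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?n)
"""


def build_sparql_request(query, endpoint="http://localhost:7878/query"):
    """Build a SPARQL 1.1 Protocol POST request (endpoint URL is an assumption)."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


if __name__ == "__main__":
    with urllib.request.urlopen(build_sparql_request(COUNT_BY_TYPE), timeout=10) as resp:
        for row in rows(json.load(resp)):
            print(row["n"], row["type"])
```

The same query text could also be submitted through the API's `/search/sparql` endpoint, whose request format is documented in the API docs.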
- Python 3.11
- Oxigraph (RDF Triplestore)
- Elasticsearch v8 (Vector Search)
- FastAPI (REST API)
- FastMCP (Model Context Protocol)
- Ollama (LLM inference)
- OpenWebUI (Chat interface)
- SentenceTransformers (Embeddings)
- rdflib (RDF processing)
See openwebui/README.md for detailed instructions on:
- Importing GraphRAG tools
- Configuring the RAG pipeline
- Creating a custom SciChat model
Apache 2.0