Artemis-IA/scichat-semantic-platform

SciChat Semantic AI Platform

A unified technical platform that lets a conversational agent search, explore, and retrieve scientific metadata and resources using a knowledge graph (Oxigraph), vector search (Elasticsearch), and an LLM (Ollama).

✨ Features

  • GraphRAG: Retrieval-Augmented Generation combining vector search + knowledge graph + LLM
  • SKG-IF Compliant: Follows the Scientific Knowledge Graphs Interoperability Framework
  • Multi-Source Ingestion: arXiv, EuropePMC, PubChem, CopolDB crawlers with resume support
  • MCP Server: Model Context Protocol endpoint for LLM tool integration
  • OpenWebUI Integration: Ready-to-use functions and pipelines

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   OpenWebUI     │────▶│   MCP Server    │────▶│   Backend API   │
│   (Frontend)    │     │   (FastMCP)     │     │   (FastAPI)     │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                        ┌────────────────────────────────┼────────────────────────────────┐
                        │                                │                                │
                        ▼                                ▼                                ▼
               ┌─────────────────┐              ┌─────────────────┐              ┌─────────────────┐
               │  Elasticsearch  │              │    Oxigraph     │              │     Ollama      │
               │ (Vector Search) │              │(Knowledge Graph)│              │     (LLM)       │
               └─────────────────┘              └─────────────────┘              └─────────────────┘

The system is built on four main pillars:

  1. Data Layer:
    • Oxigraph: RDF Triplestore with SKG-IF ontology (Research Products, Agents, Topics, Venues)
    • Elasticsearch: Vector store for semantic similarity search on abstracts/titles
  2. Ingestion Layer:
    • Python ETL pipeline (ingestion/) with crawlers for arXiv, EuropePMC, PubChem
    • RDF transformation following SKG-IF data model
    • Vector embeddings with SentenceTransformers
  3. Service Layer:
    • FastAPI (api/) providing GraphRAG, SPARQL, and Vector search endpoints
    • Ollama integration for LLM generation
  4. Agentic Layer:
    • FastMCP (mcp/) exposing API functions as tools to LLMs
    • OpenWebUI: Chat interface with GraphRAG tools
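As a rough illustration of how the service layer might combine the two data stores before calling the LLM, the sketch below merges vector-search hits with facts looked up in the knowledge graph into a single prompt. The `Hit` class, `build_prompt`, and the shape of `facts` are hypothetical names chosen for this example; they are not taken from the repository's `rag_service`.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One vector-search result (hypothetical shape, for illustration only)."""
    product_id: str
    title: str
    score: float


def build_prompt(question: str, hits: list[Hit],
                 facts: dict[str, list[str]], top_k: int = 3) -> str:
    """Merge the best vector hits with their graph facts into one LLM prompt."""
    # Keep only the top_k most similar hits.
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
    lines = ["Answer using only the context below.", "", "Context:"]
    for hit in ranked:
        lines.append(f"- {hit.title} ({hit.product_id})")
        # Graph facts (authors, venue, topics) enrich each retrieved product.
        for fact in facts.get(hit.product_id, []):
            lines.append(f"  * {fact}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

In this sketch the vector store decides which products are relevant and the graph supplies structured context for them; the real pipeline may order or weight these steps differently.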

Prerequisites

  • Docker & Docker Compose installed.
  • Ollama running locally on the host machine (default port 11434) for LLM inference (e.g., Llama 3).
    • Note: OpenWebUI is configured to talk to host.docker.internal:11434.
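Before starting the stack, it can help to confirm that Ollama is reachable on the expected port. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. The helper names `ollama_url` and `list_ollama_models` are made up for this example.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def ollama_url(host: str = "localhost", port: int = 11434,
               path: str = "/api/tags") -> str:
    """Build the URL of an Ollama HTTP endpoint."""
    return f"http://{host}:{port}{path}"


def list_ollama_models(url: str, timeout: float = 3.0) -> Optional[list[str]]:
    """Return the names of local models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

From the host, call `list_ollama_models(ollama_url())`. From inside a container, pass `host.docker.internal` as the host, matching the OpenWebUI configuration noted above.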

Quick Start

  1. Setup Directories:

    ./setup.sh
  2. Start Services:

    ./start.sh
  3. Access:

Configuration

Ingestion

The ingestion service runs automatically on startup (defined in docker-compose.yml). By default it loads the mock data defined in ingestion/scripts/run_ingestion.py. To ingest real data, modify this script to connect to your OAI-PMH or FTP sources.
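To give a feel for the RDF transformation step, here is a minimal, dependency-free sketch that turns one mock record into N-Triples lines. It is not the project's transformer: the project lists rdflib for RDF processing, while this example uses plain string formatting, and the `EX` namespace, the predicate names, and the record fields are all placeholders rather than real SKG-IF IRIs.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/skgif/"  # placeholder namespace, NOT the real SKG-IF IRIs


def _literal(value: str) -> str:
    """Serialize a Python string as an escaped N-Triples literal."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def record_to_ntriples(record: dict) -> list[str]:
    """Map a mock record (hypothetical fields: id, title, topics) to triples."""
    subject = f"<{EX}product/{record['id']}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{EX}ResearchProduct> .",
        f"{subject} <{EX}title> {_literal(record['title'])} .",
    ]
    # One triple per topic keyword.
    for topic in record.get("topics", []):
        triples.append(f"{subject} <{EX}hasTopic> {_literal(topic)} .")
    return triples
```

A real source connector would produce the same kind of record dictionaries from OAI-PMH or FTP payloads and hand them to the transformer.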

MCP & LLM

The MCP Server runs in a separate container, but it is not automatically connected to the LLM inside OpenWebUI unless you configure that connection.

  • In a production setup, you would configure the LLM runner to attach to the MCP server.
  • Currently, the MCP server acts as a standalone tool provider that can be queried.

Project Structure

.
├── api/                 # FastAPI Service (GraphRAG + SPARQL + Vector)
│   └── src/
│       ├── routers/     # API endpoints (rag.py, search.py)
│       └── services/    # Business logic (rag_service, ollama_service, etc.)
├── ingestion/           # Python ETL Pipeline
│   ├── scripts/         # Crawlers and ingestion scripts
│   └── src/
│       ├── extractors/  # Source-specific crawlers (arXiv, EuropePMC, PubChem)
│       ├── transformers/# RDF transformation (SKG-IF)
│       └── loaders/     # Oxigraph + Elasticsearch loaders
├── mcp/                 # FastMCP Server (MCP tools for LLMs)
├── openwebui/           # OpenWebUI integration
│   ├── functions/       # GraphRAG tools for OpenWebUI
│   ├── pipelines/       # Auto-augmentation pipeline
│   └── models/          # Custom model configurations
├── data/                # Persistent storage (created by setup.sh)
├── docker-compose.yml   # Orchestration
├── RUNBOOK.md           # Operations guide
├── setup.sh             # Init script
└── start.sh             # Launch script

API Endpoints

Endpoint          Method  Description
/rag/query        POST    GraphRAG query (vector search + LLM generation)
/rag/health       GET     Health check for all components
/search/semantic  POST    Vector similarity search
/search/sparql    POST    Direct SPARQL query
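A client call to the GraphRAG endpoint might look like the sketch below. The endpoint path and method come from the table above. The base URL (`http://localhost:8000`) and the request body field (`question`) are assumptions, since this README does not document the port or the request schema; check the FastAPI-generated docs of the running service for the actual model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the port is not stated in this README


def build_rag_request(question: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare a POST /rag/query request (the body schema is hypothetical)."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rag/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Usage, once the stack is running: `send(build_rag_request("Which papers discuss copolymers?"))`. Building the request separately from sending it makes the payload easy to inspect without a live backend.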

SKG-IF Ontology

The system uses the Scientific Knowledge Graphs Interoperability Framework (SKG-IF) for RDF metadata:

Entity                       Description
ResearchProduct              Publications, datasets, software
Agent (Person/Organisation)  Authors, institutions
Topic                        Keywords, categories
Venue                        Journals, conferences
DataSource                   Provenance (arXiv, EuropePMC, PubChem)
Contribution                 Author roles with ranking
Manifestation                Access URLs (HTML, PDF)
Identifier                   DOI, PMID, arXiv ID

See: https://skg-if.github.io/

Technologies

  • Python 3.11
  • Oxigraph (RDF Triplestore)
  • Elasticsearch v8 (Vector Search)
  • FastAPI (REST API)
  • FastMCP (Model Context Protocol)
  • Ollama (LLM inference)
  • OpenWebUI (Chat interface)
  • SentenceTransformers (Embeddings)
  • rdflib (RDF processing)

OpenWebUI Integration

See openwebui/README.md for detailed instructions on:

  • Importing GraphRAG tools
  • Configuring the RAG pipeline
  • Creating a custom SciChat model

License

Apache 2.0
