Arkhe

From the Greek ἀρχή — origin, beginning, source.

A self-hostable, lightweight scientific data repository built for research groups, university departments, and teaching environments that need FAIR-compliant data sharing without the operational overhead of a full-scale platform.


Background

Zenodo is a great platform — it lets researchers publish datasets, software, and papers with a permanent DOI, completely free. I use it and think it does its job well. InvenioRDM is the open-source software that powers it, and it is a serious, well-maintained system built by a team at CERN.

The problem I kept running into is that neither is easy to self-host. InvenioRDM requires Kubernetes or a complex Docker setup, at least 16 GB of RAM, and a fair amount of configuration just to get a first record in. I went through their documentation, looked at GitHub issues, and read threads in communities like the HEP Software Foundation — and the same question kept coming up: is there something simpler, that I can just run on my own server?

The situations people described were practical ones:

  • A university department that wants its own repository for student work, without sending data to an external service
  • A research group that can't upload pre-publication data to CERN-operated infrastructure due to compliance rules
  • A summer school or workshop where participants need to share datasets and code for a few weeks
  • A lab that works in a restricted or offline network environment

I built Arkhe to cover that gap — something that works like Zenodo for a small group, runs with a single docker compose up, and doesn't need a dedicated sysadmin to maintain.


What it does

  • ORCID authentication — researchers sign in with their existing orcid.org identity; no new accounts or passwords
  • File upload up to 2 GB — CSV, JSON, PDF, HDF5, and ROOT (uproot) files supported; metadata extracted automatically in the background
  • Full-text search — OpenSearch-backed with facets (experiment, record type, year) and prefix autocomplete
  • FAIR-compliant metadata — every record has a GET /records/{id}/metadata.json endpoint returning schema.org Dataset JSON-LD (application/ld+json)
  • Presigned download URLs — public files downloadable without credentials via MinIO presigned URLs
  • Background processing — Celery workers handle file parsing and search indexing asynchronously so uploads return immediately
  • Observability — structured JSON logs (structlog), Prometheus metrics at /metrics, and deep health checks at /health/ready
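To make the FAIR metadata bullet concrete, here is a minimal sketch of what a schema.org Dataset JSON-LD document for a record could look like. It is not Arkhe's actual serializer: the `record_to_jsonld` helper and the input field names (`id`, `title`, `description`, `creators`, `published`) are assumptions for illustration, and the real endpoint may expose more or different properties.

```python
import json


def record_to_jsonld(record: dict) -> dict:
    """Map a (hypothetical) record dict to a schema.org Dataset JSON-LD document."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": record["id"],
        "name": record["title"],
        "description": record.get("description", ""),
        # Creators are identified by their ORCID iD, expressed as a URL.
        "creator": [
            {
                "@type": "Person",
                "name": creator["name"],
                "@id": f"https://orcid.org/{creator['orcid']}",
            }
            for creator in record.get("creators", [])
        ],
    }
    # Only emit datePublished when the record carries a publication date.
    if "published" in record:
        doc["datePublished"] = record["published"]
    return doc


example_record = {
    "id": "example-record-1",
    "title": "Example calibration dataset",
    "creators": [{"name": "Josiah Carberry", "orcid": "0000-0002-1825-0097"}],
    "published": "2024-01-15",
}
print(json.dumps(record_to_jsonld(example_record), indent=2))
```

A client would expect a document of this general shape from GET /records/{id}/metadata.json, served as application/ld+json.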

Architecture

Browser
  └── Nginx :80
        ├── /api/*  ──► FastAPI (uvicorn)
        │                  ├── PostgreSQL  — users, records, file metadata
        │                  ├── MinIO       — file blobs (S3-compatible)
        │                  ├── OpenSearch  — full-text search index
        │                  └── Redis  ◄──  Celery workers
        │                                   (file parsing, index updates)
        └── /*      ──► React SPA (static, built into nginx image)

Nine services total, all defined in docker-compose.yml. The nginx image is a multi-stage build that compiles the React frontend and copies the static output — no separate frontend container at runtime.


Quick start

Prerequisites: Docker and Docker Compose (v2).

Option A — pre-built images (recommended)

No git clone or build step required.

curl -O https://raw.githubusercontent.com/KaranSinghDev/Arkhe-Open-Data-Archive/main/docker-compose.hub.yml
curl -O https://raw.githubusercontent.com/KaranSinghDev/Arkhe-Open-Data-Archive/main/.env.example
cp .env.example .env
# Edit .env — set ORCID_CLIENT_ID, ORCID_CLIENT_SECRET, SECRET_KEY,
#             POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, ORCID_REDIRECT_URI
docker compose -f docker-compose.hub.yml up -d
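One way to produce the random values for SECRET_KEY, POSTGRES_PASSWORD, and MINIO_ROOT_PASSWORD is Python's standard `secrets` module. This is a convenience sketch, not something shipped with Arkhe; the `generate_secret` helper and the 48-byte length are arbitrary choices.

```python
import secrets


def generate_secret(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from num_bytes of OS randomness."""
    return secrets.token_urlsafe(num_bytes)


# Print lines that can be pasted into .env (a fresh value for each variable).
for name in ("SECRET_KEY", "POSTGRES_PASSWORD", "MINIO_ROOT_PASSWORD"):
    print(f"{name}={generate_secret()}")
```

The output uses only letters, digits, `-`, and `_`, so the generated password can be embedded in a connection URL without escaping.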

Option B — build from source

git clone https://github.com/KaranSinghDev/Arkhe-Open-Data-Archive.git
cd Arkhe-Open-Data-Archive
cp .env.example .env
docker compose up --build

Open http://localhost. API docs: http://localhost/api/docs.

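Register for free ORCID Public API credentials at orcid.org/developer-tools and set the redirect URI to http://localhost/auth/callback (or the equivalent on your domain). Do this before starting the stack: ORCID_CLIENT_ID, ORCID_CLIENT_SECRET, and ORCID_REDIRECT_URI in .env all come from this registration.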


Development setup

# Backend — Python 3.12+
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload        # http://localhost:8000

# Frontend — Node 20+
cd frontend
npm install
npm run dev                          # http://localhost:5173 (proxies /api to :8000)

# Tests
cd backend && pytest -x -q

# Lint
cd backend && ruff check app/
cd frontend && npx tsc --noEmit

Configuration

All settings are environment variables. Copy .env.example and edit.

| Variable | Required | Description |
| --- | --- | --- |
| DATABASE_URL | Yes | postgresql+asyncpg://user:pass@postgres:5432/db |
| SECRET_KEY | Yes | JWT signing secret; use a long random string |
| ORCID_CLIENT_ID | Yes | From orcid.org/developer-tools |
| ORCID_CLIENT_SECRET | Yes | From orcid.org/developer-tools |
| ORCID_REDIRECT_URI | Yes | Must match what ORCID has on record |
| REDIS_URL | No | Default: redis://redis:6379/0 |
| MINIO_ROOT_USER / MINIO_ROOT_PASSWORD | No | Default in .env.example |
| LOG_LEVEL | No | INFO (default), DEBUG, WARNING |
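A pre-flight check along these lines can catch a missing variable before the stack starts. It is a standalone sketch based only on the table above, not Arkhe's actual settings loader; `load_settings` is a hypothetical helper.

```python
import os

# Names and defaults are taken from the configuration table.
REQUIRED = (
    "DATABASE_URL",
    "SECRET_KEY",
    "ORCID_CLIENT_ID",
    "ORCID_CLIENT_SECRET",
    "ORCID_REDIRECT_URI",
)
DEFAULTS = {
    "REDIS_URL": "redis://redis:6379/0",
    "LOG_LEVEL": "INFO",
}


def load_settings(env=None) -> dict:
    """Return required settings plus defaults, or raise if any required one is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings


# Example with an explicit mapping instead of the real process environment.
demo_env = {name: "placeholder" for name in REQUIRED}
print(load_settings(demo_env)["REDIS_URL"])
```

Calling `load_settings()` with no argument reads the real process environment, which is the form you would use inside a container.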

Production deployment

docker compose -f docker-compose.yml -f docker-compose.prod.yml up --build -d

The prod override removes volume bind-mounts, runs 4 Uvicorn workers, sets memory/CPU limits, and adds restart: unless-stopped to all services.
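After bringing the stack up detached, a deployment script may want to wait until the readiness check passes before continuing. The sketch below polls a URL with the standard library only. The exact address is an assumption: the feature list names a /health/ready check, but whether Nginx exposes it at http://localhost/health/ready or under /api depends on the deployment. `is_ready` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(check, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False


# Usage (URL is an assumption, adjust to your deployment):
# ok = wait_until_ready(lambda: is_ready("http://localhost/health/ready"))
```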


Comparison with alternatives

| | Arkhe | InvenioRDM | Zenodo (hosted) |
| --- | --- | --- | --- |
| Self-hostable | Yes | Yes | No |
| Setup time | ~5 min | 2–8 hours | N/A (SaaS) |
| Min RAM (single node) | 2–4 GB | ~16 GB | N/A |
| Services required | 9 | 30+ | N/A |
| Kubernetes required | No | Recommended | N/A |
| DOI minting (DataCite) | No | Yes | Yes |
| Record versioning | No | Yes | Yes |
| Communities / collections | No | Yes | Yes |
| Access control (embargoes, restricted) | No | Yes | Yes |
| FAIR JSON-LD metadata | Yes | Yes | Yes |
| ORCID login | Yes | Yes | Yes |
| Full-text search | Yes (OpenSearch) | Yes (Elasticsearch) | Yes |
| File size limit | 2 GB (configurable) | Configurable | 50 GB |
| Background file parsing | Yes (CSV/JSON/PDF/ROOT) | Plugin-based | No |

Resource usage estimates

These figures are from a single measurement on one laptop (Intel i7, 12 GB RAM, WSL2) at idle with no records. WSL2 adds overhead, and memory usage will vary with JVM tuning, OS, and load — treat them as rough lower bounds, not guarantees.

| Container | Measured | Expected range |
| --- | --- | --- |
| backend (FastAPI + uvicorn) | 122 MiB | 100–250 MiB |
| celery-worker | 91 MiB | 80–200 MiB |
| celery-flower | 51 MiB | 40–100 MiB |
| minio | 81 MiB | 60–200 MiB |
| postgres | 59 MiB | 50–500 MiB |
| redis | 7 MiB | 5–50 MiB |
| opensearch | ~1 GiB | 1–2 GiB |
| Total | ~1.4 GiB | ~1.5–3 GiB |

OpenSearch is the dominant cost and its JVM heap is the main variable. With the default settings in docker-compose.yml (-Xms512m -Xmx512m), it stays around 1 GiB resident. On a loaded system or with a large index, plan for 2 GiB just for OpenSearch.
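As a quick check on the measured total, the per-container figures can be summed directly. The only assumption here is that the "~1 GiB" OpenSearch reading is taken as exactly 1024 MiB.

```python
# Measured idle memory per container, in MiB (values from the table above).
measured_mib = {
    "backend": 122,
    "celery-worker": 91,
    "celery-flower": 51,
    "minio": 81,
    "postgres": 59,
    "redis": 7,
    "opensearch": 1024,  # assumption: "~1 GiB" taken as exactly 1024 MiB
}

total_mib = sum(measured_mib.values())
total_gib = total_mib / 1024
opensearch_share = measured_mib["opensearch"] / total_mib

print(f"total: {total_mib} MiB = {total_gib:.1f} GiB")
print(f"opensearch share: {opensearch_share:.0%}")
```

OpenSearch accounts for roughly 70% of the idle footprint, which is why its heap setting is the first thing to adjust on a memory-constrained host.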

Search latency (p50 / p95, loopback inside the container, 200 requests at concurrency 10 — no network overhead included):

| Endpoint | p50 | p95 |
| --- | --- | --- |
| GET /api/records | ~10 ms | ~25 ms |
| GET /api/search?q= | ~15 ms | ~75 ms |

Add 10–50 ms for LAN and 50–150 ms for WAN depending on your deployment. Numbers will increase under concurrent real load; these were measured with an otherwise idle stack.
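To reproduce figures like these on your own deployment, the percentile step can be sketched as follows. This is not the script used for the numbers above: it times calls sequentially rather than at concurrency 10, and `time_calls` and `latency_percentiles` are hypothetical helper names.

```python
import statistics
import time


def time_calls(fn, n: int = 200) -> list:
    """Call fn() n times and return each call's wall-clock duration in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def latency_percentiles(samples_ms) -> tuple:
    """Return (p50, p95) of the samples using inclusive interpolation."""
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


# Stand-in workload; replace the lambda with a real HTTP request to your instance.
p50, p95 = latency_percentiles(time_calls(lambda: sum(range(1000)), n=200))
print(f"p50={p50:.3f} ms  p95={p95:.3f} ms")
```

Because p95 depends on the slowest few requests, it moves far more between runs than p50 does. Repeat the measurement several times before comparing deployments.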


Honest scope

Pre-built Docker images are available on Docker Hub, so no build step is needed:

docker pull karandev7/arkhe-backend:latest
docker pull karandev7/arkhe-frontend:latest

See the Quick start section for the full setup using the pre-built images.

Arkhe works well if you:

  • Want your own private repository that stays on your servers
  • Are at an institution where data can't go to external services
  • Are running a course, workshop, or summer school and need somewhere for participants to upload work
  • Need something running quickly without a complex setup
  • Work in an offline or restricted network environment

Arkhe is not the right tool if you:

  • Want a public, permanently archived record with a DOI — just use Zenodo directly, it's free and built for that
  • Need record versioning, embargo controls, or community curation — InvenioRDM handles all of that and is the right choice for a serious institutional repository
  • Expect to grow to tens of thousands of records — Arkhe's OpenSearch and PostgreSQL setup is intentionally minimal and hasn't been tuned for large scale

I built this to fill a specific gap, not to compete with Zenodo or InvenioRDM. If either of those fits your situation, use them.


License

MIT — see LICENSE.
