From the Greek ἀρχή — origin, beginning, source.
A self-hostable, lightweight scientific data repository built for research groups, university departments, and teaching environments that need FAIR-compliant data sharing without the operational overhead of a full-scale platform.
Zenodo is a great platform — it lets researchers publish datasets, software, and papers with a permanent DOI, completely free. I use it and think it does its job well. InvenioRDM is the open-source software that powers it, and it is a serious, well-maintained system built by a team at CERN.
The problem I kept running into is that neither is easy to self-host. InvenioRDM requires Kubernetes or a complex Docker setup, at least 16 GB of RAM, and a fair amount of configuration just to get a first record in. I went through their documentation, looked at GitHub issues, and read threads in communities like the HEP Software Foundation — and the same question kept coming up: is there something simpler, that I can just run on my own server?
The situations people described were practical ones:
- A university department that wants its own repository for student work, without sending data to an external service
- A research group that can't upload pre-publication data to CERN-operated infrastructure due to compliance rules
- A summer school or workshop where participants need to share datasets and code for a few weeks
- A lab that works in a restricted or offline network environment
I built Arkhe to cover that gap — something that works like Zenodo for a small group, runs with a single docker compose up, and doesn't need a dedicated sysadmin to maintain.
- ORCID authentication — researchers sign in with their existing orcid.org identity; no new accounts or passwords
- File upload up to 2 GB — CSV, JSON, PDF, HDF5, and ROOT (uproot) files supported; metadata extracted automatically in the background
- Full-text search — OpenSearch-backed with facets (experiment, record type, year) and prefix autocomplete
- FAIR-compliant metadata — every record has a
GET /records/{id}/metadata.jsonendpoint returning schema.orgDatasetJSON-LD (application/ld+json) - Presigned download URLs — public files downloadable without credentials via MinIO presigned URLs
- Background processing — Celery workers handle file parsing and search indexing asynchronously so uploads return immediately
- Observability — structured JSON logs (structlog), Prometheus metrics at
/metrics, and deep health checks at/health/ready
Browser
└── Nginx :80
├── /api/* ──► FastAPI (uvicorn)
│ ├── PostgreSQL — users, records, file metadata
│ ├── MinIO — file blobs (S3-compatible)
│ ├── OpenSearch — full-text search index
│ └── Redis ◄── Celery workers
│ (file parsing, index updates)
└── /* ──► React SPA (static, built into nginx image)
Nine services total, all defined in docker-compose.yml. The nginx image is a multi-stage build that compiles the React frontend and copies the static output — no separate frontend container at runtime.
Prerequisites: Docker and Docker Compose (v2).
No git clone or build step required.
curl -O https://raw.githubusercontent.com/KaranSinghDev/Arkhe-Open-Data-Archive/main/docker-compose.hub.yml
curl -O https://raw.githubusercontent.com/KaranSinghDev/Arkhe-Open-Data-Archive/main/.env.example
cp .env.example .env
# Edit .env — set ORCID_CLIENT_ID, ORCID_CLIENT_SECRET, SECRET_KEY,
# POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, ORCID_REDIRECT_URI
docker compose -f docker-compose.hub.yml up -dgit clone https://github.com/KaranSinghDev/Arkhe-Open-Data-Archive.git
cd arkhe
cp .env.example .env
docker compose up --buildOpen http://localhost. API docs: http://localhost/api/docs.
Register a free ORCID Public API at orcid.org/developer-tools. Set the redirect URI to http://localhost/auth/callback (or your domain).
# Backend — Python 3.12+
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload # http://localhost:8000
# Frontend — Node 20+
cd frontend
npm install
npm run dev # http://localhost:5173 (proxies /api to :8000)# Tests
cd backend && pytest -x -q
# Lint
cd backend && ruff check app/
cd frontend && npx tsc --noEmitAll settings are environment variables. Copy .env.example and edit.
| Variable | Required | Description |
|---|---|---|
DATABASE_URL |
Yes | postgresql+asyncpg://user:pass@postgres:5432/db |
SECRET_KEY |
Yes | JWT signing secret — use a long random string |
ORCID_CLIENT_ID |
Yes | From orcid.org/developer-tools |
ORCID_CLIENT_SECRET |
Yes | From orcid.org/developer-tools |
ORCID_REDIRECT_URI |
Yes | Must match what ORCID has on record |
REDIS_URL |
No | Default: redis://redis:6379/0 |
MINIO_ROOT_USER / _PASSWORD |
No | Default in .env.example |
LOG_LEVEL |
No | INFO (default), DEBUG, WARNING |
docker compose -f docker-compose.yml -f docker-compose.prod.yml up --build -dThe prod override removes volume bind-mounts, runs 4 Uvicorn workers, sets memory/CPU limits, and adds restart: unless-stopped to all services.
| Arkhe | InvenioRDM | Zenodo (hosted) | |
|---|---|---|---|
| Self-hostable | Yes | Yes | No |
| Setup time | ~5 min | 2–8 hours | N/A (SaaS) |
| Min RAM (single node) | 2–4 GB | ~16 GB | N/A |
| Services required | 9 | 30+ | N/A |
| Kubernetes required | No | Recommended | N/A |
| DOI minting (DataCite) | No | Yes | Yes |
| Record versioning | No | Yes | Yes |
| Communities / collections | No | Yes | Yes |
| Access control (embargoes, restricted) | No | Yes | Yes |
| FAIR JSON-LD metadata | Yes | Yes | Yes |
| ORCID login | Yes | Yes | Yes |
| Full-text search | Yes (OpenSearch) | Yes (Elasticsearch) | Yes |
| File size limit | 2 GB (configurable) | Configurable | 50 GB |
| Background file parsing | Yes (CSV/JSON/PDF/ROOT) | Plugin-based | No |
These figures are from a single measurement on one laptop (Intel i7, 12 GB RAM, WSL2) at idle with no records. WSL2 adds overhead, and memory usage will vary with JVM tuning, OS, and load — treat them as rough lower bounds, not guarantees.
| Container | Measured | Expected range |
|---|---|---|
| backend (FastAPI + uvicorn) | 122 MiB | 100–250 MiB |
| celery-worker | 91 MiB | 80–200 MiB |
| celery-flower | 51 MiB | 40–100 MiB |
| minio | 81 MiB | 60–200 MiB |
| postgres | 59 MiB | 50–500 MiB |
| redis | 7 MiB | 5–50 MiB |
| opensearch | ~1 GiB | 1–2 GiB |
| Total | ~1.4 GiB | ~1.5–3 GiB |
OpenSearch is the dominant cost and its JVM heap is the main variable. With the default settings in docker-compose.yml (-Xms512m -Xmx512m), it stays around 1 GiB resident. On a loaded system or with a large index, plan for 2 GB just for OpenSearch.
Search latency (p50 / p95, loopback inside the container, 200 requests at concurrency 10 — no network overhead included):
| Endpoint | p50 | p95 |
|---|---|---|
GET /api/records |
~10 ms | ~25 ms |
GET /api/search?q= |
~15 ms | ~75 ms |
Add 10–50 ms for LAN and 50–150 ms for WAN depending on your deployment. Numbers will increase under concurrent real load; these were measured with an otherwise idle stack.
A pre-built Docker image is available on Docker Hub — no build step needed:
docker pull karandev7/arkhe-backend:latest
docker pull karandev7/arkhe-frontend:latestSee the Quick start section for the full setup using the pre-built images.
Arkhe works well if you:
- Want your own private repository that stays on your servers
- Are at an institution where data can't go to external services
- Are running a course, workshop, or summer school and need somewhere for participants to upload work
- Need something running quickly without a complex setup
- Work in an offline or restricted network environment
Arkhe is not the right tool if you:
- Want a public, permanently archived record with a DOI — just use Zenodo directly, it's free and built for that
- Need record versioning, embargo controls, or community curation — InvenioRDM handles all of that and is the right choice for a serious institutional repository
- Expect to grow to tens of thousands of records — Arkhe's OpenSearch and PostgreSQL setup is intentionally minimal and hasn't been tuned for large scale
I built this to fill a specific gap, not to compete with Zenodo or InvenioRDM. If either of those fits your situation, use them.
- Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
- InvenioRDM documentation — https://inveniordm.docs.cern.ch
- InvenioRDM system requirements — https://inveniordm.docs.cern.ch/install/requirements/
- Zenodo — https://zenodo.org
- ORCID Public API — https://info.orcid.org/documentation/features/public-api/
- OpenSearch documentation — https://opensearch.org/docs/latest/
- MinIO documentation — https://min.io/docs/minio/linux/index.html
- HEP Software Foundation — https://hepsoftwarefoundation.org
- schema.org Dataset — https://schema.org/Dataset
MIT — see LICENSE.