Permissioned dataset interface API. Exposes governed SQL query access to PostgreSQL-backed datasets, a DCAT-AP 3.0 catalogue, and a CLI for governance/lineage import-export workflows.
Two-database design:
- Catalogue DB: stores the `datasets_entries` table (metadata, access levels, DCAT fields). Managed by Alembic migrations. Schema name controlled by `CATALOGUE_SCHEMA` (default `dataset_api`).
- Datasets DB: holds the actual data tables. No ORM models; tables are reflected at runtime via SQLAlchemy `MetaData.reflect()` with PostGIS geometry support (`db/reflection.py`).
Entry point: src/celine/dataset/main.py → create_app() factory. Routers are discovered automatically from routes/*.py files exporting a router variable.
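Router auto-discovery of this kind is commonly built on `pkgutil` and `importlib`. The sketch below is an assumption about the general technique, not the actual code in `create_app()`; `discover_routers` is a hypothetical helper name.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import Any, List


def discover_routers(package: ModuleType) -> List[Any]:
    """Return the `router` attribute of every module directly inside `package`.

    Modules that do not export a `router` variable are skipped.
    """
    routers: List[Any] = []
    # Sort by module name so the registration order is deterministic.
    modules = sorted(pkgutil.iter_modules(package.__path__), key=lambda m: m.name)
    for info in modules:
        module = importlib.import_module(f"{package.__name__}.{info.name}")
        router = getattr(module, "router", None)
        if router is not None:
            routers.append(router)
    return routers
```

In a FastAPI application, each discovered object would then typically be registered with `app.include_router(router)` inside the factory.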
POST /query accepts {"sql": "...", "limit": N, "offset": N}. The pipeline:
- Parse (`api/dataset_query/parser.py`): sqlglot AST validation against an allowlist of expressions and functions. Rejects DML, statement stacking, comments, disallowed functions. The parser uses a Tuple/IN allowlist and depth checks.
- Governance: resolves referenced tables to `DatasetEntry` records, enforces access level via OPA policies evaluated in-process (`policies/celine/dataset.rego`).
- Row filters (`api/dataset_query/row_filters/`): pluggable row-level access control. Handlers registered via the `ROW_FILTERS_MODULES` setting. Built-in handlers: `direct_user_match`, `rec_registry`, `http_in_list`, `table_pointer`.
- Execute (`api/dataset_query/executor.py`): runs the rewritten SQL with a `statement_timeout` guard. Limits clamped to `MAX_LIMIT=10000`.
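The limit clamping in the Execute step can be illustrated with a small sketch. It assumes that a missing limit falls back to the maximum and that negative values are floored at zero; the source confirms neither behaviour, and `normalize_paging` is a hypothetical helper, not the executor's actual function.

```python
from typing import Optional, Tuple

MAX_LIMIT = 10_000  # documented cap on rows per request


def normalize_paging(limit: Optional[int], offset: Optional[int]) -> Tuple[int, int]:
    """Clamp limit into [0, MAX_LIMIT] and floor offset at 0 (assumed behaviour)."""
    effective_limit = MAX_LIMIT if limit is None else max(0, min(limit, MAX_LIMIT))
    effective_offset = 0 if offset is None else max(0, offset)
    return effective_limit, effective_offset


# Example request body for POST /query; the table name is made up.
payload = {"sql": "SELECT * FROM some_table", "limit": 50_000, "offset": 0}
limit, offset = normalize_paging(payload["limit"], payload["offset"])
print(limit, offset)  # 10000 0
```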
SQL parser allowlist: when adding support for new SQL constructs, add the sqlglot exp.* type to ALLOWED_EXPRESSIONS in parser.py. For new SQL functions, add the lowercase name to ALLOWED_FUNCTIONS.
Three layers:
- Authentication: JWT via `celine-sdk` OIDC. Dependencies: `get_current_user()` (required) / `get_optional_user()` (optional).
- Access levels (per dataset): `open`, `internal`, `restricted`, `secret`. Stored in `DatasetEntry.access_level`.
- OPA policy: `policies/celine/dataset.rego` evaluates subject type (user/service/anonymous), roles, groups, scopes against access level. Admin scope (`X.admin`) matches all `X.*` required scopes.
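The admin-scope rule is implemented in Rego, but its matching semantics can be sketched in Python. The scope names below are hypothetical, and the function is an illustration of the stated rule (`X.admin` satisfies any `X.*` requirement), not a port of the policy.

```python
from typing import Iterable


def scope_satisfies(granted: str, required: str) -> bool:
    """True if `granted` equals `required`, or is the admin scope of the same prefix."""
    if granted == required:
        return True
    prefix, _, action = granted.rpartition(".")
    # "X.admin" covers "X.<anything>", but not a different prefix such as "Xs.<anything>".
    return bool(prefix) and action == "admin" and required.startswith(prefix + ".")


def has_required_scope(granted_scopes: Iterable[str], required: str) -> bool:
    """True if any granted scope satisfies the required one."""
    return any(scope_satisfies(granted, required) for granted in granted_scopes)


print(has_required_scope(["dataset.admin"], "dataset.query"))  # True
print(has_required_scope(["dataset.read"], "dataset.query"))   # False
```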
Optional EDC dataspace integration when EDR_ENABLED=true — checks Edc-Contract-Agreement-Id / Edc-Bpn headers.
Installed as dataset-cli (pyproject.toml [project.scripts]).
Key commands (see taskfile.yaml):
- `task cli:export:governance`: extract governance metadata from `governance.yaml` files in pipelines repos into `data/governance/`
- `task cli:import:governance`: import extracted YAML into the API catalogue
- `task cli:export:openlineage` / `task cli:import:openlineage`: same for Marquez lineage data
task setup # uv sync
task run # uvicorn on :8001 with reload
task debug # same with debugpy on :48001
task test # pytest (append -- -k "name" to filter)
task alembic:migrate # alembic upgrade head

Requires Python >= 3.12, uv as package manager, hatchling for builds.
Local PostgreSQL expected at :15432 (credentials postgres:securepassword123). Settings use Pydantic Settings v2 with .env file support; defaults work for local dev. Cross-service refs use host.docker.internal.
- Source layout: `src/celine/dataset/` (namespace package for cross-celine compatibility)
- Settings: single `Settings()` instance in `core/config.py`, env vars override defaults
- Tests: `pytest-asyncio` with dependency override fixtures in `tests/conftest.py`. SQL parser has dedicated security test suites (injection, fuzzing, jailbreak) under `tests/api/dataset_query/sql_parser/`
- DCAT catalogue: `api/catalogue/dcat_formatter.py` produces JSON-LD. Publisher metadata enriched from `owners.yaml`
- Versioning: `python-semantic-release`, `task release`
| Path | Purpose |
|---|---|
| `core/config.py` | All settings and env vars |
| `api/dataset_query/parser.py` | SQL validation allowlist |
| `api/dataset_query/executor.py` | Query execution, limits, timeout |
| `api/dataset_query/row_filters/` | Row-level filter framework |
| `security/governance.py` | Access enforcement entry point |
| `security/auth.py` | JWT validation |
| `policies/celine/dataset.rego` | OPA access policy |
| `db/reflection.py` | Dynamic table introspection |
| `db/models/dataset_entry.py` | Catalogue ORM model |
| `api/catalogue/dcat_formatter.py` | DCAT-AP 3.0 serialization |
| `cli/` | CLI commands (export/import) |
| `routes/` | FastAPI routers (auto-discovered) |