# Access to JSONSchema serialization of dandi-schema dandiset.json

## Current situation

This mermaid diagram depicts the current overall definition and flow of the metadata schema:

```mermaid
flowchart TD
%% repositories as grouped nodes
subgraph dandi_schema_repo["<a href='https://github.com/dandi/dandi-schema/'>dandi/dandi-schema</a>"]
Pydantic["Pydantic Models"]
end

subgraph schema_repo["<a href='https://github.com/dandi/schema/'>dandi/schema</a>"]
JSONSchema["JSONSchema<br>serializations"]

end

subgraph dandi_cli_repo["<a href='https://github.com/dandi/dandi-cli'>dandi-cli</a>"]
CLI["CLI & Library<br>validation logic<br/>(Python)"]
end

subgraph dandi_archive_repo["<a href='https://github.com/dandi/dandi-archive/'>dandi-archive</a>"]
Meditor["Web UI<br/>Metadata Editor<br/>(meditor; Vue)"]
API["Archive API<br/>(Python; DJANGO)"]
Storage[("DB (Postgresql)")]
end

%% main flow
Pydantic -->|"serialize into<br/>(CI)"| JSONSchema
Pydantic -->|used to validate| CLI
Pydantic -->|used to validate| API

JSONSchema -->|used to produce| Meditor
JSONSchema -->|used to validate| Meditor
Meditor -->|submits metadata| API

CLI -->|used to upload & submit metadata| API

API <-->|metadata JSON| Storage

%% styling
classDef repo fill:#f9f9f9,stroke:#333,stroke-width:1px;
classDef code fill:#e1f5fe,stroke:#0277bd,stroke-width:1px;
classDef ui fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px;
classDef data fill:#fff3e0,stroke:#e65100,stroke-width:1px;
JSONSchema@{ shape: docs }

class dandi_schema_repo,schema_repo,dandi_cli_repo,dandi_archive_repo repo;
class Pydantic,CLI,API code;
class JSONSchema,Storage data;
class Meditor ui;
```

In summary, dandi-archive relies on two *instantiations* of `dandi-schema`:

- **Pydantic**: the backend validates metadata using the Python library;
- **JSONSchema**: the frontend UI is generated from, and entries are validated against, definitions loaded from the JSONSchema serialization.

### Pydantic models: backend

The JSONSchema serializations are generated from the Pydantic models in the `dandi-schema` repository and stored in the `dandi/schema` repository for every version of the `dandi-schema` Pydantic models.
The idea was to be able to validate against a specific version of the `dandi-schema` model.
AFAIK that was never realized, and `dandi-archive` always uses a specific version of the `dandi-schema` model, as prescribed by the `DANDI_SCHEMA_VERSION` constant [in `dandischema.consts`](https://github.com/dandi/dandi-schema/blob/HEAD/dandi-schema/consts.py), with the possibility to override it in [dandiapi.settings](https://github.com/dandi/dandi-archive/blob/HEAD/dandiapi/settings.py#L98C1-L101C85):

```python
from dandischema.consts import DANDI_SCHEMA_VERSION as _DANDI_SCHEMA_VERSION

class DandiMixin(ConfigMixin):
...
# This is where the schema version should be set.
# It can optionally be overwritten with the environment variable, but that should only be
# considered a temporary fix.
DANDI_SCHEMA_VERSION = values.Value(default=_DANDI_SCHEMA_VERSION, environ=True)
```

We also hardcode a very specific version of `dandi-schema` in the `dandi-archive` repository's [`setup.py`](https://github.com/dandi/dandi-archive/blob/HEAD/setup.py):

```python
# Pin dandischema to exact version to make explicit which schema version is being used
'dandischema==0.11.0', # schema version 0.6.9
```

The backend then uses the `dandischema` library (via Celery tasks, AFAIK) to validate the metadata against both the Pydantic and JSONSchema models:

```console
❯ git grep -e 'validate(' -e 'import.*validate\>' dandiapi/api/services/
dandiapi/api/services/metadata/__init__.py:from dandischema.metadata import aggregate_assets_summary, validate
dandiapi/api/services/metadata/__init__.py: validate(metadata, schema_key='PublishedAsset', json_validation=True)
dandiapi/api/services/metadata/__init__.py: validate(
dandiapi/api/services/publish/__init__.py:from dandischema.metadata import aggregate_assets_summary, validate
dandiapi/api/services/publish/__init__.py: validate(new_version.metadata, schema_key='PublishedDandiset', json_validation=True)
```
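
For orientation, here is a minimal sketch of what such a call looks like, mirroring the `PublishedDandiset` call above. The metadata dict is made up and heavily truncated, so `validate()` will raise here; it only illustrates the call shape.

```python
from dandischema.metadata import validate

# Made-up, heavily truncated dandiset metadata purely to illustrate the call;
# a real record carries many more required fields.
metadata = {
    "schemaKey": "Dandiset",
    "schemaVersion": "0.6.9",
    "name": "Example dandiset",
}

# json_validation=True additionally validates against the JSONSchema
# serialization, mirroring the dandi-archive calls shown above; an incomplete
# record like this one will raise a validation error here.
validate(metadata, schema_key="PublishedDandiset", json_validation=True)
```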

### Web frontend (Vue)

The web frontend uses the JSONSchema serialization via vjsf to produce the Web UI.
It is unclear, though, whether we are up to date, since

```console
❯ head -n4 web/src/types/schema.ts
/**
* This file was automatically generated by json-schema-to-typescript.
* DO NOT MODIFY IT BY HAND. All changes should be made through the "yarn migrate" command.
* TypeScript typings for dandiset metadata are based on schema v0.6.2 (https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.2/dandiset.json)
```

even though we already use schema version 0.6.9.

The web frontend then takes the `schema_url` from the `/info` endpoint (thus the JSONSchema serialization from the `dandi/schema` repository) and loads it into `.schema` in [dandiset.ts](https://github.com/dandi/dandi-archive/blob/master/web/src/stores/dandiset.ts#L109).
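
To make the flow concrete, here is a small illustrative Python snippet reproducing what the frontend does; the API base URL is just an example, and the `schema_url` field name follows the `/info` response described above.

```python
import requests

# Example API base URL of a DANDI archive instance (assumption for illustration).
API_URL = "https://api.dandiarchive.org/api"

# Step 1: /info advertises, among other things, where the JSONSchema lives.
info = requests.get(f"{API_URL}/info/").json()
schema_url = info["schema_url"]

# Step 2: fetch the JSONSchema serialization itself (currently a static file in
# the dandi/schema repository; under the proposal below it would instead be
# served by the archive backend).
dandiset_schema = requests.get(schema_url).json()
print(schema_url)
print(sorted(dandiset_schema.get("properties", {})))
```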

### Vendorization

The initial motivation for this PR/design document was the de-re-vendorization of DANDI instances, with initial changes in [dandi-schema:PR#294](https://github.com/dandi/dandi-schema/pull/294).
See [dandi-archive:issue#2382](https://github.com/dandi/dandi-archive/issues/2382) for more information on the de-re-vendorization of dandi-archive.


### Summary

We
- neither support nor use multiple versions of the schema in dandi-archive
- use two instantiations of the schema and rely on an external process to generate the JSONSchema from the Pydantic models
- manually trigger updates of the web frontend files according to some version of the schema
- hardcode some vendorization inside the dandi-archive codebase (backend and frontend)
- **as a result, any vendorization applied to the Pydantic models via configuration at runtime is not reflected in the JSONSchema serialization used by the web frontend, since that is loaded from a generic serialization!**


## Proposed solution

### Summary

The overall idea is to remove the use of (and reliance on) the JSONSchema serializations in https://github.com/dandi/schema/ and instead produce the serialization used by the frontend directly in the backend, either at startup time or upon an API request (public or not), thus accounting for possible vendorization.

### Details

- create an API endpoint `/api/schema/({version}|latest)/dandiset/`; initially `{version}` would not support versions other than the one in use (but that could be exposed later) — a minimal sketch of such an endpoint follows after this list
- ideally, it should in the future allow alternative content type requests, defaulting e.g. to [application/schema+json](https://json-schema.org/draft/2020-12/json-schema-core#section-14) to provide the JSONSchema serialization
- `schema_url` in `/info` should point to that instance's `/api/schema/{version}/dandiset/` URL, so that the web frontend loads the schema from the backend/API instead of relying on the static JSONSchema serialization in the separate `dandi/schema` repository
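
A minimal sketch of what such an endpoint could look like, assuming a plain Django view and pydantic v2's stock `model_json_schema()` (dandi-schema may well use its own schema-generation helper instead; names below are illustrative, not actual dandi-archive code):

```python
from django.http import Http404, JsonResponse

from dandischema.consts import DANDI_SCHEMA_VERSION
from dandischema.models import Dandiset


def dandiset_schema(request, version: str):
    # For now only the schema version this instance actually runs with is
    # served; "latest" is treated as an alias for it.
    if version not in ("latest", DANDI_SCHEMA_VERSION):
        raise Http404(f"Schema version {version} is not served by this instance")
    # Serialized at request time, so any runtime vendorization of the Pydantic
    # models is reflected in what the frontend receives.
    return JsonResponse(
        Dandiset.model_json_schema(),
        content_type="application/schema+json",
    )
```

The route would then be registered as something like `path("api/schema/<str:version>/dandiset/", dandiset_schema)`, and `schema_url` in `/info` pointed at it.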

## Additional considerations

### s3://dandiarchive/schema/

As all of our data is already on the `dandiarchive` S3 bucket, I wonder whether, while refactoring our use of JSONSchema serializations, we should also mirror the `dandi/schema` repository under `s3://dandiarchive/schema/` and point to `context.json` there instead of on GitHub.
Mirroring could be implemented via `git annex exporttree` from the `dandi/schema` repository.
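
For illustration only, a rough sketch of such mirroring using plain boto3 (rather than `git annex exporttree`), assuming a local checkout of `dandi/schema` at `./schema`:

```python
from pathlib import Path

import boto3

# Assumed local clone of the dandi/schema repository and target bucket.
RELEASES = Path("./schema/releases")
BUCKET = "dandiarchive"

s3 = boto3.client("s3")
for path in sorted(RELEASES.rglob("*.json")):
    # Mirror releases/<version>/<file>.json under s3://dandiarchive/schema/
    key = f"schema/{path.relative_to(RELEASES.parent).as_posix()}"
    s3.upload_file(
        str(path), BUCKET, key, ExtraArgs={"ContentType": "application/json"}
    )
    print(f"uploaded s3://{BUCKET}/{key}")
```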