- Production data model
- Editing data models
- Release Process
- Developing in a codespace
- Legacy data models:
AD.model.* (csv | jsonld): this is the current, "live" version of the AD Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.
- production DCA: https://dca.app.sagebionetworks.org/
- staging DCA (used by FAIR Data team to test updates to app infrastructure and schematic package): https://dca-staging.app.sagebionetworks.org/
Do not edit AD.model.csv or AD.model.jsonld by hand!
The main branch of this repo is protected, so you cannot push changes to main. To make changes to the data model:
- Create a new branch in this repo and give it a descriptive name (the name will later be used to generate release notes). The CI/CD workflows will not work from a private fork.
- On that branch, make and commit any changes. You can do this by cloning the repo locally or by using a GitHub codespace. Please write informative commit messages in case we need to track down data model inconsistencies or newly introduced bugs.
- Open a pull request and request review from someone else on the AD DCC team. The `schema-convert` and `test` jobs will run as soon as you open the PR. If either job fails, something in the data model csv could not be converted to json-ld and should be investigated.
- Once all the necessary changes have been made, the `schema-convert` and `test` jobs complete successfully, and the PR is approved, the PR is ready for merge.
- To merge the PR, click on the `labels` tab in the GitHub PR conversation and add the `automerge` label. Once the label is applied, the `build` workflow will merge the PR and add the appropriate metadata templates and JSONSchema files to `main`.
For more information on the automated workflows, review the CI/CD documentation for the build workflow.
The full AD.model.csv file has over 1400 attributes, which makes it unwieldy to edit and makes changes hard to review. For ease of editing, the full data model is divided into "module" subfolders, like so:
data-models/
├── AD.model.csv (do not edit!)
├── AD.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv
Within each module, every attribute in the data model where Parent = ManifestColumn has its own csv, named after that attribute (example: organ.csv). Any valid values of the attribute "organ" have Parent = organ and are listed as rows in the file organ.csv. Attributes with Parent = ManifestColumn are used as columns in metadata and annotation manifest templates. Attributes with Parent = ManifestTemplate describe the templates themselves. At this time, any other value for Parent means the attribute is a valid value of some other enumerated attribute.
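The `Parent` rules above can be sketched in a few lines of Python. This is only an illustration: the sample rows are made up, the header name `Attribute` is an assumption, and the real module files carry more columns than shown here.

```python
import csv
import io

# Hypothetical excerpt of a module file such as organ.csv. The header name
# "Attribute" is an assumption; real module files have more columns.
ORGAN_CSV = """Attribute,Parent
organ,ManifestColumn
brain,organ
liver,organ
"""


def classify(parent):
    """Map a Parent value to the role the attribute plays in the data model."""
    if parent == "ManifestColumn":
        return "column"
    if parent == "ManifestTemplate":
        return "template"
    # Any other Parent names the enumerated attribute this value belongs to.
    return "valid value of " + parent


def roles(csv_text):
    """Return {attribute name: role} for every row of a module csv."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Attribute"]: classify(row["Parent"]) for row in reader}


print(roles(ORGAN_CSV))
```

Here "organ" is reported as a column, while "brain" and "liver" are reported as valid values of "organ".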
Some common data model editing scenarios are:
- If you wanted to add a new valid value "eyeball" to our existing column attribute "organ", after making a new branch and opening the repo either locally or within a codespace, you would go to `modules/biospecimen/organ.csv`.
- Then, create a new row for an attribute named "eyeball", with a description and source (preferably an ontology URI). In the `Parent` column, make sure the value is "organ".
- Next, find the row for the attribute "organ" (it should be the first row), and within the valid values column, add "eyeball" to the comma-separated list of valid values.
- Save your changes and write an informative commit. Please try to add valid values alphabetically!
- If you wanted to add the column "furColor" to the "model-ad_individual_animal_metadata" template, first decide which module the new column should belong to. In this case, "MODEL-AD" makes the most sense.
- Within the `MODEL-AD` subfolder, create a new csv called `furColor.csv` with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure `Parent` = `ManifestColumn`. Add any valid values for "furColor" as new rows to this csv as described in the previous scenario.
- Find the manifest template attributes in `modules/template/templates.csv`. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the `DependsOn` column.
- Save your changes and write an informative commit.
- If you wanted to add a new template to the data model, first add the template to the bottom of the `template.csv` file, with the column names in the order they will appear. Example: `assay_spatialTranscriptomics_metadata_template,SysBio spatial transcriptomics metadata template schema,,"Component, individualID, biospecimenID, fileFormat, sequenceAnalysis, runID, captureArea, readIndicator, spatialRead1, spatialRead2",,False,ManifestTemplate,,sysbio.metadataTemplates-assay.spatialTranscriptomics,,,template`
- Next, add the new template name to the bottom of `dca-template-config.json`. Example: `"display_name": "assay_spatialTranscriptomics_metadata_template", "schema_name": "AssaySpatialTranscriptomicsMetadataTemplate", "type": "record"`
- Stage and commit these two changes, then open a PR and request review from another AD data manager. Once the PR is approved and merged, a GitHub Action workflow will run that uploads the new template to the AD Metadata Dictionary site. Note: the GitHub Actions that join the modules, convert them to a json-ld data model, and generate a test template for review take a few minutes to complete; a green check mark shows when they are done. The GitHub Action that uploads the new template to the AD Metadata Dictionary site takes roughly 1.5 hours to complete.
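The first two scenarios above come down to two small edits on comma-separated cells. The sketch below shows them on in-memory rows. It assumes the header names `Attribute`, `Description`, `Valid Values`, `Parent`, `Source`, and `DependsOn`; the helper names and the example source URI are hypothetical, and this is not how the repo's workflows do it.

```python
def split_cell(cell):
    """Split a comma-separated cell into a list of trimmed names."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def add_valid_value(rows, column, value, description, source):
    """Scenario 1: register `value` as a valid value of `column`."""
    column_row = next(r for r in rows if r["Attribute"] == column)
    values = split_cell(column_row["Valid Values"])
    if value not in values:
        # Insert before the first existing value that sorts after the new one,
        # so the current order of the list is left untouched.
        index = next(
            (i for i, v in enumerate(values) if v.lower() > value.lower()),
            len(values),
        )
        values.insert(index, value)
        column_row["Valid Values"] = ", ".join(values)
        # The new value also gets its own row, with Parent set to the column.
        rows.append({"Attribute": value, "Description": description,
                     "Valid Values": "", "Parent": column, "Source": source})


def add_column_to_template(template_rows, template, column):
    """Scenario 2: append `column` to a template's DependsOn list."""
    template_row = next(r for r in template_rows if r["Attribute"] == template)
    columns = split_cell(template_row["DependsOn"])
    if column not in columns:
        columns.append(column)  # list order is the order shown in the template
    template_row["DependsOn"] = ", ".join(columns)


# Made-up rows standing in for organ.csv and the templates csv.
organ_rows = [{"Attribute": "organ", "Description": "An organ",
               "Valid Values": "brain, liver", "Parent": "ManifestColumn",
               "Source": ""}]
add_valid_value(organ_rows, "organ", "eyeball", "The eye",
                "https://example.org/eyeball")  # placeholder URI
print(organ_rows[0]["Valid Values"])

template_rows = [{"Attribute": "model-ad_individual_animal_metadata",
                  "DependsOn": "Component, individualID"}]
add_column_to_template(template_rows, "model-ad_individual_animal_metadata",
                       "furColor")
print(template_rows[0]["DependsOn"])
```

In practice the rows would be read with `csv.DictReader` and written back with `csv.DictWriter`, so that long quoted cells such as Description survive the round trip.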
For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team slack channel.
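For the new-template scenario above, the `dca-template-config.json` entry can be drafted with a small helper. This sketch rests on an assumption: the rule for deriving `schema_name` (capitalize the first letter of each underscore-separated part and join them) is inferred from the single example given above, so check the result against existing entries before committing. The function names are hypothetical.

```python
import json


def schema_name_from(display_name):
    """Capitalize the first letter of each underscore-separated part and join.

    Assumption: this rule is inferred from one example in this README.
    """
    parts = display_name.split("_")
    return "".join(part[:1].upper() + part[1:] for part in parts)


def template_config_entry(display_name, template_type="record"):
    """Build one config entry with the three keys shown in the example."""
    return {
        "display_name": display_name,
        "schema_name": schema_name_from(display_name),
        "type": template_type,
    }


entry = template_config_entry("assay_spatialTranscriptomics_metadata_template")
print(json.dumps(entry, indent=2))
```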
A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:
- Editing in the GitHub UI: convenient, but challenging to keep track of columns in plain text format.
- Cloning the repo, making a branch, and opening csvs locally in Excel or another spreadsheet program 🖥️ : probably the best UI experience, but involves a few extra steps with git.
- Using a GitHub codespace to launch VSCode in the browser, and editing with the pre-installed RainbowCSV extension 🌈 : still difficult to edit csvs as plain text, but the color formatting and the ability to use a soft word wrap make it much easier to distinguish columns. RainbowCSV lets you designate "sticky" rows and columns for easier scrolling, and also has a nice "CSVLint" function that will check for formatting errors after you make changes.
We are exploring better solutions to this problem -- if you have ideas, tell us!
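One common formatting error after a hand edit is an unquoted comma that shifts a row's columns. A minimal check for that, using only the Python standard library, could look like the sketch below. It is not a replacement for RainbowCSV's CSVLint, and the sample rows and function name are made up for illustration.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (line number, field count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            # line_num is the last source line consumed for this row.
            problems.append((reader.line_num, len(row)))
    return problems


# The second data row has an unquoted comma in its description,
# so it parses as four fields instead of three.
SAMPLE = (
    "Attribute,Description,Parent\n"
    'organ,"A unique, macroscopic part of the body",ManifestColumn\n'
    "eyeball,The eye, organ of sight,organ\n"
)

print(find_ragged_rows(SAMPLE))
```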
To perform a release of the data model and trigger the registration of JSONSchema files with Synapse, perform the following steps:
- From the main repository page, click on the `Releases` tab on the right.
- Select `Draft a new release` at the top right of the page.
- Click `Tag: Select Tag` and create a new tag that will be the release version. The tag should follow the convention: `v<major-version>.<minor-version>.<patch-number>`. Also ensure that the version is not one that has been previously used. After the tag is entered, select `Create new tag: vx.x.x on Publish`.
- Ensure that `Target:` is set to `main`.
- Under `Release Title` enter the version number.
- Under `Release Notes` select `Generate release notes` and review the generated release notes for accuracy.
- Once everything is set as appropriate, check `Set as the latest release`.
- Finally, click `Publish release` at the bottom of the page.
This will trigger a workflow to register all of the JSONSchema files with the specified organization on Synapse. For more information, see the CI/CD documentation for the release workflow.
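Before drafting a release, the tag convention and the "not previously used" rule can be checked with a short script. This is a sketch only: the list of existing tags is made up, and the function name is hypothetical. In practice the existing tags would come from the repository's Releases page.

```python
import re

# v<major-version>.<minor-version>.<patch-number>, digits only in each part.
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def is_valid_new_tag(tag, existing_tags):
    """True if `tag` follows the convention and has not been used before."""
    return TAG_PATTERN.fullmatch(tag) is not None and tag not in existing_tags


# Hypothetical tags standing in for the ones already published.
existing = ["v1.0.0", "v1.0.1", "v1.1.0"]

for candidate in ["v1.1.1", "v1.1.0", "1.2.0", "v1.2"]:
    print(candidate, is_valid_new_tag(candidate, existing))
```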
If you want to make changes to the data model and test them out by generating manifests with schematic, you can use the devcontainer in this repo with a GitHub Codespace. This will open a container in a remote instance of VSCode and install the latest version of schematic. The devcontainer also installs the Rainbow CSV extension. You can make changes, commit them, and open a PR from the codespace.
Codespace secrets:
- SYNAPSE_PAT: scoped to view and download permissions on the sysbio-dcc-tasks-01 Synapse service account
- SERVICE_ACCOUNT_CREDS: these are creds for using the Google sheets api with schematic
- GENERATE_MANIFESTS_ON_ACTION: this fine-grained PAT lets manifests be successfully generated despite branch protection, has read/write access to actions, and has repo access to adknowledgeportal/data-models
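Codespaces exposes its secrets to the container as environment variables, so a quick way to confirm that a codespace is configured is to check for the three names listed above. The sketch below assumes all three are needed for your task, which may not hold for every workflow. The helper name is hypothetical.

```python
import os

REQUIRED_SECRETS = (
    "SYNAPSE_PAT",
    "SERVICE_ACCOUNT_CREDS",
    "GENERATE_MANIFESTS_ON_ACTION",
)


def missing_secrets(environ=os.environ):
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not environ.get(name)]


missing = missing_secrets()
if missing:
    print("Missing codespace secrets: " + ", ".join(missing))
else:
    print("All codespace secrets are set.")
```

The check only reports which names are absent and never prints the secret values themselves.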
Previous versions of the data model live in the legacy-data-models/ folder. These include the Diverse Cohorts pilot model and the initial "legacy" model representing the AD Portal Synapse project metadata dictionary and metadata templates from August 2023. These are not being used by DCA.