adknowledgeportal/data-models

The AD Knowledge Portal data model

Production data model

AD.model.* (csv | jsonld): this is the current, "live" version of the AD Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.

Editing data models

⚠️ Do not edit AD.model.csv or AD.model.jsonld by hand! ⚠️

GitHub branch procedure:

The main branch of this repo is protected, so you cannot push changes to main. To make changes to the data model:

  1. Create a new branch in this repo and give it a descriptive name (the name will later be used to generate release notes). The CI/CD workflows will not work from a private fork.
  2. On that branch, make and commit any changes. You can do this by cloning the repo locally or by using a GitHub codespace. Please write informative commit messages in case we need to track down data model inconsistencies or newly introduced bugs.
  3. Open a pull request and request review from someone else on the AD DCC team. The schema-convert and test jobs will run as soon as you open the PR. If the jobs fail, something in the data model csv could not be converted to JSON-LD and should be investigated.
  4. Once all the necessary changes have been made, the schema-convert and test jobs complete successfully, and the PR is approved, the PR is ready for merge.
  5. To merge the PR, click on the labels tab in the GitHub PR conversation and add the automerge label. Once the label is applied, the build workflow will merge the PR and add the appropriate metadata templates and JSONSchema files to main.

For more information on the automated workflows, review the CI/CD documentation for the build workflow.

Editing attributes by module:

The full AD.model.csv file has over 1400 attributes, which makes it unwieldy to edit and makes changes to it hard to review. For ease of editing, the full data model is divided into "module" subfolders, like so:

data-models/
├── AD.model.csv (do not edit!)
├── AD.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv

Within each module, every attribute in the data model where Parent = ManifestColumn has its own csv, named after that attribute (example: organ.csv). Any valid values of the attribute "organ" have Parent = organ and are listed as rows in the file organ.csv. Attributes with Parent = ManifestColumn are used as columns in metadata and annotation manifest templates. Attributes with Parent = ManifestTemplate describe the templates themselves. At this time, any other value for Parent means the attribute is a valid value of some other enumerated attribute.
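The Parent rules above can be sketched as a small lookup. This is only an illustration of the convention described in this section; the function name and the returned labels are hypothetical and are not part of schematic or this repo.

```python
def classify_attribute(parent: str) -> str:
    """Describe an attribute's role based on the value in its Parent column."""
    parent = parent.strip()
    if parent == "ManifestColumn":
        # Used as a column in metadata and annotation manifest templates.
        return "manifest column"
    if parent == "ManifestTemplate":
        # Describes a template itself.
        return "manifest template"
    # Any other Parent names the enumerated attribute this row is a valid value of.
    return f"valid value of {parent}"


print(classify_attribute("ManifestColumn"))
print(classify_attribute("organ"))
```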

Some common data model editing scenarios are:

Adding a new valid value to an existing manifest column:

  1. If you wanted to add a new valid value "eyeball" to our existing column attribute "organ", after making a new branch and opening the repo either locally or within a codespace, you would go to modules/biospecimen/organ.csv.
  2. Then, create a new row for an attribute named "eyeball", with a description and source (preferably an ontology URI). In the Parent column, make sure the value is "organ".
  3. Next, find the row for the attribute "organ" (it should be the first row) and, in the Valid Values column, add "eyeball" to the comma-separated list of valid values.
  4. Save your changes and write an informative commit. Please try to add valid values alphabetically!
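For anyone who prefers to script this edit, steps 2 and 3 could look roughly like the sketch below. It assumes the module csv has columns named Attribute, Description, Valid Values, Parent, and Source; check the header row of the real file before relying on those names. The function `add_valid_value` is hypothetical, not an existing tool in this repo.

```python
import csv
import io


def add_valid_value(csv_text, column, new_value, description="", source=""):
    """Return the module csv text with new_value added as a valid value of column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows = list(reader)

    # Step 3: extend the comma-separated list on the column's own row.
    column_row = next((r for r in rows if r["Attribute"] == column), None)
    if column_row is None:
        raise ValueError(f"no row found for attribute {column!r}")
    values = [v.strip() for v in column_row["Valid Values"].split(",") if v.strip()]
    if new_value not in values:
        values.append(new_value)
    # Note: sorting re-orders the whole list, not just the new entry.
    column_row["Valid Values"] = ", ".join(sorted(values, key=str.lower))

    # Step 2: add a row for the new value, with Parent set to the column name.
    if not any(r["Attribute"] == new_value for r in rows):
        new_row = dict.fromkeys(fieldnames, "")
        new_row.update(Attribute=new_value, Description=description,
                       Parent=column, Source=source)
        rows.append(new_row)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The function works on text rather than on a file path, so it can be tried on a copy of organ.csv before anything is written back. Review the resulting diff as you would for a manual edit.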

Adding a new column to a manifest template:

  1. If you wanted to add the column "furColor" to the "model-ad_individual_animal_metadata" template, first decide which module the new column should belong to. In this case, "MODEL-AD" makes the most sense.
  2. Within the MODEL-AD subfolder, create a new csv called furColor.csv with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure Parent = ManifestColumn. Add any valid values for "furColor" as new rows to this csv as described in the previous scenario.
  3. Find the manifest template attributes in modules/template/templates.csv. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the DependsOn column.
  4. Save your changes and write an informative commit.
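Step 3 can also be scripted. The sketch below assumes the templates csv has Attribute and DependsOn columns, and the helper name `add_column_to_template` is hypothetical. Unlike valid values, the DependsOn list is not sorted here: the new column is appended so the existing column order of the template is preserved.

```python
import csv
import io


def add_column_to_template(csv_text, template, column):
    """Return the templates csv text with column appended to a template's DependsOn list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows = list(reader)

    for row in rows:
        if row["Attribute"] == template:
            columns = [c.strip() for c in row["DependsOn"].split(",") if c.strip()]
            if column not in columns:
                columns.append(column)  # appended last; existing order is kept
            row["DependsOn"] = ", ".join(columns)
            break
    else:
        raise ValueError(f"no template row named {template!r}")

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

If the new column should appear somewhere other than the end of the manifest, move it by hand after running the helper.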

Adding a new template to the data model:

  1. If you wanted to add a new template to the data model, first add the template to the bottom of the 'template.csv' file, with the column names in the order they will appear. Example: 'assay_spatialTranscriptomics_metadata_template,SysBio spatial transcriptomics metadata template schema,,"Component, individualID, biospecimenID, fileFormat, sequenceAnalysis, runID, captureArea, readIndicator, spatialRead1, spatialRead2",,False,ManifestTemplate,,sysbio.metadataTemplates-assay.spatialTranscriptomics,,,template'
  2. Next, add the new template name to the bottom of the dca-template-config.json. Example: '"display_name": "assay_spatialTranscriptomics_metadata_template", "schema_name": "AssaySpatialTranscriptomicsMetadataTemplate", "type": "record"'
  3. Stage, commit, and open a PR with these two changes, requesting review by another AD data manager. Once approved and merged, a GitHub Action workflow will run that uploads the new template to the AD Metadata Dictionary site. Note: the GitHub Actions that join the modules, convert them to a JSON-LD data model, and generate a test template for review take a few minutes to complete. A green check indicates that they have finished. The GitHub Action that uploads the new template to the AD Metadata Dictionary site takes roughly 1.5 hours to complete.
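The config entry in step 2 can be drafted from the template name. The sketch below infers the schema_name convention from the single example above (upper-case the first letter of each underscore-separated part, then join the parts); that inference is an assumption, so compare the result with existing entries in dca-template-config.json. Both function names are hypothetical.

```python
import json


def schema_name_from_display_name(display_name: str) -> str:
    """Upper-case the first letter of each underscore-separated part and join them."""
    return "".join(part[:1].upper() + part[1:] for part in display_name.split("_"))


def template_config_entry(display_name: str, template_type: str = "record") -> dict:
    """Build one config entry with the three keys shown in the example above."""
    return {
        "display_name": display_name,
        "schema_name": schema_name_from_display_name(display_name),
        "type": template_type,
    }


entry = template_config_entry("assay_spatialTranscriptomics_metadata_template")
print(json.dumps(entry, indent=2))
```

Template names that contain hyphens (for example, names starting with "model-ad") may follow a different convention that this sketch does not handle.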

For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team slack channel.

Notes on collaboratively editing csvs

A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:

  • Editing in the GitHub UI :octocat: : convenient, but challenging to keep track of columns in plain text format.
  • Cloning the repo, making a branch, and opening csvs locally in Excel or another spreadsheet program 🖥️ : probably the best UI experience, but involves a few extra steps with git.
  • Using a GitHub codespace to launch VSCode in the browser, and editing with the pre-installed RainbowCSV extension 🌈 : still difficult to edit csvs as plain text, but the color formatting and the ability to use a soft word wrap make it much easier to distinguish columns. RainbowCSV lets you designate "sticky" rows and columns for easier scrolling, and also has a nice "CSVLint" function that will check for formatting errors after you make changes.
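Whichever editor you use, one common plain-text mistake is a row with too many or too few fields, usually from a missing quote or a stray comma. The sketch below is a minimal standalone check for that one problem. It is not RainbowCSV's CSVLint and makes no claim to match its behavior; the function name is hypothetical.

```python
import csv
import io


def find_ragged_rows(csv_text: str) -> list:
    """Return (record_number, field_count) for records whose field count differs from the header.

    Record numbers count parsed csv records, with the header as record 1.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    return [
        (number, len(record))
        for number, record in enumerate(reader, start=2)
        if record and len(record) != expected  # skip completely blank lines
    ]


sample = 'Attribute,Description,Parent\norgan,"Organ, sampled",ManifestColumn\nbrain,Brain\n'
print(find_ragged_rows(sample))
```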

We are exploring better solutions to this problem -- if you have ideas, tell us!

Release Process

To perform a release of the data model and trigger the registration of JSONSchema files with Synapse, perform the following steps:

  1. From the main repository page, click on the Releases tab on the right.
  2. Select Draft a new release at the top right of the page.
  3. Click Tag: Select Tag and create a new tag for the release version. The tag should follow the convention v<major-version>.<minor-version>.<patch-number>. Make sure the version has not been used before. After entering the tag, select Create new tag: vx.x.x on Publish.
  4. Ensure that Target: is set to main.
  5. Under Release Title enter the version number.
  6. Under Release Notes select Generate release notes and review the generated release notes for accuracy.
  7. Once everything is set appropriately, check Set as the latest release.
  8. Finally, click Publish release at the bottom of the page.

This will trigger a workflow to register all of the JSONSchema files with the specified organization on Synapse. For more information, see the CI/CD documentation for the release workflow.
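The tag convention from step 3 can be checked before you publish. The sketch below tests two things: that the tag has the form v<major-version>.<minor-version>.<patch-number>, and that it is not in a list of existing tags that you supply. The helper name is hypothetical, and gathering the existing tags (for example, from the Releases page) is left to you.

```python
import re

# "v", then three dot-separated non-negative integers, e.g. v1.4.0
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def is_valid_new_tag(tag: str, existing_tags) -> bool:
    """True if tag matches the release convention and has not been used before."""
    return TAG_PATTERN.fullmatch(tag) is not None and tag not in set(existing_tags)


print(is_valid_new_tag("v1.2.3", ["v1.2.2"]))
print(is_valid_new_tag("1.2.3", []))
```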

Developing in a codespace

⚠️ If you are working in a GitHub Codespace, do NOT commit any Synapse credentials to the repository and do NOT use any real human data when testing data model function. This is not a secure environment!

If you want to make changes to the data model and test them by generating manifests with schematic, you can use the devcontainer in this repo with a GitHub Codespace. This will open a container in a remote instance of VSCode and install the latest version of schematic. The devcontainer also installs the Rainbow CSV extension. You can make changes, commit them, and open a PR from the codespace.

Codespace secrets:

  • SYNAPSE_PAT: scoped to view and download permissions on the sysbio-dcc-tasks-01 Synapse service account
  • SERVICE_ACCOUNT_CREDS: these are creds for using the Google sheets api with schematic
  • GENERATE_MANIFESTS_ON_ACTION: this fine-grained PAT lets manifests be successfully generated despite branch protection, has read/write access to actions, and has repo access to adknowledgeportal/data-models

Legacy data models:

Previous versions of the data model live in the legacy-data-models/ folder. This includes the Diverse Cohorts pilot model and the initial "legacy" model representing the AD Portal Synapse project metadata dictionary and metadata templates from August 2023. These are not being used by DCA.
