As of 2024-07-17, this repo contains both the production data model used by the ELITE portal to submit and validate metadata through the Data Curator App, and the data dictionary website, which is based on the data model and provides definitions for all metadata templates and terms used in it.
There is a separate data-dictionary repo which contains the same source code, and which can later be used to deploy the website when we are able to set up automation in that repository which successfully monitors this repository for changes. To simplify the process, for now we will use this data-models repo to manage both the data model and the dictionary.
- EL Metadata Dictionary Site
- EL Data Model
EL Metadata Dictionary is a Jekyll site that uses the Just the Docs theme and is published on GitHub Pages.
- `index.md` is the home page
- `_config.yml` can be used to tweak Jekyll settings, such as theme and title
- `_layout/` contains HTML templates we use to generate the web pages for each data model term
- the `_data/` folder stores data for Jekyll to use when generating the site
- files in `docs/` will be accessed by the GitHub Actions workflow to build the site
- two scripts in `processes/` can be run to generate updated files in `_data/` and `docs/` to publish changes in the data model to the dictionary site
- `.env` contains the link to the data model that the dictionary site is based on
- `Gemfile` lists the package dependencies for building the website
- `pyproject.toml` and `poetry.lock` list the Python and package dependencies for the scripts that update both the data model and the data dictionary site
- You can add additional descriptions to the home page or a specific page by directly editing `index.md` or the markdown files in `docs/`.
Interim process to update the metadata dictionary site after changes have been made to the data model:
> **Note:** do this in a SEPARATE PR after changes to the data model are merged to main. The scripts to do this reference the data model at the URL in `processes/.env`, which points to the main branch of this repo. It's not the most elegant right now, but keeping the data model updates and the dictionary site updates as separate steps will make rolling back errors easier while we shore up this process.
1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager `poetry`, so you don't need to install poetry. It should also run the command `poetry install` after you launch it, which tells poetry to install all the Python libraries specified by this project (this will include schematic).
2. Make a new branch.
3. From the top-level data-models directory, run `poetry run python processes/data_manager.py`. This should update some files within `_data/`.
4. Then run `poetry run python processes/page_manager.py`. This should update files within `docs/`.
5. Optional: you can run `poetry run python processes/create_network_graph.py` to create the schema visualization network graph. This is out of date and relatively unused, but it will be good to update and make more robust later.
6. Commit changes to your branch and open a PR. After review passes and the changes are merged to main, a GitHub Action will run via the `pages.yml` workflow to build and deploy the site to https://eliteportal.github.io/data-models/
1. Make sure you have the `poetry` dependency manager installed in your workspace.
2. Follow steps 2-5 from the section above.
3. Optional: preview the website locally by running `bundle exec jekyll serve`.
4. Commit changes to your branch and open a PR. After review passes and the changes are merged to main, a GitHub Action will run via the `pages.yml` workflow to build and deploy the site to https://eliteportal.github.io/data-models/
1. Install Jekyll: `gem install bundler jekyll`
2. Install Bundler dependencies: `bundle install`
3. Run `bundle exec jekyll serve` to build your site and preview it at `http://localhost:4000`. The built site is stored in the directory `_site`.
EL.data.model.* (csv | jsonld): this is the current, "live" version of the EL Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.
🚧 The GitHub Action that automatically compiles the module csvs and converts them to a json-ld is not working as expected as of June 2024. For now, please use the following procedure to manually compile attribute modules and convert the updated data model to a json-ld:
The main branch of this repo is protected, so you cannot push changes to main. To make changes to the data model:
1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager `poetry`, so you don't need to install poetry. It should also run the command `poetry install` after you launch it, which tells poetry to install all the Python libraries specified by this project (this will include schematic).
2. Make a new branch. On that branch, make and commit any changes. Please write informative commit messages in case we need to track down data model inconsistencies or introduced bugs.
3. Still in the top-level directory, run `poetry run python data_model_creation/join_data_model.py` from the terminal. This runs a python script that joins all the module csvs, does a few data frame quality checks, and uses `schematic schema convert` to create the updated json-ld data model.
4. If the script succeeds, double-check the version control history of your json-ld data model and make sure the changes you expected have been made! Save and commit all changes, then push your local branch to the remote.
5. Open a pull request and request review from someone else on the EL DCC team. The GitHub Action that runs when you open a PR will currently fail -- you can ignore this. The EL DCC team will perform manual checks before merging changes.
6. After the PR is merged, delete your branch.
1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager `poetry`, so you don't need to install poetry. It should also run the command `poetry install` after you launch it, which tells poetry to install all the Python libraries specified by this project (this will include schematic).
2. Follow steps 2-4 above.
3. Optional: to generate a test manifest, run `poetry run schematic manifest -c path/to/config.yml get -dt RelevantDataType -s` from the terminal. This will generate a json schema, a manifest csv, and a link to a Google Sheet version of the manifest. DO NOT put any real data in the Google Sheet manifest! This is just an integration test to see if the manifest columns and drop-downs look as expected. Don't commit the json schema and the manifest csv generated during this step to your branch -- these are ephemeral and should be deleted.
4. Open a pull request and request review from someone else on the EL DCC team. The GitHub Action that runs when you open a PR will currently fail -- you can ignore this. The EL DCC team will perform manual checks before merging changes.
5. After the PR is merged, delete your branch.
The full EL.data.model.csv file has over 200 attributes and is unwieldy to edit and hard to review changes for. For ease of editing, the full data model is divided into "module" subfolders, like so:
```
data-models/
├── EL.data.model.csv (do not edit!)
├── EL.data.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv
```
Within each module, every attribute in the data model has its own csv, named after that attribute (example: organ.csv).
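For orientation, an attribute csv might look something like the fragment below. This is an illustrative sketch, not the real contents of `organ.csv`: the headers shown are schematic's standard data model columns, and the exact header set and the `Parent` value used for valid-value rows are assumptions, so check an existing module csv for the repo's actual convention.

```csv
Attribute,Description,Valid Values,DependsOn,Required,Parent,Validation Rules
organ,Organ the specimen was collected from,"brain, liver, skin",,TRUE,ManifestColumn,
brain,,,,FALSE,validValue,
liver,,,,FALSE,validValue,
skin,,,,FALSE,validValue,
```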
Some common data model editing scenarios are:
- If you wanted to add a new valid value "eyeball" to our existing column attribute "organ", after making a new branch and opening the repo either locally or within a codespace, you would go to `modules/biospecimen/organ.csv`.
- Next, find the row for the attribute "organ" (it should be the first row), and within the valid values column, add "eyeball" to the comma-separated list of valid values.
- Save your changes and write an informative commit message. Please try to add valid values alphabetically!
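The "add a valid value" edit above can also be scripted, which makes keeping the list alphabetical easy. A minimal sketch; the `Valid Values` column name matches the scenario above, but the helper function and the example row are hypothetical:

```python
def add_valid_value(row: dict, new_value: str, column: str = "Valid Values") -> dict:
    """Insert new_value into the comma-separated list in `column`, keeping it alphabetical."""
    values = [v.strip() for v in row[column].split(",") if v.strip()]
    if new_value not in values:
        values.append(new_value)
    # Case-insensitive sort so "Eyeball" and "eyeball" land in the same place
    row[column] = ",".join(sorted(values, key=str.lower))
    return row

# Example: add "eyeball" to a hypothetical organ row
row = {"Attribute": "organ", "Valid Values": "brain,liver,skin"}
add_valid_value(row, "eyeball")
print(row["Valid Values"])  # brain,eyeball,liver,skin
```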
- If you wanted to add the column "furColor" to the "model-ad_individual_animal_metadata" template, first decide which module the new column should belong to. In this case, "MODEL-AD" makes the most sense.
- Within the `MODEL-AD` subfolder, create a new csv called `furColor.csv` with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure `Parent` is `ManifestColumn`. Add any valid values for "furColor" as new rows to this csv, as described in the previous scenario.
- Find the manifest template attributes in `modules/template/templates.csv`. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the `DependsOn` column.
- Save your changes and write an informative commit message.
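The `DependsOn` edit in the scenario above amounts to appending one name to a comma-separated cell. A sketch of that edit on csv text, using only the standard library; the function name and the tiny example csv are hypothetical, though the `Attribute`/`DependsOn` headers come from the scenario:

```python
import csv
import io

def add_template_dependency(csv_text: str, template: str, new_attr: str) -> str:
    """Append new_attr to the comma-separated DependsOn list of the given template row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["Attribute"] == template:
            attrs = [a.strip() for a in row["DependsOn"].split(",") if a.strip()]
            if new_attr not in attrs:
                attrs.append(new_attr)
            row["DependsOn"] = ",".join(attrs)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical two-column slice of templates.csv
text = 'Attribute,DependsOn\nmodel-ad_individual_animal_metadata,"individualID,species"\n'
print(add_template_dependency(text, "model-ad_individual_animal_metadata", "furColor"))
```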
For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team slack channel.
A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:
- Editing in the GitHub UI: convenient, but challenging to keep track of columns in plain text format.
- Cloning the repo, making a branch, and opening csvs locally in Excel or another spreadsheet program 🖥️: probably the best UI experience, but involves a few extra steps with git.
- Using a GitHub codespace to launch VSCode in the browser, and editing with the pre-installed RainbowCSV extension 🌈: still difficult to edit csvs as plain text, but the color formatting and the ability to use a soft word wrap make it much easier to distinguish columns. RainbowCSV lets you designate "sticky" rows and columns for easier scrolling, and also has a nice "CSVLint" function that will check for formatting errors after you make changes.
We are exploring better solutions to this problem -- if you have ideas, tell us!
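In the meantime, a lightweight check after hand-editing can catch the most common mistake: a row with the wrong number of fields. This is a hypothetical helper, not part of the repo's tooling, sketched with only the standard library:

```python
import csv
import io

def check_csv(text: str) -> list[str]:
    """Return a list of problems: rows whose field count differs from the header's."""
    rows = list(csv.reader(io.StringIO(text)))
    width = len(rows[0])
    problems = []
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {i}: expected {width} fields, found {len(row)}")
    return problems

good = "a,b,c\n1,2,3\n"
bad = "a,b,c\n1,2\n"
print(check_csv(good))  # []
print(check_csv(bad))   # ['line 2: expected 3 fields, found 2']
```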
❓ status unknown
Use `scraping_valid_values.py` to pull in values from EBI OLS sources.
🚧 currently broken! Also, the documentation here is from the AD data model; I don't think it's accurate for the EL repo.
When you open a PR that includes any changes to files in the `modules/` directory, a GitHub Action will automatically run before merging is allowed. This action:
- Runs the `assemble_csv_data_model.py` script to concatenate the modular attribute csvs into one data frame, sort alphabetically by `Parent` and then `Attribute`, and write the combined dataframe to `EL.data.model.csv`. The action then commits the changes to the master data model csv.
- Installs `schematic` from the develop branch and runs `schema convert` on the newly concatenated data model csv to generate a new version of the jsonld file `EL.data.model.jsonld`. The action also commits the changes to the jsonld.
If this automated workflow fails, then the data model may be invalid and further investigation is needed.
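Conceptually, the assembly step works like the sketch below. This is a simplified stand-in for `assemble_csv_data_model.py`, not its actual code; it assumes the module csvs share the schematic headers and shows only the concatenate-then-sort-by-`Parent`-then-`Attribute` behavior described above:

```python
import csv
import io

def assemble(module_csvs: list[str]) -> list[dict]:
    """Concatenate per-attribute csv texts and sort rows by Parent, then Attribute."""
    rows = []
    for text in module_csvs:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return sorted(rows, key=lambda r: (r["Parent"], r["Attribute"]))

# Two hypothetical module csvs, trimmed to two columns
organ = "Attribute,Parent\norgan,ManifestColumn\nbrain,validValue\n"
platform = "Attribute,Parent\nplatform,ManifestColumn\n"
combined = assemble([organ, platform])
print([r["Attribute"] for r in combined])  # ['organ', 'platform', 'brain']
```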
🚧 currently broken!
- Runs `update-data-dictionary.yaml` in order to reflect information found in the dictionary for contributors
❓ status unknown
Recreates the DCA template config
❓ status unknown
- Run the GitHub Action `create-template-config.yml` when adding new templates
If you want to make changes to the data model and test them out by generating manifests with schematic, you can use the devcontainer in this repo with a GitHub Codespace. This will open a container in a remote instance of VSCode and install the latest version of schematic.
Codespace secrets:
- `SYNAPSE_PAT`: scoped to view and download permissions on the `sysbio-dcc-tasks-01` Synapse service account
- `SERVICE_ACCOUNT_CREDS`: credentials for using the Google Sheets API with schematic
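Before running schematic in a codespace, it can help to confirm those secrets are actually exposed as environment variables. A small sketch; the variable names come from the list above, but the helper itself is hypothetical:

```python
import os

REQUIRED_SECRETS = ["SYNAPSE_PAT", "SERVICE_ACCOUNT_CREDS"]

def missing_secrets(env: dict) -> list[str]:
    """Return the names of required secrets that are absent or empty in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]

missing = missing_secrets(dict(os.environ))
if missing:
    print("Missing codespace secrets:", ", ".join(missing))
else:
    print("All codespace secrets are set.")
```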
Schematic API Visualization Repository
- Creates a network graph of the data model. Aim is to help see connections between components.
Software packages installed
- Poetry - See installation guide here
- `EL.data.model.csv`: The CSV representation of the example data model. This file is created by the collective effort of data curators and annotators from a community (e.g. ELITE), and will be used to create a JSON-LD representation of the data model.
- `EL.data.model.jsonld`: The JSON-LD representation of the example data model, which is automatically created from the CSV data model using the schematic CLI. More details on how to convert the CSV data model to the JSON-LD data model can be found here. This is the central schema (data model) which will be used to power the generation of metadata manifest templates for various data types (e.g., scRNA-seq Level 1) from the schema.
- `config.yml`: The schematic-compatible configuration file, which allows users to specify values for application-specific keys (e.g., path to Synapse configuration file) and project-specific keys (e.g., Synapse fileview for community project). A description of what the various keys in this file represent can be found in the Fill in Configuration File(s) section of the schematic docs.
After cloning the repository, run the following command:

```shell
poetry install
```
./change-log.md