This repository is part of the Monarch Initiative's modular data ingestion pipeline for the Monarch Knowledge Graph. The transform converts the ClinGen variant file to the Biolink Model format using the Koza transform library. It runs as a GitHub Action; the results are stored as GitHub artifacts in the release and imported into the Monarch Knowledge Graph at data.monarchinitiative.org.
The current releases create Biolink Model nodes for SequenceVariant and edges for VariantToDiseaseAssociation and VariantToGeneAssociation.
Two files are used in this clingen-ingest transform. Specific file locations and information can be found in the download.yaml file.
- The ClinGen variant file (clingen_variants.tsv) from the ClinGen site. This is a tab-separated file with one variant per line; its columns are all listed in the transform.yaml file.
- The mapping file for HGNC Gene Symbols to Gene IDs (hgnc_complete_set.txt), downloaded from the FTP site of EMBL's European Bioinformatics Institute.
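As a rough illustration of how the HGNC mapping file could be turned into a symbol-to-ID lookup, here is a minimal sketch. It assumes the file is tab-separated with `symbol` and `hgnc_id` columns, as in the published HGNC complete set. The function name `build_symbol_to_hgnc_id` is hypothetical and is not taken from this repository's transform.py.

```python
import csv
import io


def build_symbol_to_hgnc_id(handle):
    """Map each gene symbol to its HGNC ID from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Skip rows without a symbol so the lookup never contains an empty key
    return {row["symbol"]: row["hgnc_id"] for row in reader if row.get("symbol")}


# Small in-memory stand-in for hgnc_complete_set.txt (only three columns shown)
sample = io.StringIO(
    "hgnc_id\tsymbol\tname\n"
    "HGNC:1100\tBRCA1\tBRCA1 DNA repair associated\n"
    "HGNC:11998\tTP53\ttumor protein p53\n"
)
symbol_to_id = build_symbol_to_hgnc_id(sample)
print(symbol_to_id.get("BRCA1"))
```

With the real file, the same function could be called on `open("hgnc_complete_set.txt", newline="")`; the actual location of the downloaded file is defined in download.yaml.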
SequenceVariant nodes are only created from ClinGen variants that are deemed Pathogenic or Likely Pathogenic. We believe this subset provides the most credible information without cluttering the graph with information that may be difficult to interpret. All variants listed as Likely Benign or Benign are currently discarded; we intend to investigate whether importing this additional information into the KG can be done in an informative way.
Each variant node is assigned its 'ClinVar Variation Id' as its ID if available; otherwise its 'Allele Registry Id' is used. Gene IDs are mapped from Gene Symbols using the HGNC Gene Map.
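The filtering and ID selection rules above can be sketched roughly as follows. The column names 'Assertion', 'ClinVar Variation Id', and 'Allele Registry Id' come from the description above. The `CLINVAR:` and `CAID:` prefixes, the helper names, and the sample rows are assumptions made for illustration and may differ from what transform.py actually does.

```python
# Assertions retained by the ingest, per the description above
KEPT_ASSERTIONS = {"Pathogenic", "Likely Pathogenic"}


def is_kept(row):
    """Return True when the row's assertion is one the ingest keeps."""
    return (row.get("Assertion") or "").strip() in KEPT_ASSERTIONS


def pick_variant_id(row):
    """Prefer the ClinVar Variation Id; fall back to the Allele Registry Id."""
    clinvar = (row.get("ClinVar Variation Id") or "").strip()
    if clinvar:
        return f"CLINVAR:{clinvar}"  # prefix is an assumption
    allele = (row.get("Allele Registry Id") or "").strip()
    return f"CAID:{allele}" if allele else None  # prefix is an assumption


# Made-up rows for illustration only, not real ClinGen records
rows = [
    {"Assertion": "Pathogenic", "ClinVar Variation Id": "0000001", "Allele Registry Id": "CA0000001"},
    {"Assertion": "Likely Pathogenic", "ClinVar Variation Id": "", "Allele Registry Id": "CA0000002"},
    {"Assertion": "Benign", "ClinVar Variation Id": "0000003", "Allele Registry Id": "CA0000003"},
]
variant_ids = [pick_variant_id(r) for r in rows if is_kept(r)]
print(variant_ids)
```

In this sketch the third row is dropped by the assertion filter, and the second row falls back to its Allele Registry Id because its ClinVar field is empty.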
VariantToDiseaseAssociation edges are created for each variant using the variant ID from above and the 'Mondo Disease ID' from the ClinGen variant file. The 'Assertion' column is used to determine the predicate of the edge and these edges are only created for variants that are Pathogenic or Likely Pathogenic.
VariantToGeneAssociation edges are created for each variant using the variant ID from above and the gene ID mapped from the HGNC Gene Map using Gene Symbols. These edges are only created for variants that are Pathogenic or Likely Pathogenic.
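To make the two edge types concrete, the sketch below assembles them as plain dictionaries. It is not the repository's implementation: the predicate strings, the 'Gene Symbol' column name, the function name `build_edges`, and the sample identifiers are all placeholders chosen for illustration. The actual assertion-to-predicate mapping is defined in transform.py.

```python
# Hypothetical assertion-to-predicate mapping (placeholder values)
PREDICATE_BY_ASSERTION = {
    "Pathogenic": "biolink:causes",
    "Likely Pathogenic": "biolink:contributes_to",
}
# Hypothetical predicate for the variant-to-gene edge
GENE_PREDICATE = "biolink:is_sequence_variant_of"


def build_edges(row, variant_id, symbol_to_gene_id):
    """Return the disease and gene edges for one variant row, or [] if it is filtered out."""
    predicate = PREDICATE_BY_ASSERTION.get(row.get("Assertion"))
    if predicate is None:
        # Not Pathogenic or Likely Pathogenic: no edges are created
        return []
    edges = [{
        "category": "biolink:VariantToDiseaseAssociation",
        "subject": variant_id,
        "predicate": predicate,
        "object": row["Mondo Disease ID"],
    }]
    # 'Gene Symbol' is an assumed column name; see transform.yaml for the real one
    gene_id = symbol_to_gene_id.get(row.get("Gene Symbol"))
    if gene_id:
        edges.append({
            "category": "biolink:VariantToGeneAssociation",
            "subject": variant_id,
            "predicate": GENE_PREDICATE,
            "object": gene_id,
        })
    return edges


# Made-up identifiers for illustration only
gene_map = {"GENE1": "HGNC:0000"}
row = {"Assertion": "Pathogenic", "Mondo Disease ID": "MONDO:0000000", "Gene Symbol": "GENE1"}
edges = build_edges(row, "CLINVAR:0000001", gene_map)
benign_edges = build_edges({"Assertion": "Benign"}, "CLINVAR:0000003", gene_map)
print(len(edges), len(benign_edges))
```

Keeping the predicate lookup in a dictionary means a row with any other assertion yields no edges, which mirrors the Pathogenic or Likely Pathogenic restriction described above.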
- Python >= 3.10
- Poetry
Upon creating a new project from the cookiecutter-monarch-ingest template, you can install and test the project:

    cd clingen-ingest
    make install
    make test

There are a few additional steps to complete before the project is ready for use.
- Create a new repository on GitHub.
- Enable GitHub Actions to read and write to the repository (required to deploy the project to GitHub Pages).
  - In GitHub, go to Settings -> Actions -> General -> Workflow permissions and choose read and write permissions.
- Initialize the local repository and push the code to GitHub. For example:

      cd clingen-ingest
      git init
      git remote add origin https://github.com/<username>/<repository>.git
      git add -A && git commit -m "Initial commit"
      git push -u origin main

- Edit the download.yaml, transform.py, transform.yaml, and metadata.yaml files to suit your needs.
  - For more information, see the Koza documentation and kghub-downloader.
- Add any additional dependencies to the pyproject.toml file.
- Adjust the contents of the tests directory to test the functionality of your transform.
- Update this README.md file with any additional information about the project.
- Add any appropriate documentation to the docs directory.
Note: After the GitHub Actions workflow for deploying documentation runs, the documentation will be automatically deployed to GitHub Pages.
However, you will need to go to the repository settings and set the GitHub Pages source to the gh-pages branch, using the /docs directory.
This project is set up with several GitHub Actions workflows.
You should not need to modify these workflows unless you want to change the behavior.
The workflows are located in the .github/workflows directory:
- test.yaml: Run the pytest suite.
- create-release.yaml: Create a new release once a week, or manually.
- deploy-docs.yaml: Deploy the documentation to GitHub Pages (on pushes to main).
- update-docs.yaml: After a release, update the documentation with node/edge reports.
Once you have completed these steps, you can remove this section from the README.md file.

    cd clingen-ingest
    make install
    # or
    poetry install

Note that the make install command is just a convenience wrapper around poetry install.
Once installed, you can check that everything is working as expected:

    # Run the pytest suite
    make test
    # Download the data and run the Koza transform
    make download
    make run

This project is set up with a Makefile for common tasks.
To see available options:

    make help

Download the data for the clingen_ingest transform:

    poetry run clingen_ingest download

To run the Koza transform for clingen-ingest:

    poetry run clingen_ingest transform

To see available options:

    poetry run clingen_ingest download --help
    # or
    poetry run clingen_ingest transform --help

To run the test suite:

    make test

This project was generated using monarch-initiative/cookiecutter-monarch-ingest.
Keep this project up to date using cruft by occasionally running in the project directory:

    cruft update

For more information, see the cruft documentation.