Molecular Oncology Almanac db 2.0.0 (draft)

We are in the process of refactoring and updating moalmanac db to align with GA4GH's Variant Annotation Specification (va-spec) and Categorical Variant Representation Specification (Cat-VRS). Both specifications are in development and follow the GA4GH Genomic Knowledge Standards (GKS) Maturity Model. As components of each specification move from draft to trial and then to normative maturity, we will update our schema to align with their recommendations. At the moment, this version of our data schema does not comply with either specification.

This version of the database is under active development. If you have any thoughts, comments, concerns, or suggestions, please contact us!

Here, we'll start preliminary documentation of our interpretation of both of these specifications and how we are implementing them.

Why are we making these changes?
Using a relational schema
Our (in progress) interpretation of va-spec

Why are we making these changes?

Most importantly, we are making this change because our current schema is something that we "just made up" over the course of our original development. There is now an increasing emphasis within the field on interoperability and data standards, and we want moalmanac to communicate with other services as well as possible while providing the most value to our users. Representing our database content within a widely used specification will increase the utility of our service.

Pragmatically, there is also technical debt associated with the current format. While we use a flat JSON schema, it is converted into a SQLite table for use with the moalmanac-browser. The representation of genomic information is particularly troublesome within this format, with nested tables to store attribute definitions and the attributes of each biomarker type. The code that generates the browser's SQLite table can easily cause the ids of assertions, sources, or features to change between database content releases. Over the years this has caused some hiccups with adoption by some users. To complicate matters further, we store database metadata in the version of the database used by the algorithm. As a result, there are three slightly different versions of our database published: the one in this repository, the one used by our browser and accessible through the API, and the one used by the algorithm. We would really like to simplify this but do not want to cause further hiccups for our users. The current format has also made expanding our API endpoints difficult.

About a year ago, in January 2024, we began curating knowledge for European precision oncology approvals (more on this soon!) in the format that GA4GH's genomic knowledge pilot used for moalmanac. Afterwards, we went back and re-curated FDA approvals from scratch, additionally curating indications involving biomarkers of type protein expression, wild type, mismatch repair, and homologous recombination.

Using a relational schema

We are using a relational schema that can be dereferenced to a single JSON file using utils/dereference.py. The genomic knowledge pilot separated data sources into referenced and dereferenced sources, and we are following their recommendation. Each element of the specification can thus live in its own referenced JSON file, and these contents can be mirrored into the SQLite database that will be used by the API, or into another database type if one is chosen. We have noticed two additional benefits: testing the database content is much easier because each element can be evaluated independently, and curation is much faster because we can reference the appropriate record within a data type instead of typing or copying data. In short, using a relational schema better follows the Don't Repeat Yourself (DRY) principle.
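To illustrate the idea of dereferencing, here is a minimal Python sketch. It is not the contents of utils/dereference.py. It assumes, purely for illustration, that a record points to another record through a key ending in `_id` (for example `therapy_id`) and that each referenced file has been loaded into a dictionary keyed by record id. The `dereference` function, the `tables` layout, and the example records are all hypothetical.

```python
import copy


def dereference(record, tables, suffix="_id"):
    """Return a copy of `record` with every '<name>_id' key replaced by
    the full record it points to (hypothetical naming convention)."""
    resolved = {}
    for key, value in record.items():
        if key.endswith(suffix):
            name = key[: -len(suffix)]
            # Look up the referenced record and resolve it recursively.
            resolved[name] = dereference(tables[name][value], tables, suffix)
        else:
            resolved[key] = copy.deepcopy(value)
    return resolved


# Illustrative placeholder data, not actual database content.
tables = {
    "disease": {1: {"id": 1, "name": "Example cancer type"}},
    "therapy": {7: {"id": 7, "name": "Dacarbazine"}},
}
proposition = {"id": 0, "disease_id": 1, "therapy_id": 7}

flat = dereference(proposition, tables)
```

Because each referenced record is stored once and looked up by id, a correction to a therapy or disease record propagates to every record that references it the next time the data is dereferenced.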

Our (in progress) interpretation of va-spec

VA-Spec supports a wide array of proposition types, but at the moment we are only using the Variant Therapeutic Response Study Proposition. Our current draft schema does not follow va-spec, and we are continuing to work to align it with their framework. Here, we'll go through each JSON file within referenced/ and describe each attribute. Two common data types from gks-core, extensions and mappings, are used by several of our data types.

Extensions are a way to capture information that is not directly supported by the data model. They will always have the fields name, value, and description. For example, our model for diseases has an extension that specifies whether the cancer type is categorized as a solid tumor.

"extensions": [
  {
    "name": "solid_tumor",
    "value": true,
    "description": "Boolean value for if this tumor type is categorized as a solid tumor."
  }
]
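Since every extension carries a name and a value, reading one back can be sketched with a small lookup helper. This is a minimal Python sketch, not part of moalmanac; `get_extension` is a hypothetical helper name and the `disease` record is a trimmed illustration built from the example above.

```python
def get_extension(record, name, default=None):
    """Return the value of the first extension called `name`,
    or `default` if the record has no such extension."""
    for extension in record.get("extensions", []):
        if extension.get("name") == name:
            return extension.get("value")
    return default


# Trimmed, illustrative disease record.
disease = {
    "extensions": [
        {
            "name": "solid_tumor",
            "value": True,
            "description": "Boolean value for if this tumor type is categorized as a solid tumor.",
        }
    ]
}

is_solid = get_extension(disease, "solid_tumor")
```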

Mappings are representations of a concept in other systems; in this case, representations of a concept outside of moalmanac. Each mapping is made up of a coding and a relation. gks-core currently allows relation to be populated with broadMatch, closeMatch, exactMatch, narrowMatch, or relatedMatch; at the moment, moalmanac only uses exactMatch or relatedMatch. For example, therapies are mapped to the NCI Enterprise Vocabulary Services:

"mappings": [
  {
    "coding": {
      "id": "ncit:C411",
      "code": "C411",
      "name": "Dacarbazine",
      "system": "https://evsexplore.semantics.cancer.gov/evsexplore/concept/ncit/",
      "systemVersion": "25.01d"
    },
    "relation": "exactMatch"
  }
]
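Because each referenced element can be evaluated independently, a content check for mappings might look like the following minimal Python sketch. It only checks that a coding object is present and that relation is one of the five values listed above. `check_mappings` and `ALLOWED_RELATIONS` are hypothetical names, and this is not a full gks-core validator.

```python
# The five relation values named in the text above.
ALLOWED_RELATIONS = {
    "broadMatch",
    "closeMatch",
    "exactMatch",
    "narrowMatch",
    "relatedMatch",
}


def check_mappings(mappings):
    """Return a list of (index, problem) tuples; an empty list means
    every mapping passed these two basic checks."""
    problems = []
    for index, mapping in enumerate(mappings):
        if not isinstance(mapping.get("coding"), dict):
            problems.append((index, "missing coding"))
        if mapping.get("relation") not in ALLOWED_RELATIONS:
            problems.append((index, "unexpected relation"))
    return problems


# The therapy mapping from the example above.
mappings = [
    {
        "coding": {
            "id": "ncit:C411",
            "code": "C411",
            "name": "Dacarbazine",
            "system": "https://evsexplore.semantics.cancer.gov/evsexplore/concept/ncit/",
            "systemVersion": "25.01d",
        },
        "relation": "exactMatch",
    }
]

problems = check_mappings(mappings)
```

A stricter check could additionally restrict relation to exactMatch and relatedMatch, the only two values moalmanac currently uses.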

Specific extensions and mappings will be explained within their relevant data type.

We also want to give a special thank you to Daniel Puthawala and Kori Kuzama from the Wagner Lab for their help and patience as we've badgered them with questions to understand the GKS ecosystem. Their expertise and the Wagner Lab's normalizers are excellent.
