Our main objective is to specify the minimum information needed to characterise a genomic experiment.
When a researcher downloads a genomic dataset, they typically get CRAM or VCF files, which are the results of a sequencing experiment. However, these files contain little information on the nature of the experiment itself: are the data from whole genome sequencing, transcriptomics, or another kind of experiment? Are the data from a bulk sequencing or a single-cell assay? Have techniques been applied to target specific regions of the genome?
Without metadata explaining the context, researchers cannot make sense of results from experiments in genomics, epigenomics, and more. The GA4GH Discovery Work Stream is aiming to produce a minimal checklist of metadata needed to characterise -omics datasets. The Experiments Metadata Standard will provide a dictionary of properties that makes it easier to search for experiments and to understand their results for analysis.
For more information on our group, please visit our GA4GH web page.
While the term “metadata” can be very broad (data that describes data), this Discovery Work Stream subgroup focuses exclusively on the properties of the methodology and equipment used in a genomic experiment, and more precisely on library preparation and the instrument run. The checklist provides context around the preparation of biological samples into libraries for a given laboratory experiment run, and the execution context for that run. Interoperability with other GA4GH standards will be key to the adoption of the standard.
In the first phase, the group will focus exclusively on genomic sequencing instruments generating reads (high-throughput sequencing experiments, such as WGS, RNA-Seq, and Methyl-Seq). Future specification updates may consider the inclusion of other instruments, quality control metrics and -omics data, such as genotyping arrays, proteomics, and metabolomics, based on the evolving needs within the genomics community. Follow this link to our current working document.
The following topics are therefore considered out of scope (and will remain so): clinical data, biological sample descriptors, downstream data processing, and analysis. The discussions revolve around the content of the checklist, rather than the formats, leaving the latter to the DaMaSC sub-working group.
If you are creating a new resource (dataset / project / platform) and would like to implement this checklist, we suggest looking at both the "core" and "identifiers" sections and considering how each property could apply to, and be inserted in, your data model. If you have questions about specific properties, open an issue in this GitHub repository and we can help.
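As a rough illustration of fitting checklist-style properties into a data model, the sketch below defines an experiment record and reports which properties are still unset. The class name, field names, and example values are all hypothetical placeholders chosen for this example; they are not the property names defined in the "core" or "identifiers" checklists, which should be taken from those documents.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ExperimentMetadata:
    """Hypothetical experiment-level record; field names are illustrative only."""
    experiment_id: str                          # stand-in for an "identifiers" property
    assay_type: Optional[str] = None            # e.g. "WGS", "RNA-Seq", "Methyl-Seq"
    instrument_platform: Optional[str] = None   # free text or an ontology term
    library_preparation: Optional[str] = None   # stand-in for a library preparation property


def missing_properties(record: ExperimentMetadata) -> List[str]:
    """Return the names of fields that are unset (None or an empty string)."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


# A partially described experiment: only the identifier and assay type are known.
record = ExperimentMetadata(experiment_id="EXP-0001", assay_type="RNA-Seq")
print(missing_properties(record))
```

A check like this could run when records are ingested, so that gaps in experiment metadata are flagged before a dataset is published.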
Please have a look at our mappings section.
Two checklist documents are presented for this first version, along with a Mappings section:
- Core: This checklist contains properties that are relevant to any sequencing assay.
- Identifiers: This checklist contains identifiers that are relevant to include with a genomic dataset.
- The Mappings section provides a mapping of existing platforms and projects to the GA4GH Experiments Metadata Checklist.
While the current checklist represents the first version of the standard, the GA4GH Experiments Metadata Standard group is actively planning enhancements for future releases. These will include:
- Categories: A key upcoming milestone is to define further properties specific to various genomic sequencing domains, such as Transcriptomics, Single-Cell Sequencing, Methylation, and Targeted Sequencing. Progress in each category will depend on the level of engagement from the respective communities to help shape and validate these specific properties.
- Ontologies: As we suggest ontologies, guided by GA4GH TASC recommendations, we aim to cover the necessary terms to describe each concept, where appropriate. Initial work has focused on instrument-related terms using OBI and GENEPIO, and this effort will continue for other properties.
- Schema: Providing an optional schema that implementers can adopt to support the checklist, without making its use mandatory.
- Involvement with Beacon: Enabling GA4GH Beacon searches on terms covered by the checklist. This will make it possible, for instance, to query a Beacon node for all RNA-Seq experiments or for data that were sequenced using a specific platform.
- Supporting data generators and repositories: We are actively supporting implementations of the standard in wider data ecosystems that help to generate, store, and discover genomics data.
- Supporting data processing: We plan to support data-processing tools whose developers want to implement the standard.
- Comments received: Many issues in this GitHub repository have been assigned to upcoming versions.
- This video explains the rationale behind the creation of the Experiments Metadata Checklist, highlighting key use cases and outlining future plans.
- The slides are also available.
- Record of past decisions
- Meetings Agenda and Minutes
- The Progress flowchart outlines the steps taken by the Experiments Metadata group in developing the checklist.