Welcome to the central repository for the Human Microbiome Compendium, an ongoing project to process and integrate hundreds of thousands of human microbiome samples using publicly available data from the members of the International Nucleotide Sequence Database Collaboration. In short, we retrieve sequencing data from the BioProject and Sequence Read Archive databases, process it with a uniform pipeline, and combine it into one large dataset that can be used as a training set for machine-learning models, as a conventional dataset for microbiome analysis, or as additional context for samples of your own.
If you have feedback, questions or issues with the HMC dataset or code, you're in the right place:
- For bug fixes, feature requests and suggestions, please submit a new Issue.
- For troubleshooting help or questions about a particular use case, please submit a new Discussion topic.
- For privacy or security concerns, please see our security policy.
Information about the project is spread across several locations:
- Our website at microbiomap.org provides information about the project and the most up-to-date links for announcements, releases and publications.
- The `compendium_website` repository contains the code for microbiomap.org.
- This repository contains the code for the compendium management software developed to automate steps for processing and quality control.
- The `snakemake-compendium` repository contains the pipeline code used to process samples. Compendium Manager launches individual instances of this pipeline for each BioProject.
- The `MicroBioMap` repository contains the code for the R package developed to streamline the retrieval and loading of compendium data.
Please note, the application and pipeline code in these repositories represent the most up-to-date processes in use by the compendium for what will be the 2.0 release. Code used for processing all 1.x releases of the compendium is archived on Zenodo.
This is the command-line utility being developed for use with the Human Microbiome Compendium, but it may be useful to others looking to deal with bulk genomic data from NCBI. It ingests metadata about BioSamples, pulls together enough information to download the deposited FASTQ files from the Sequence Read Archive, and deploys individual Snakemake pipelines to process each project separately.
This software has not yet had a public 1.0 release, nor has it been tested on anything other than our very specific use case—please keep this in mind when using this tool or submitting feedback.
This "ingest service" is designed primarily to launch Snakemake pipelines, monitor their progress, and collect their results upon completion. However, there are multiple steps that happen before and after the pipelines are deployed. Broadly, these are the steps that make up the complete workflow:
- Search results are exported from the BioSample web portal (see below) into an XML file. This file forms the core of the data used by the ingest service, which parses the XML, extracts the relevant data, and saves it to a local SQLite database (sketched after this list).
- The application uses the NCBI "E-utilities" to retrieve enough information about each BioSample to associate each one with entries in the Sequence Read Archive. (This can take a long time for large collections of samples.)
- The application reviews the list of projects and locates one that has not yet been processed. A copy of the Snakemake pipeline is created for this project.
- The application submits a batch job to the HPC scheduler that launches Snakemake (see the submission sketch after this list). This job is Snakemake's "supervisor" process: it runs for the duration of the pipeline, and Snakemake uses it to monitor progress through the pipeline. Snakemake submits its own batch jobs for each sample in each step of the pipeline. The application itself then exits.
- The final step of the Snakemake pipeline, if it completes successfully, is to submit a new batch job that runs the application's "forward" command. This command prompts the application to check the status of all currently running projects and update its records.
- If the application observes any projects that have completed since it last checked, it loads summary information about the results and validates that the project's results are acceptable (see the QC sketch after this list): whether the proportion of chimeric reads is too high, for example, or whether a suspiciously low proportion of forward reads were matched to reverse reads. If paired reads cannot be reliably merged, reverse reads are discarded and the project is reprocessed as a single-end dataset.
- If the project passes all quality control checks, the results are parsed out of the pipeline's text file output and loaded into the local database.
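To make the ingest step concrete, here is a minimal sketch of parsing a BioSample XML export into a local SQLite database. The database path, table schema, and element handling are assumptions for illustration; the application's real schema is defined in its own code.

```python
# A minimal sketch of the ingest step, assuming a hypothetical database
# path and table layout; the real schema lives in the application code.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("compendium.db")  # database path is an assumption
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples (accession TEXT PRIMARY KEY, project TEXT)"
)

# BioSample full-XML exports wrap each sample in a <BioSample> element
for _, elem in ET.iterparse("biosample_result.xml"):
    if elem.tag == "BioSample":
        conn.execute(
            "INSERT OR IGNORE INTO samples VALUES (?, ?)",
            (elem.get("accession"), None),  # project is resolved later via E-utilities
        )
        elem.clear()  # free memory when iterating over large exports
conn.commit()
conn.close()
```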
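The batch-job submission step can be sketched in a similar spirit. The script name, job name, and `sbatch` flags below are placeholders, not the application's actual values.

```python
# Hypothetical sketch of launching the Snakemake "supervisor" job via SLURM.
# The job name and script are placeholders for illustration only.
import subprocess

result = subprocess.run(
    ["sbatch", "--job-name=PRJNA000000_supervisor", "run_pipeline.sh"],
    capture_output=True, text=True, check=True,
)
# sbatch prints e.g. "Submitted batch job 123456"; keep the job ID for tracking
job_id = result.stdout.strip().split()[-1]
print(f"Supervisor job submitted: {job_id}")
```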
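Finally, a sketch of the quality-control gate described above, with made-up threshold values; the real cutoffs are set in the application's configuration.

```python
# A sketch of the post-run QC gate. Both thresholds are invented examples,
# not the compendium's actual cutoffs.
def passes_qc(stats: dict) -> bool:
    """Reject projects with too many chimeras or too few merged read pairs."""
    if stats["chimera_fraction"] > 0.25:  # assumption: example threshold
        return False
    if stats["merged_fraction"] < 0.50:   # assumption: example threshold
        return False
    return True
```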
This application requires Python 3 and has been tested with Python 3.9.6. The CLI should work as expected on macOS and Linux, but the Snakemake pipelines it invokes for processing the raw data will likely only work on Linux. Please see the associated pipeline repository for details.
From the command line of your choice, run these commands:

```sh
git clone [email protected]:blekhmanlab/compendium.git
cd compendium

# optional: establish a virtual environment
python -m venv venv
source venv/bin/activate

pip install -r requirements.txt
```
Copy the `config_template.py` file and name it `config.py`. All of the options are editable, but the following variables must be set so they can be appended to API requests sent to NCBI:
- Tool: The name of the application that's collecting data.
- Email: A contact address NCBI administrators can reach.
- Key: An NCBI API key, available for free.
There are many other options that can be tweaked in the config file. Comments around each value explain their use.
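For illustration, a hypothetical excerpt of what those values might look like in `config.py`. The variable names here are assumptions; check `config_template.py` for the real ones.

```python
# Hypothetical excerpt of config.py; the actual variable names are defined
# in config_template.py and may differ from these.
TOOL = "compendium_ingest"         # name of the application collecting data
EMAIL = "you@university.edu"       # contact address NCBI administrators can reach
NCBI_API_KEY = "0123456789abcdef"  # NCBI API key, free with an NCBI account
```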
The command line interface contains help text for all commands and parameters. All commands are nested under the primary entity being addressed:

- `project` – Commands related to processing or evaluating data for a single BioProject.
- `compendium` – Commands dealing with data spanning multiple projects.
For example:
```sh
python main.py project --help
python main.py compendium xml --help
```
We currently extract relevant samples from search results on the BioSample website using this query:
```
txid408170[Organism:noexp] AND ("public"[filter] AND "biosample sra"[filter])
```
Once the results are displayed, select "Send to" > "File" with the "Full XML (text)" format. This will probably take a while.
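Before committing to a full export, you can sanity-check the hit count for the same query with NCBI's esearch E-utility. This is only an illustration, not part of the application's workflow; the `email` and `api_key` values below are placeholders.

```python
# Run the same BioSample query through NCBI's esearch E-utility to check
# the hit count before exporting. Credentials below are placeholders.
import requests

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "biosample",
        "term": 'txid408170[Organism:noexp] AND ("public"[filter] AND "biosample sra"[filter])',
        "retmax": 0,                  # we only want the <Count> element
        "email": "you@university.edu",  # placeholder contact address
        "api_key": "your-key-here",     # placeholder NCBI API key
    },
    timeout=30,
)
print(resp.text)  # XML response containing <Count>...</Count>
```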
The `status` table tracks the state of all projects that have been referred to by the CLI in some way. (Projects that had their metadata loaded, but were never processed, are not there.) The possible conditions are:

- `accession_list_created`: A directory has been created for the project and a list of its samples has been added.
- `running`: A job has been submitted to SLURM for processing.
- `to_re_run`: A paired-end project has been flagged to be re-run as single-end. (This changes to `running` once it's actually been submitted, which happens almost immediately.)
- `failed`: Terminal status. Indicates the project was discarded.
- `complete`: Project is done and had acceptable results that were successfully loaded into the database. Not a terminal status.
- `archived`: The project's results were stored in a tarball.
- `done`: Terminal status. Indicates the project's results were loaded into the database, its files archived, and its other files cleaned up and deleted.
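As a quick way to inspect these states, here is a sketch that tallies projects by status directly from the local database. The database filename and column name are assumptions for illustration.

```python
# Hypothetical: tally project states in the local SQLite database.
# The database path and column name are assumptions based on the
# description above; only the table name ("status") comes from the docs.
import sqlite3

conn = sqlite3.connect("compendium.db")  # path is an assumption
for state, count in conn.execute(
    "SELECT status, COUNT(*) FROM status GROUP BY status"
):
    print(f"{state}: {count}")
conn.close()
```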