A Nextflow workflow for uploading Human Tumor Atlas Network (HTAN) data to the Cancer Research Data Commons (CRDC).
- Overview
- Features
- Repository Structure
- Prerequisites
- Configuration
- Input Data Format
- Running the Workflow
- Outputs
- Error Handling & Logging
- Contact & Contribution
This pipeline facilitates the preparation and upload of HTAN data into the CRDC system. It is implemented using Nextflow, and handles the required formatting, schema validation, metadata, and data transfer steps to ensure compatibility with CRDC submission requirements.
- Validates metadata against a predefined schema.
- Manages a sample sheet (TSV) for tracking inputs.
- Supports configurable parameters via
nextflow.config. - Includes a structured directory layout (e.g.
conf/) for configuration. - Integration options for monitoring (e.g., Tower) via
tower.yaml.
Here’s a breakdown of the key files and directories:
| Path | Description |
|---|---|
main.nf |
The main Nextflow workflow script. |
nextflow.config |
Configuration file for runtime settings, resource specification, containerization, etc. |
conf/ |
Directory for configuration templates or schema files. |
nextflow_schema.json |
JSON schema (or related spec) that metadata must adhere to. |
samplesheet.tsv |
Sample sheet for listing input data files and related metadata fields. |
assets/ |
Supporting scripts, utilities, or static assets used during the upload process. |
tower.yaml |
Spec for launching / integrating with Nextflow Tower (if used). |
To use this workflow, you’ll need:
- Nextflow installed on your system.
- Docker (or another container engine compatible with Nextflow) if using containerized steps.
- Access credentials / permissions for CRDC / HTAN as required.
Configuration is handled via nextflow.config and related files. Some key settings you’ll likely need to customize:
- Paths to input data (e.g. raw HTAN data).
- Output directories.
- Schema validation options (ensuring metadata complies with CRDC expected formats).
- Container / environment settings (Docker images, volumes, memory, CPUs, etc.).
- Optional monitoring / logging (e.g. via Nextflow Tower).
Data should be provided via a samplesheet in TSV format (samplesheet.tsv), which lists all files to be uploaded along with their associated metadata fields. The metadata must conform to the nextflow_schema.json schema.
Ensure that all required metadata fields (as defined by CRDC / HTAN) are present and valid. Missing or malformed metadata will likely cause validation or upload failures.
Here is an example of how to run the pipeline:
nextflow run main.nf \
--samplesheet path/to/samplesheet.tsv \
--outdir path/to/output_directory \
[other parameters as configured]You may also override configuration values:
nextflow run main.nf \
-c path/to/your_config_file \
--samplesheet samplesheet.tsv \
--outdir output/If using Nextflow Tower or similar monitoring tools, use the tower.yaml spec as appropriate.
After a successful run, you should expect:
- Metadata files formatted and validated against CRDC schema.
- Data files (images, sequencing, etc.) organized in the directory structure expected by CRDC / HTAN.
- Logs / reports of validation, transformation, or upload steps.
- Any manifest or submission files needed for CRDC ingestion.
- The workflow logs each step; errors in schema validation or upload are reported in the logs.
- Nextflow’s built-in retry/resume functionality can be used to resume after partial failures.
- Keep an eye on resource usage — memory, CPU, disk — especially when handling large files.
If you encounter issues, have feature requests, or wish to contribute:
- Open an issue in this repository.
- Submit pull requests with clear descriptions of changes and tests where applicable.
- Contact the maintainers for access or credential questions.