Copyright © 2023-2025 SDSC - Swiss Data Science Center.
Licensed under the MIT License - see the LICENSE file for details.
Funded by SPHN and PHRT.
This repository includes "digital infrastructure" code from a Swiss National Data Stream (NDS): LUCID. The general goal of the NDS initiative is to collect clinical data across five Swiss University Hospitals and share it with researchers. In the case of LUCID, research focuses on low-value care: services that provide little or no benefit to patients. If you're interested, check the official project page.
Digital infrastructure for the LUCID project is also available in these repositories.
This repository provides an automated pipeline to perform (RDF) data validation with two flows:
- Success: upon successful validation, data is provided in the output folder
- Failure: upon unsuccessful validation, a report is generated in the notification folder
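As a quick sketch of how downstream tooling might branch on these two flows (the folder paths are placeholders matching the `--output_dir` and `--notification_dir` examples later in this README, not fixed defaults):

```bash
# Hypothetical post-run check; adjust the paths to your
# --output_dir / --notification_dir values
NOTIFICATION_DIR=/data/notification
OUTPUT_DIR=/data/target

if compgen -G "$NOTIFICATION_DIR/*" > /dev/null; then
    echo "Validation failed for at least one batch; see reports in $NOTIFICATION_DIR" >&2
    exit 1
else
    echo "Validation succeeded; outputs are in $OUTPUT_DIR"
fi
```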
The code was originally built for the BioMedIT environment, but for reusability most software and tools rely on public containers, so only a few requirements are needed to test it on any machine (see the Requirements section).
The sections below provide more technical details about the pipeline, its implementation and use.
For any question, feel free to open an issue or contact us directly.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Podman containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.
By default, the workflow assumes that:
- Nextflow is installed (>=22.10.1)
- Podman is installed for full pipeline reproducibility (tested on version 4.9.4)
- Basic UNIX utilities are installed: `gzip`, `cat`, `unzip`, `md5sum` and `fdfind`
With the `biomedit` profile, in addition to the points above, the workflow assumes that:

- sett-rs is installed with the command-line interface available (`sett-cli`, tested on version 5.3.0)
- A Nextflow secret `SETT_OPENPGP_KEY_PWD` is set to provide the secret OpenPGP key used to decrypt data
- `jq` is installed and available (>=1.6)
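For instance, the secret can be registered once with Nextflow's built-in `nextflow secrets` command, and the remaining tools checked for availability (the passphrase value below is a placeholder, and the `--version` flags are the conventional ones rather than project-verified):

```bash
# Store the OpenPGP key passphrase as a Nextflow secret (placeholder value)
nextflow secrets set SETT_OPENPGP_KEY_PWD 'my-openpgp-passphrase'

# Sanity-check that the required tools are on the PATH
nextflow -version
podman --version
jq --version
sett-cli --version   # only needed for the biomedit profile
```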
See usage instructions for more information.
1. Check for new `zip` files or rerun the workflow on current `zip` files in the source directory
2. Decrypt and decompress files to extract datasets
   - With the `biomedit` profile, metadata is extracted and used to rename the datasets' directory
3. Create batches of data with sizes defined in `nextflow.config`
4. One by one, each data batch is bundled with external terminologies and validated using SPHN SHACL rules:
   a. For invalid or empty batches: create a file with the datasets and their errors in a notification folder
   b. For valid batches: continue to steps 5-7
5. Convert valid datasets to `nt` format (a standalone sketch of steps 5-6 follows this list)
6. Compress all `nt` files into `gzip`
7. Move `gzip` files to the output directory
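The conversion and compression in steps 5-6 happen inside the pipeline's containerized processes; as a rough standalone illustration (assuming Apache Jena's `riot` is installed, which is not a requirement of this pipeline), the equivalent transformation of a single Turtle file would be:

```bash
# Standalone sketch of steps 5-6; the pipeline performs these in containers
riot --output=ntriples dataset.ttl > dataset.nt   # convert to N-Triples
gzip dataset.nt                                   # produces dataset.nt.gz
```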
The flowchart below summarizes the pipeline:

```mermaid
flowchart TD
    input_dir(input_dir)
    sett_unpack[sett_unpack]
    patient_data(patient_data)
    bundled_batches(bundled_batches)
    val[validation]
    report(report.ttl)
    exit(exitStatus)
    copy[copy_to_output]
    compress[compress]
    output_dir[output_dir]
    notification[send_notification]
    input_dir -->|ch_sett_pkg| check_integrity
    check_integrity --> get_sett_metadata
    check_integrity --> unpack
    unpack --> patient_data
    subgraph biomedit
        get_sett_metadata --> sett_unpack
    end
    subgraph config
        input_dir
        output_dir
        SPHN_SHACL_shapes
        SPHN_schema
        terminologies
    end
    output_dir --> sett_unpack
    output_dir --> unpack
    sett_unpack --> patient_data
    patient_data --> |batching| bundled_batches
    SPHN_SHACL_shapes --> val
    terminologies --> |nt_converter| enriched_terms
    SPHN_schema --> |nt_converter| enriched_terms
    enriched_terms --> bundled_batches
    val --> report
    val --> exit
    exit --> |if !=0| notification
    exit --> |else| nt_converter
    nt_converter --> compress
    compress --> copy
    bundled_batches --> |if empty| notification
    bundled_batches --> val
```
See usage docs for all of the available options when running the pipeline.
1. Download the pipeline and test it on a minimal dataset with a single command:

   ```bash
   nextflow run main.nf -profile standard,test
   ```

   Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`test` in the example command above). You can chain multiple config profiles in a comma-separated string.

2. Start running your own analysis!

   ```bash
   nextflow run main.nf -profile standard,test --input_dir /data/source --output_dir /data/target --notification_dir /data/notification --shapes /data/shapes
   ```
To use ingestion within the BioMedIT system, we advise pointing the Nextflow working directory to a folder on a separate partition with sufficient free space and appropriate permissions.
To have the pipeline constantly monitor for incoming data:

```bash
nextflow run main.nf -profile biomedit -w /data/work/
```
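Since the monitoring mode keeps running until interrupted, one deployment option (an operational suggestion, not a project requirement) is to detach the process and keep its log:

```bash
# Run the monitoring pipeline in the background, keeping a log for inspection
# (Nextflow's own -bg option is an alternative to nohup)
nohup nextflow run main.nf -profile biomedit -w /data/work/ \
    > /data/work/ingestion.log 2>&1 &
echo "Ingestion pipeline started with PID $!"
```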
To re-run the pipeline on already landed data:

```bash
nextflow run main.nf -profile biomedit -w /data/work/ --rerun=true
```
nds-lucid/ingestion was originally written by Stefan Milosavljevic and Cyril Matthey-Doret.
Cite this work using the citation information from the GitHub menu on the right or from the Zenodo DOI record, or as below (APA style):
Milosavljevic, S., Matthey-Doret, C., & Riba Grognuz, O. (2025). LUCID BioMedIT Ingestion Pipeline. Zenodo. https://doi.org/10.5281/zenodo.14726408
This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.

> The nf-core framework for community-curated bioinformatics pipelines.
>
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
>
> Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
