Skip to content

Conversation

@adamjtaylor
Copy link

This pull request adds a new MultiQC reporting step to the DCQC workflow, enabling automated generation of validation reports from DCQC suite results. The update introduces a custom MultiQC plugin (multiqc_dcqc) for parsing and visualizing validation results, integrates its execution into the workflow, and provides configuration and documentation for usage and development.

MultiQC Integration and Workflow Updates

  • Added a new MULTIQC process to the workflow, which runs MultiQC using a custom plugin and configuration, and stages its output in a dedicated directory. (modules/local/multiqc.nf, conf/modules.config, workflows/dcqc.nf)

MultiQC DCQC Plugin Implementation

  • Introduced the multiqc_dcqc plugin, including its initialization, configuration, and entry point for MultiQC, to parse and display DCQC validation results in the report. (multiqc/multiqc_dcqc/__init__.py, multiqc/setup.py)
  • Added a sample MultiQC configuration file for the plugin to ensure correct module loading and search patterns for validation result files. (multiqc/multiqc_config.yaml)

Documentation and Development Support

  • Provided a comprehensive README for the MultiQC DCQC plugin, detailing features, installation, usage, expected input format, configuration, and output sections. (multiqc/multiqc_dcqc/README.md)
  • Added a .gitignore for the plugin directory to exclude Python build artifacts and development files. (multiqc/.gitignore)

@adamjtaylor adamjtaylor requested a review from a team as a code owner November 12, 2025 12:09
num_files = 0

for f in self.find_log_files("dcqc_validation", filehandles=True):
log.debug(f"Processing file: {f['fn']}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good candidate for an info level log message so it's clear to report back where in the process it currently is

Comment on lines +129 to +132
log.warning(f"Could not parse JSON from {f['fn']}: {e}")
continue
except Exception as e:
log.warning(f"Error processing {f['fn']}: {e}")
Copy link

@BryanFauble BryanFauble Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than stringifying the exception into the log message I would recommend leaving the exception out, but setting exc_info to True.

https://docs.python.org/3/library/logging.html#logging.Logger.debug

If exc_info does not evaluate as false, it causes exception information to be added to the logging message. If an exception tuple (in the format returned by sys.exc_info()) or an exception instance is provided, it is used; otherwise, sys.exc_info() is called to get the exception information.

for suite_type in sorted(self.suite_types.keys()):
self.add_suite_type_sections(suite_type)

def parse_suites_json(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Adding return typing to all of the method calls makes it a bit easier to understand what's going on when someone looks at the code later on. In this case I think it's just an int return type.

There is also quite a bit of nuance added as in-line comments throughout the code that describe what is going on - That is better served in the method docstring so it's all in one place.

@BryanFauble BryanFauble requested a review from a team November 12, 2025 17:43
Setup script for multiqc_dcqc MultiQC plugin
"""

from setuptools import setup, find_packages

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use pyproject.toml instead of setup.py? The latter is now deprecated in favor of the modern PEP 517/518 standards.

Claude gave me this as the replacement file:

[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "multiqc_dcqc"
version = "0.1.0"
description = "MultiQC plugin for DCQC validation results"
readme = {text = """
A MultiQC plugin that parses and visualizes validation results
from the DCQC (Data Quality Control) suite. It displays validation status,
test results, and detailed error messages for data files validated against
various schema requirements (including HTAN and CELLxGENE).
""", content-type = "text/plain"}
authors = [
    {name = "HTAN Data Coordinating Center"}
]
license = {text = "MIT"}
requires-python = ">=3.8"
dependencies = [
    "multiqc>=1.14",
]
classifiers = [
    "Development Status :: 4 - Beta",
    "Environment :: Console",
    "Environment :: Web Environment",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: MIT License",
    "Natural Language :: English",
    "Operating System :: OS Independent",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Topic :: Scientific/Engineering",
    "Topic :: Scientific/Engineering :: Bio-Informatics",
]

[project.urls]
Homepage = "https://github.com/ncihtan/dcqc"
Repository = "https://github.com/ncihtan/dcqc"

[project.entry-points."multiqc.modules.v1"]
dcqc_validation = "multiqc_dcqc.dcqc_validation:MultiqcModule"

[tool.setuptools]
include-package-data = true
zip-safe = false

[tool.setuptools.packages.find]
where = ["."]
include = ["multiqc_dcqc*"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/ncihtan/dcqc doesn't seem to exist, and should swap over to https://github.com/Sage-Bionetworks-Workflows/nf-dcqc (I assume)

Comment on lines +30 to +33
# Install the custom MultiQC plugin first
cd ${plugin_dir}
pip install --user --no-deps .
cd -

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to be ran often it would make a lot more sense for the container to be published to GHCR that can be ran without needing to run this setup script each and every time. That way it's already baked into a published image that is used instead of:

    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/multiqc:1.15--pyhdfd78af_0' :
        'quay.io/biocontainers/multiqc:1.15--pyhdfd78af_0' }"

@BryanFauble BryanFauble requested a review from a team November 12, 2025 18:00
label 'process_single'

conda "bioconda::multiqc=1.15"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
Copy link

@rxu17 rxu17 Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fixed to a specific container version + id? Does it make sense to have it possible to pass a custom docker image so if it updates, can just pass that in over having to update this path, creating a PR and so forth? I was looking at here:

nf-dcqc/docs/usage.md

Lines 98 to 122 in baaf4c1

> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation).
Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles.
If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended, since it can lead to different results on different machines dependent on the computer enviroment.
- `test`
- A profile with a complete configuration for automated testing
- Includes links to test data so needs no other parameters
- `docker`
- A generic configuration profile to be used with [Docker](https://docker.com/)
- `singularity`
- A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/)
- `podman`
- A generic configuration profile to be used with [Podman](https://podman.io/)
- `shifter`
- A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/)
- `charliecloud`
- A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/)
- `conda`
- A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.


## Output

The plugin generates three main sections in the MultiQC report:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a sample output screenshot you could add here?


## Data Format

The plugin expects a `suites.json` file with the following structure:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it expect all of the following fields you listed to be filled out or which ones are optional?

for f in self.find_log_files("dcqc_validation", filehandles=True):
log.debug(f"Processing file: {f['fn']}")
num_files += 1
try:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add deeper validation for the suites.json file here while parsing for the user (e.g: more debugging messages to know where their structure is failing) or perhaps even suggest a simple json tool for validating? Or does the suites.json file input come pre-validated?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be worth adding a unit test suite for this python module and potentially adding it to the CI/CD outside of the usual dcqc integration tests (unless that is enough coverage)

@andrewelamb
Copy link
Contributor

andrewelamb commented Nov 18, 2025

@adamjtaylor Thanks for this! That's quite a bit of Python code to live in a Nextflow tool. I think it would make a lot more sense to move most/all of the Python code over to PY-DCQC. This is the Python library that is solely for CLI tools that NF-DCQC uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants