-
Notifications
You must be signed in to change notification settings - Fork 3
Add MultiQC plugin for DCQC validation results #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add MultiQC plugin for DCQC validation results #18
Conversation
| num_files = 0 | ||
|
|
||
| for f in self.find_log_files("dcqc_validation", filehandles=True): | ||
| log.debug(f"Processing file: {f['fn']}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good candidate for an info level log message so it's clear to report back where in the process it currently is
| log.warning(f"Could not parse JSON from {f['fn']}: {e}") | ||
| continue | ||
| except Exception as e: | ||
| log.warning(f"Error processing {f['fn']}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than stringifying the exception into the log message I would recommend leaving the exception out, but setting exc_info to True.
https://docs.python.org/3/library/logging.html#logging.Logger.debug
If exc_info does not evaluate as false, it causes exception information to be added to the logging message. If an exception tuple (in the format returned by sys.exc_info()) or an exception instance is provided, it is used; otherwise, sys.exc_info() is called to get the exception information.
| for suite_type in sorted(self.suite_types.keys()): | ||
| self.add_suite_type_sections(suite_type) | ||
|
|
||
| def parse_suites_json(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Adding return typing to all of the method calls makes it a bit easier to understand what's going on when someone looks at the code later on. In this case I think it's just an int return type.
There is also quite a bit of nuance added as in-line comments throughout the code that describe what is going on - That is better served in the method docstring so it's all in one place.
| Setup script for multiqc_dcqc MultiQC plugin | ||
| """ | ||
|
|
||
| from setuptools import setup, find_packages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use pyproject.toml instead of setup.py? The latter is now deprecated in favor of the modern PEP 517/518 standards.
Claude gave me this as the replacement file:
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "multiqc_dcqc"
version = "0.1.0"
description = "MultiQC plugin for DCQC validation results"
readme = {text = """
A MultiQC plugin that parses and visualizes validation results
from the DCQC (Data Quality Control) suite. It displays validation status,
test results, and detailed error messages for data files validated against
various schema requirements (including HTAN and CELLxGENE).
""", content-type = "text/plain"}
authors = [
{name = "HTAN Data Coordinating Center"}
]
license = {text = "MIT"}
requires-python = ">=3.8"
dependencies = [
"multiqc>=1.14",
]
classifiers = [
"Development Status :: 4 - Beta",
"Environment :: Console",
"Environment :: Web Environment",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Bio-Informatics",
]
[project.urls]
Homepage = "https://github.com/ncihtan/dcqc"
Repository = "https://github.com/ncihtan/dcqc"
[project.entry-points."multiqc.modules.v1"]
dcqc_validation = "multiqc_dcqc.dcqc_validation:MultiqcModule"
[tool.setuptools]
include-package-data = true
zip-safe = false
[tool.setuptools.packages.find]
where = ["."]
include = ["multiqc_dcqc*"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/ncihtan/dcqc doesn't seem to exist, and should swap over to https://github.com/Sage-Bionetworks-Workflows/nf-dcqc (I assume)
| # Install the custom MultiQC plugin first | ||
| cd ${plugin_dir} | ||
| pip install --user --no-deps . | ||
| cd - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is going to be ran often it would make a lot more sense for the container to be published to GHCR that can be ran without needing to run this setup script each and every time. That way it's already baked into a published image that is used instead of:
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/multiqc:1.15--pyhdfd78af_0' :
'quay.io/biocontainers/multiqc:1.15--pyhdfd78af_0' }"
| label 'process_single' | ||
|
|
||
| conda "bioconda::multiqc=1.15" | ||
| container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fixed to a specific container version + id? Does it make sense to have it possible to pass a custom docker image so if it updates, can just pass that in over having to update this path, creating a PR and so forth? I was looking at here:
Lines 98 to 122 in baaf4c1
| > We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. | |
| The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). | |
| Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important! | |
| They are loaded in sequence, so later profiles can overwrite earlier profiles. | |
| If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended, since it can lead to different results on different machines dependent on the computer enviroment. | |
| - `test` | |
| - A profile with a complete configuration for automated testing | |
| - Includes links to test data so needs no other parameters | |
| - `docker` | |
| - A generic configuration profile to be used with [Docker](https://docker.com/) | |
| - `singularity` | |
| - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) | |
| - `podman` | |
| - A generic configuration profile to be used with [Podman](https://podman.io/) | |
| - `shifter` | |
| - A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/) | |
| - `charliecloud` | |
| - A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/) | |
| - `conda` | |
| - A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. | |
|
|
||
| ## Output | ||
|
|
||
| The plugin generates three main sections in the MultiQC report: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a sample output screenshot you could add here?
|
|
||
| ## Data Format | ||
|
|
||
| The plugin expects a `suites.json` file with the following structure: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it expect all of the following fields you listed to be filled out or which ones are optional?
| for f in self.find_log_files("dcqc_validation", filehandles=True): | ||
| log.debug(f"Processing file: {f['fn']}") | ||
| num_files += 1 | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to add deeper validation for the suites.json file here while parsing for the user (e.g: more debugging messages to know where their structure is failing) or perhaps even suggest a simple json tool for validating? Or does the suites.json file input come pre-validated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be worth adding a unit test suite for this python module and potentially adding it to the CI/CD outside of the usual dcqc integration tests (unless that is enough coverage)
|
@adamjtaylor Thanks for this! That's quite a bit of Python code to live in a Nextflow tool. I think it would make a lot more sense to move most/all of the Python code over to PY-DCQC. This is the Python library that is solely for CLI tools that NF-DCQC uses. |
This pull request adds a new MultiQC reporting step to the DCQC workflow, enabling automated generation of validation reports from DCQC suite results. The update introduces a custom MultiQC plugin (
multiqc_dcqc) for parsing and visualizing validation results, integrates its execution into the workflow, and provides configuration and documentation for usage and development.MultiQC Integration and Workflow Updates
MULTIQCprocess to the workflow, which runs MultiQC using a custom plugin and configuration, and stages its output in a dedicated directory. (modules/local/multiqc.nf,conf/modules.config,workflows/dcqc.nf)MultiQC DCQC Plugin Implementation
multiqc_dcqcplugin, including its initialization, configuration, and entry point for MultiQC, to parse and display DCQC validation results in the report. (multiqc/multiqc_dcqc/__init__.py,multiqc/setup.py)multiqc/multiqc_config.yaml)Documentation and Development Support
multiqc/multiqc_dcqc/README.md).gitignorefor the plugin directory to exclude Python build artifacts and development files. (multiqc/.gitignore)