This document describes the output produced by the pipeline. Most plots are taken from the pmultiqc report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes DIA data using the following steps:
- (Optional) Raw files are downloaded from PRIDE Archive using pridepy
- RAW data is converted to mzML using ThermoRawFileParser; SCIEX .wiff files are converted via WiffConverter; .d (Bruker) and .dia files are handled natively
- DIA-NN is used for identification and quantification of peptides and proteins
- The DIA-NN report is converted to an MSstats-compatible format
- Generation of QC reports using pmultiqc
Output will be saved to the folder defined by the parameter --outdir.
results/
├── pipeline_info/ # Nextflow pipeline information
├── pridepy/ # (Optional) Downloaded raw files from PRIDE Archive
├── sdrf/ # SDRF files and configs
├── quant_tables/ # Quantification tables and results
│ ├── diann_report.{tsv,parquet} # Main DIA-NN report
│ ├── diann_report.pg_matrix.tsv # Protein group matrix
│ ├── diann_report.pr_matrix.tsv # Precursor matrix
│ ├── diann_report.gg_matrix.tsv # Gene group matrix
│ └── out_msstats_in.csv # MSstats-compatible output
└── pmultiqc/ # pmultiqc reports
├── multiqc_plots/
│ ├── png/
│ ├── svg/
│ └── pdf/
└── multiqc_data/
For more detailed output with all intermediate files, use the verbose output configuration by providing -profile verbose_modules. This is useful for debugging or detailed analysis:
results/
├── pipeline_info/
├── sdrf/
├── spectra/
│ ├── thermorawfileparser/ # Converted raw files
│ └── mzml_statistics/ # mzML file statistics
├── database_generation/
│ ├── insilico_library_generation/ # In silico library
│ └── assemble_empirical_library/ # Empirical library
├── diann_preprocessing/
│ ├── preliminary_analysis/ # Preliminary analysis results
│ └── individual_analysis/ # Individual analysis results
├── quant_tables/
└── pmultiqc/
- DIA-NN quantification results:
  - quant_tables/diann_report.{tsv,parquet} - Main DIA-NN report with peptide and protein quantification
  - quant_tables/diann_report.pr_matrix.tsv - Precursor quantification matrix
  - quant_tables/diann_report.pg_matrix.tsv - Protein group quantification matrix
  - quant_tables/diann_report.gg_matrix.tsv - Gene group quantification matrix
  - quant_tables/diann_report.unique_genes_matrix.tsv - Unique gene quantification matrix
  - quant_tables/out_msstats_in.csv - MSstats-compatible quantification table
Starting with DIA-NN 2.0, the main report is produced in Apache Parquet format (diann_report.parquet) instead of the legacy TSV (diann_report.tsv). Parquet files are columnar, compressed, and significantly faster to load in downstream tools such as Python (pandas/pyarrow) or R (arrow).
| DIA-NN Version | Main report format | Matrix format |
|---|---|---|
| 1.8.1 | diann_report.tsv | .tsv |
| 2.1.0+ | diann_report.parquet | .tsv |
The pipeline detects the DIA-NN version and handles the output format automatically. Downstream steps (MSstats conversion, pmultiqc) accept both formats.
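When working with results from mixed DIA-NN versions, downstream scripts may need the same kind of format fallback. The following is a minimal sketch, not the pipeline's own logic: the helper names `find_main_report` and `load_main_report` are hypothetical, and it assumes the report sits in a `quant_tables/` directory under one of the two file names given above.

```python
from pathlib import Path


def find_main_report(quant_dir):
    """Return (path, format) for the main DIA-NN report in quant_dir.

    Prefers the Parquet report and falls back to the legacy TSV.
    """
    quant_dir = Path(quant_dir)
    candidates = [
        (quant_dir / "diann_report.parquet", "parquet"),
        (quant_dir / "diann_report.tsv", "tsv"),
    ]
    for path, fmt in candidates:
        if path.is_file():
            return path, fmt
    raise FileNotFoundError(
        f"No diann_report.parquet or diann_report.tsv in {quant_dir}"
    )


def load_main_report(quant_dir):
    """Load the main report into a pandas DataFrame, whichever format exists."""
    # Imported lazily so the path lookup works without pandas installed;
    # reading Parquet additionally requires a Parquet engine such as pyarrow.
    import pandas as pd

    path, fmt = find_main_report(quant_dir)
    if fmt == "parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path, sep="\t")
```

Calling `load_main_report("results/quant_tables")` would then return a DataFrame regardless of which DIA-NN version produced the report.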
To read Parquet files:
# Python
import pandas as pd
df = pd.read_parquet("diann_report.parquet")

# R
library(arrow)
df <- read_parquet("diann_report.parquet")

The pipeline produces quant_tables/out_msstats_in.csv, an MSstats-compatible quantification table generated by quantms-utils. This file contains long-format precursor-level intensities with the columns required by the MSstats R package for downstream statistical analysis (e.g. differential expression, sample-size estimation).
Key columns include: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity.
The condition and biological replicate assignments are derived from the SDRF factor columns.
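Before handing the table to MSstats, it can help to confirm that the header and the condition assignments look as expected. The sketch below uses only the Python standard library; the function names are hypothetical, the column list is copied from this document, and the file is assumed to be comma-delimited with a single header row.

```python
import csv

# Column names as listed in this document for out_msstats_in.csv.
MSSTATS_COLUMNS = [
    "ProteinName", "PeptideSequence", "PrecursorCharge", "FragmentIon",
    "ProductCharge", "IsotopeLabelType", "Condition", "BioReplicate",
    "Run", "Intensity",
]


def missing_msstats_columns(csv_path):
    """Return the listed MSstats columns that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in MSSTATS_COLUMNS if name not in header]


def runs_per_condition(csv_path):
    """Map each Condition to the sorted list of Run names observed for it."""
    runs = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            runs.setdefault(row["Condition"], set()).add(row["Run"])
    return {condition: sorted(names) for condition, names in runs.items()}
```

An empty list from `missing_msstats_columns` means every listed column is present; `runs_per_condition` gives a quick view of how the SDRF factor values were mapped onto runs.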
These files are not published by default. Enable them with save_* parameters or ext.* config properties (see Usage: Optional outputs).
- library_generation/*.tsv - TSV spectral library from in-silico library generation (--save_speclib_tsv)
When --enable_qpx_export is set, the pipeline produces a QPX Parquet dataset and a MuData .h5mu file under results/qpx/. <prefix> defaults to diann and can be overridden with --project_accession.
- <prefix>.feature.parquet - precursor-level features
- <prefix>.pg.parquet - protein-group intensities per run
- <prefix>.sample.parquet, <prefix>.run.parquet - SDRF-derived metadata
- <prefix>.h5mu - MuData with precursors and proteins modalities
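To check that a QPX export is complete, the expected file names can be derived from the prefix. This is an illustrative sketch only: `qpx_outputs` and `missing_qpx_outputs` are hypothetical helpers, and the suffixes are copied from the list above rather than taken from the exporter itself.

```python
from pathlib import Path

# Suffixes copied from the list of QPX outputs in this document.
QPX_SUFFIXES = [
    ".feature.parquet",
    ".pg.parquet",
    ".sample.parquet",
    ".run.parquet",
    ".h5mu",
]


def qpx_outputs(results_dir, prefix="diann"):
    """Return the expected QPX output paths under <results_dir>/qpx/."""
    qpx_dir = Path(results_dir) / "qpx"
    return [qpx_dir / f"{prefix}{suffix}" for suffix in QPX_SUFFIXES]


def missing_qpx_outputs(results_dir, prefix="diann"):
    """Return the expected QPX outputs that do not exist on disk."""
    return [path for path in qpx_outputs(results_dir, prefix) if not path.is_file()]
```

For a run started with a project accession, pass that accession as `prefix`; otherwise the default `diann` applies.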
import mudata as mu
mdata = mu.read("results/qpx/PXD019909.h5mu")

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline.
pipeline_info/:
- execution_report.html - Resource usage report
- execution_timeline.html - Timeline visualization
- execution_trace.txt - Detailed execution trace
- pipeline_dag.html - DAG visualization
- software_versions.yml - Software versions used
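The execution trace can also be inspected programmatically, for example to spot failed tasks. The sketch below assumes the trace is tab-separated with a header row that includes a `status` column, which matches Nextflow's default trace layout but may differ if the trace fields were customised; `task_status_counts` is a hypothetical helper name.

```python
import csv
from collections import Counter


def task_status_counts(trace_path):
    """Count tasks per status value in a Nextflow execution trace.

    Assumes a tab-separated file with a header row containing 'status'.
    """
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```

For example, `task_status_counts("results/pipeline_info/execution_trace.txt")` returns a mapping from each status value to the number of tasks that ended in it.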
All QC results are generated by pmultiqc, a proteomics plugin for MultiQC. The interactive HTML report provides:
- Identification and quantification metrics
- Sample-level quality statistics
- Pipeline software versions