Commit f115e48

Merge pull request #24 from bigbio/feature/dat-bypass-workflow
Feature/dat bypass workflow
2 parents 3a7fd3e + 7b8feb3 commit f115e48

2 files changed

Lines changed: 220 additions & 33 deletions

README.md

Lines changed: 6 additions & 33 deletions
@@ -8,42 +8,15 @@ quantms has reanalyzed an extensive number of datasets with almost 1 billion MS/

 ## Workflow Overview

-> A diagram will be added here once the unified workflow figure is drawn.
-> In the meantime, see the ASCII flow below and the design spec at
-> [`docs/workflow_diagram_spec.md`](docs/workflow_diagram_spec.md).
->
-> File formats (QPX inputs, the `.dat` binary, `.scan_titles.txt`,
-> cluster-DB parquets, MSP, and the pre-existing cluster DB layout
-> expected by `--existing_cluster_db`) are documented in
-> [`docs/formats.md`](docs/formats.md).
+![SpectrafUSE Workflow](docs/images/spectrafuse_workflow.svg)

+> See [`docs/formats.md`](docs/formats.md) for the exact structure of every
+> file the pipeline reads and writes — QPX inputs, the `.dat` binary,
+> `.scan_titles.txt`, cluster-DB parquets, MSP, and the pre-existing cluster
+> DB layout expected by `--existing_cluster_db`.

-SpectrafUSE is a single pipeline. New QPX projects are always converted to MaRaCluster's `.dat` binary format (~100 bytes/spectrum), sliced into precursor m/z windows, clustered, and written to a cluster DB plus an MSP spectral library. If `--existing_cluster_db <path>` is supplied, representative spectra from that DB are extracted to `.dat` and clustered alongside the new data — the rest of the pipeline is identical, and the final step merges into the existing DB instead of writing a fresh one.

-```
-(--existing_cluster_db) ┌─ EXTRACT_REPS_DAT ─┐
-│ │
-new QPX projects ─────────── ├─ PARQUET_TO_DAT ───┤
-└────────────────────┴──┐
-
-CONCAT_DAT_FILES (per species/[instrument]/charge)
-
-
-SPLIT_MZ_WINDOWS (default: 300 Da, 1 Da overlap)
-
-
-RUN_MARACLUSTER_DAT (parallel per window)
-
-
-MERGE_MZ_WINDOWS (reconcile overlap zones)
-
-
-┌──────────── cluster TSV + scan_titles ──────────┐
-▼ ▼
-BUILD_CLUSTER_DB ▲ GENERATE_MSP_FORMAT (*.msp.gz per partition)
-MERGE_INTO_EXISTING_DB │
-(if --existing_cluster_db)
-```
+SpectrafUSE is a single pipeline. New QPX projects are always converted to MaRaCluster's `.dat` binary format (~100 bytes/spectrum), sliced into precursor m/z windows, clustered, and written to a cluster DB plus an MSP spectral library. If `--existing_cluster_db <path>` is supplied, representative spectra from that DB are extracted to `.dat` and clustered alongside the new data — the rest of the pipeline is identical, and the final step merges into the existing DB instead of writing a fresh one.

 1. **Parquet → Dat** (`PARQUET_TO_DAT`): Converts PSM parquet files directly to MaRaCluster's binary `.dat` format. Replicates MaRaCluster's internal binning (`bin = floor(mz / 1.000508 + 0.32)`, top-40 peaks). ~100 bytes/spectrum.

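The `PARQUET_TO_DAT` entry in the hunk above quotes a binning rule, `bin = floor(mz / 1.000508 + 0.32)`, together with a top-40 peak limit. The sketch below only illustrates that arithmetic. The function names, the choice to rank peaks by intensity before binning, and the de-duplication of bins are assumptions made for illustration; they are not taken from the SpectrafUSE or MaRaCluster sources.

```python
import math

# Constants quoted in the README text for PARQUET_TO_DAT.
BIN_WIDTH = 1.000508
BIN_OFFSET = 0.32
TOP_N_PEAKS = 40


def mz_to_bin(mz: float) -> int:
    """Apply the README's rule: bin = floor(mz / 1.000508 + 0.32)."""
    return math.floor(mz / BIN_WIDTH + BIN_OFFSET)


def bin_spectrum(peaks, top_n: int = TOP_N_PEAKS) -> list[int]:
    """Return sorted, unique bin indices for the most intense peaks.

    `peaks` is an iterable of (mz, intensity) pairs. Selecting by intensity
    first and de-duplicating bins afterwards is an assumption of this sketch,
    not a documented detail of the pipeline.
    """
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    return sorted({mz_to_bin(mz) for mz, _ in strongest})


if __name__ == "__main__":
    example = [(100.0, 5.0), (100.1, 3.0), (500.0, 10.0)]
    print(bin_spectrum(example))
```

With this rule, two peaks at 100.0 and 100.1 m/z fall into the same bin, which is why the sketch returns unique bin indices rather than one entry per peak.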

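The removed ASCII flow lists `SPLIT_MZ_WINDOWS` with defaults of 300 Da windows and 1 Da overlap. One plausible reading of those defaults is sketched below: each window is 300 Da wide and shares its last 1 Da with the start of the next one. The function name `mz_windows`, the clipping of the final window, and the example m/z range are hypothetical; the actual window layout used by the pipeline is not shown in this diff.

```python
def mz_windows(mz_min: float, mz_max: float,
               width: float = 300.0, overlap: float = 1.0):
    """Return (start, end) precursor m/z windows covering [mz_min, mz_max].

    Adjacent windows share `overlap` Da. This layout is an assumption of the
    sketch, chosen to match the "300 Da, 1 Da overlap" defaults in the README.
    """
    if width <= overlap:
        raise ValueError("width must be larger than overlap")
    windows = []
    start = mz_min
    while True:
        end = start + width
        # Clip the last window to the requested upper bound.
        windows.append((start, min(end, mz_max)))
        if end >= mz_max:
            break
        # Step back by the overlap so neighbouring windows share a zone.
        start = end - overlap
    return windows


if __name__ == "__main__":
    for lo, hi in mz_windows(300.0, 1500.0):
        print(f"{lo:.1f} - {hi:.1f}")
```

Spectra whose precursor m/z falls in a shared zone would appear in two windows, which is consistent with the `MERGE_MZ_WINDOWS` step being described as reconciling overlap zones.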