perf: parallel ISA-L compression and pwrite output pipeline #666

Closed
KimYannn wants to merge 1 commit into OpenGene:master from KimYannn:feat/parallel-isal-compress

@KimYannn
Member

Summary

  • Parallel ISA-L gzip compression: Worker threads compress output in parallel using ISA-L, with flight batch management to coordinate ordering, replacing the serial single-writer-thread bottleneck
  • Parallel pwrite output: Workers write directly via pwrite() using a lock-free offset ring buffer, eliminating the writer thread as a serialization point
  • Bounded mismatch counting: Overlap analysis exits early, rejecting a non-overlapping offset after comparing just the first few dozen bytes instead of the full 150 bp (roughly 5-10x less comparison work)
  • Adaptive timeout with spin backoff: EMA-based ingress rate estimation auto-tunes flush timeout; progressive backoff replaces tight spin-wait in pwrite offset coordination
  • Decoupled FASTQ reader/writer with zstd support: Reader and writer paths are decoupled, with zstd input/output support and seekable output with autotuned workers
  • Pipeline backpressure and writer flight control: Improved flow control between pipeline stages
  • AArch64 static linking fix: Removed full -static linking that caused ISA-L relocation overflow on ARM64
  • End-to-end benchmark suite (scripts/bench_e2e.py): PE/SE/stdin-stdout modes, hardware profiling, dual-mode (baseline vs optimized) comparison with JSON merge capability
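
The parallel compression idea in the first bullet can be illustrated with a minimal sketch. It is not the PR's code: it assumes Python's standard `gzip` module in place of ISA-L, and the function name `compress_chunks_parallel` and the chunk size are hypothetical. What it shows is the underlying property: each chunk is compressed statelessly into its own gzip member, and the members concatenated in input order form a valid gzip stream.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_chunks_parallel(chunks, workers=4, level=1):
    """Compress each chunk into its own gzip member, preserving input order.

    Hypothetical helper: stdlib gzip stands in for ISA-L here.
    """
    def compress_one(chunk):
        # Stateless: no compression dictionary is shared between chunks,
        # so chunks can be compressed independently on any thread.
        return gzip.compress(chunk, compresslevel=level)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order even when workers
        # finish out of order, which is the ordering guarantee we need.
        members = list(pool.map(compress_one, chunks))
    # Concatenated gzip members are themselves a valid gzip stream.
    return b"".join(members)


# Build 1000 small FASTQ-like records and group them into chunks of 100.
records = [b"@read%d\nACGTACGT\n+\nIIIIIIII\n" % i for i in range(1000)]
chunks = [b"".join(records[i:i + 100]) for i in range(0, len(records), 100)]

blob = compress_chunks_parallel(chunks)
restored = gzip.decompress(blob)  # reads all members back to back
```

Compressing chunks independently usually costs some compression ratio, because each member starts with an empty history window; that trade-off is the price of removing the serial compressor.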

Key Performance Improvements

Mode                        Before   After   Speedup
PE fq→fq (-w 12)            8.21s    6.33s   1.30x
PE fq→gz (parallel ISA-L)   -        -       significant (eliminates writer bottleneck)
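
The writer bottleneck goes away because positional writes let any thread place its block at a known file offset without sharing a file position. The sketch below shows that idea only, under simplifying assumptions: all block sizes are known up front, so offsets come from a prefix sum rather than from the lock-free offset ring buffer the PR describes, and `write_blocks_with_pwrite` is a hypothetical name. `os.pwrite` is available on Unix only.

```python
import os
import tempfile
import threading
from itertools import accumulate


def write_blocks_with_pwrite(path, blocks, workers=4):
    """Write blocks in file order from several threads using os.pwrite.

    Hypothetical helper; offsets are precomputed here, whereas the PR
    coordinates them at run time.
    """
    # Prefix sums: block i starts where blocks 0..i-1 end.
    offsets = [0] + list(accumulate(len(b) for b in blocks))[:-1]
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        def worker(start):
            # Each worker takes every workers-th block, in any timing order.
            for i in range(start, len(blocks), workers):
                data, off = blocks[i], offsets[i]
                while data:
                    # pwrite may write fewer bytes than asked; loop until done.
                    n = os.pwrite(fd, data, off)
                    data, off = data[n:], off + n

        threads = [threading.Thread(target=worker, args=(w,))
                   for w in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)


# Eight blocks of different sizes and contents.
blocks = [bytes([65 + i]) * (100 + i) for i in range(8)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.bin")
    write_blocks_with_pwrite(out_path, blocks)
    with open(out_path, "rb") as fh:
        written = fh.read()
```

In a compressing pipeline the size of a block is only known after compression, which is presumably why the PR needs run-time offset coordination instead of a precomputed table.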

Changed Files

  • Core: peprocessor.cpp/h, seprocessor.cpp/h, writerthread.cpp/h, writer.cpp/h
  • New: isal_compress.h, flight_batch_manager.cpp/h, trace_profiler.cpp/h
  • Build: Makefile, .github/workflows/ci.yml
  • Bench: scripts/bench_e2e.py (replaces bench_e2e.sh)
  • Docs: 3 RFCs in docs/rfc/, 2 benchmark results in docs/bench/
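
The summary mentions an EMA-based ingress rate estimate that tunes the flush timeout. The contents of flight_batch_manager.cpp/h are not shown on this page, so the following is only a guess at the general shape of such a controller. The class name, parameters, and default values are all hypothetical.

```python
class AdaptiveFlushTimeout:
    """Derive a flush timeout from an exponential moving average of ingress rate.

    Hypothetical sketch: names and defaults are illustrative, not from the PR.
    """

    def __init__(self, target_batch_bytes, alpha=0.2,
                 min_timeout=0.001, max_timeout=0.1):
        self.target_batch_bytes = target_batch_bytes
        self.alpha = alpha              # weight of the newest sample
        self.min_timeout = min_timeout  # seconds
        self.max_timeout = max_timeout  # seconds
        self.rate = None                # EMA of bytes per second

    def observe(self, nbytes, elapsed_s):
        """Fold one (bytes, seconds) measurement into the EMA."""
        if elapsed_s <= 0:
            return
        sample = nbytes / elapsed_s
        if self.rate is None:
            self.rate = sample
        else:
            self.rate = self.alpha * sample + (1 - self.alpha) * self.rate

    def timeout(self):
        """Time a full batch should take at the current rate, clamped."""
        if not self.rate:
            return self.max_timeout
        ideal = self.target_batch_bytes / self.rate
        return min(self.max_timeout, max(self.min_timeout, ideal))


ctl = AdaptiveFlushTimeout(target_batch_bytes=1_000_000)
before_any_sample = ctl.timeout()
ctl.observe(1_000_000, 0.01)   # fast ingress
fast_timeout = ctl.timeout()
ctl.observe(1_000_000, 1.0)    # one slow sample nudges the estimate
after_slow_sample = ctl.timeout()
```

The clamp keeps a stalled input from producing an unbounded wait and keeps a burst from producing a timeout too short to be useful.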

Test Plan

  • CI passes on Linux (x86_64) and macOS (ARM64)
  • E2E benchmark: python scripts/bench_e2e.py ./fastp_baseline ./fastp_opt
  • MD5 verification of output correctness across all modes (PE/SE, fq/gz/zst)
  • Memory usage (peak RSS) does not regress significantly
  • Verify pwrite gz output is byte-identical to serial writer output
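
The MD5 checks in the test plan can be scripted along these lines. This is a sketch, not part of scripts/bench_e2e.py: the helper names are hypothetical, and the demo files are generated with Python's `gzip` module rather than by fastp. It separates the two comparisons the plan implies: raw bytes (byte-identical output) and decompressed payload (same reads, possibly different compression).

```python
import gzip
import hashlib
import os
import tempfile


def md5_of_stream(fh, bufsize=1 << 20):
    """Hash a file-like object in fixed-size blocks to bound memory use."""
    digest = hashlib.md5()
    for block in iter(lambda: fh.read(bufsize), b""):
        digest.update(block)
    return digest.hexdigest()


def md5_of_file(path, decompress=False):
    """MD5 of the raw bytes, or of the decompressed payload for gzip files."""
    opener = gzip.open if decompress else open
    with opener(path, "rb") as fh:
        return md5_of_stream(fh)


payload = b"@r1\nACGT\n+\nIIII\n" * 1000
with tempfile.TemporaryDirectory() as tmp:
    serial = os.path.join(tmp, "serial.fq.gz")
    parallel = os.path.join(tmp, "parallel.fq.gz")
    # One gzip member, as a serial writer might produce.
    with open(serial, "wb") as fh:
        fh.write(gzip.compress(payload, compresslevel=6))
    # Two members at a different level, as a chunked writer might produce.
    half = len(payload) // 2
    with open(parallel, "wb") as fh:
        fh.write(gzip.compress(payload[:half], compresslevel=1))
        fh.write(gzip.compress(payload[half:], compresslevel=1))

    payload_md5 = md5_of_file(serial, decompress=True)
    same_payload = payload_md5 == md5_of_file(parallel, decompress=True)
    same_bytes = md5_of_file(serial) == md5_of_file(parallel)
```

Here the payloads match while the raw files differ, which shows why the byte-identical check in the last item is the stricter of the two.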

🤖 Generated with Claude Code

@KimYannn force-pushed the feat/parallel-isal-compress branch 8 times, most recently from abc71f6 to 9d57b42 (March 21, 2026 21:34)
@KimYannn marked this pull request as draft (March 22, 2026 02:59)
@KimYannn force-pushed the feat/parallel-isal-compress branch 5 times, most recently from 377a19a to 7769cc7 (March 22, 2026 04:20)
- Replace zlib with ISA-L for gzip compression in worker threads,
  using SIMD-accelerated stateless deflate for parallel chunk compression
- Add pwrite-based multi-threaded output bypassing single writer bottleneck
- Implement adaptive flight batch manager with timeout and size controls
- Add bounded mismatch counting in overlap analysis for early exit
- Decouple FASTQ reader/writer and add zstd input/output support
- Switch zst backend to seekable output with autotuned workers
- Refine thread budget split and writer-path autotuning
- Keep -w as the worker thread count (backward-compatible with upstream);
  extra reader/writer threads are added automatically
- Add CI workflow for Ubuntu and macOS with ISA-L/Highway/htslib builds
- Fix Debian/Ubuntu multiarch lib paths in Makefile FIND_STATIC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
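
Bounded mismatch counting, listed in the commit message above, amounts to abandoning a candidate offset as soon as the mismatch count passes the allowed limit. The sketch below illustrates that early exit. It is not fastp's overlap analysis: the function names, the `max_mismatches` and `min_overlap` parameters, and their defaults are hypothetical.

```python
def count_mismatches_bounded(a, b, limit):
    """Count mismatching positions, stopping once the count exceeds limit.

    The return value is exact when it is <= limit; otherwise it is limit + 1.
    """
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break  # early exit: this offset is already rejected
    return mismatches


def find_overlap_offset(read1, read2_rc, max_mismatches=5, min_overlap=30):
    """Return the first offset in read1 where read2_rc overlaps, else None."""
    for offset in range(len(read1) - min_overlap + 1):
        overlap_len = min(len(read1) - offset, len(read2_rc))
        n = count_mismatches_bounded(
            read1[offset:offset + overlap_len],
            read2_rc[:overlap_len],
            max_mismatches,
        )
        if n <= max_mismatches:
            return offset
    return None


# read2_rc starts matching read1 at offset 20.
read1 = "A" * 20 + "C" * 40
read2_rc = "C" * 40 + "G" * 10
exact_offset = find_overlap_offset(read1, read2_rc, max_mismatches=0)
no_overlap = find_overlap_offset("A" * 60, "C" * 60)
stopped_early = count_mismatches_bounded("AAAAAAAA", "CCCCCCCC", 2)
```

For a wrong offset on unrelated sequences, mismatches accumulate quickly, so the loop typically stops after a handful of positions rather than scanning the whole overlap.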
@KimYannn force-pushed the feat/parallel-isal-compress branch from 7769cc7 to 0f29e8b (March 22, 2026 04:25)
@KimYannn closed this (Mar 22, 2026)