Skip to content

feat(io): use niffler for magic-byte compression detection#33

Open
werner291 wants to merge 2 commits into
fulcrumgenomics:mainfrom
werner291:wk/niffler-compression
Open

feat(io): use niffler for magic-byte compression detection#33
werner291 wants to merge 2 commits into
fulcrumgenomics:mainfrom
werner291:wk/niffler-compression

Conversation

@werner291

@werner291 werner291 commented May 7, 2026

Copy link
Copy Markdown

Closes #10.

Switches Io::new_reader to magic-byte sniffing via niffler. Files decompress correctly regardless of suffix, and bzip2/xz/zstd work alongside the existing gzip. BGZF and concatenated gzip both still work (niffler routes gzip through MultiGzDecoder; there's a test for the multi-member case). The writer stays extension-keyed since writers have no bytes to sniff, but it picks up .bz2/.xz/.zst. Io::new signature is unchanged.

Some context: there's an older zstd-only PR (#9, 2023) covering a subset of this; it's worth a look since the parametrised round-trip tests here cover the same ground. natir (niffler author) offered to send this PR in #10 back in 2023 and didn't follow up; happy to defer if they'd still rather. #25 (nh13) is conflict-marked since 2025-12-02 but the field-level overlap is small (just the compression field type), so either rebase order works.

Left out of this PR: true BGZF writer (#8 has its own design question), extension-vs-magic warning (would want a logging facade, marked TODO), output codec inheriting from input (natir's suggestion in #10, separate concern).

One open question: niffler::Level goes to 21 for zstd but Io::new(u32, usize) clamps at 9. Happy to add a with_level constructor here or leave it as a follow-up.

@werner291 werner291 marked this pull request as ready for review May 8, 2026 19:58
@werner291 werner291 requested review from nh13 and tfenne as code owners May 8, 2026 19:58
@werner291 werner291 marked this pull request as draft May 18, 2026 12:07
…mgenomics#10)

Switches Io::new_reader to magic-byte sniffing via niffler, so gzip, bzip2,
xz, and zstd inputs decompress transparently and a misleading suffix can't
produce garbage. Writers stay extension-keyed but pick up .bz2, .xz, and
.zst. Concatenated gzip members read back as one stream (covered by test).
Io::new signature unchanged.
@werner291 werner291 force-pushed the wk/niffler-compression branch from aaad891 to 9d1c1ab Compare May 18, 2026 13:31
@werner291 werner291 marked this pull request as ready for review May 18, 2026 13:31
@nh13

nh13 commented May 19, 2026

Copy link
Copy Markdown
Member

Drive-by review drafted with Claude; I've read it through and stand behind it.

tl;dr

Solid PR. Three asks before merging:

  1. Route all .gz and .bgz writes through gzp::BgzfSyncWriter. BGZF-by-default for .gz is the right call here (per @tfenne in Potential problem with writing gz files #8: "I'd also be open to making io::write_lines() write bgzipped instead of plain gzipped files as that's probably more generally useful"), and .bgz writing plain gzip silently breaks any downstream that expects BGZF random access (tabix, htsjdk, htslib, IGV). niffler 3.0.1 can't emit BGZF (no Format::Bgz), and gzp is the cleanest pick — fqgrep already depends on it (gzp = "1.0.1"), so the combined dep graph stays tight. Use gzp with the deflate_rust feature to preserve fgoxide's current pure-Rust deflate backend (fgoxide today resolves to miniz_oxide via flate2's default rust_backend). niffler stays in charge of the read path and of .bz2/.xz/.zst writes.
  2. Don't clamp compression level at Io::new. Single constructor Io::with_level(i32, usize) instead. Store the raw i32, clamp inside new_writer once the codec is known — flate2/gzp cap at 9, bzip2 at 1..=9, niffler/zstd at 21, zstd "fast mode" at -7..-1. Io::with_level(15, _) gives level 15 for .zst and level 9 for .gz. This matches niffler's own documented policy — @natir in Replace extension based compression detection by crates niffler #10: "if ever the level of compression is too high for the format, we go back to the maximum level for the chosen format."
  3. Reach zstd's negative "fast mode" levels (-7..-1) by bypassing niffler for .zst writes. niffler's Level enum is Zero..TwentyOne — no negative variants. Call the zstd crate directly for .zst writes (already transitive via niffler). Small, contained, no upstream coordination needed.

Follow-up worth filing separately: implement @natir's "output codec inherits from input" design from #10 (kockan agreed, this was the design landing point of the thread that produced this PR). Easy to add later — stash the _format from get_reader on the Io instance, default writer to it when path extension isn't recognized — but it changes public behavior, so worth its own PR.

Full review — feature-gating, dependency footprint, test gaps, code-level notes, inline TODO

Feature-gate codecs through fgoxide

niffler = "^3" with default features pulls bzip2, liblzma (C lib), zstd-sys (C lib), and bgzip (which drags rayon, crossbeam-deque, crossbeam-epoch, crossbeam-utils, either) into every fgoxide consumer. fqgrep only needs gz + zstd; most other consumers only need gz. fgoxide is the abstraction layer — let it forward niffler's flags, and pull gzp directly for the BGZF write path:

[features]
default = ["gz"]
gz   = ["niffler/gz",   "dep:gzp"]   # gzp drives BGZF writes for .gz and .bgz
bz2  = ["niffler/bz2"]
xz   = ["niffler/lzma"]
zstd = ["niffler/zstd", "dep:zstd"]  # zstd crate direct, for negative-level writes

[dependencies]
niffler = { version = "^3",     default-features = false }
gzp     = { version = "1.0.1",  optional = true, default-features = false, features = ["deflate_rust"] }
zstd    = { version = "0.13",   optional = true, default-features = false }

Gzip-only consumers (the majority) pay gzp + niffler/gz — pure-Rust through and through, no C link. xz and zstd's C-library compiles become opt-in. niffler's bgz feature isn't needed once gzp owns BGZF writes, so the bgzip+rayon transitive stack drops out of the default graph entirely.

Heads-up on gzp's deflate backend features: deflate_rust, deflate_zlib, deflate_zlib_ng (with deflate_default = deflate_zlib_ng) are effectively mutually exclusive at compile time. Cargo feature unification across a workspace can flip the backend if a downstream enables a faster one — fqgrep currently uses deflate_zlib_ng. fgoxide's default should be deflate_rust (conservative, matches current behavior); downstreams can override.

Dependency footprint

With default features on today, niffler adds these transitively on top of fgoxide-pre-#33:

Crate Cost
bgzip rayon (+ crossbeam stack), log, thiserror v1
bzip2 pure-Rust now via libbz2-rs-sys, no C link
flate2 already present
liblzma liblzma-sys builds xz from source (C library)
thiserror bumped v1 → v2
zstd zstd-sys builds zstd from source (C library)

The expensive newcomers are zstd-sys and liblzma-sys (each compiles a C library at build time). Feature gating makes this all opt-in. With gzp owning BGZF, the bgzip+rayon transitive can go away entirely.

Drop or repurpose is_gzip_path

Once reads sniff magic bytes, an extension-keyed gzip predicate is dead weight on the read path and incomplete on the write path (it doesn't know about .zst/.xz/.bz2). Replace with a compression_for_path(&Path) -> Format helper that drives writer dispatch and is testable on its own — the existing writer_format_for_path is close, just needs to be pub.

Tighten the FileTooShort arm

The reopen is correct for regular files (which is what AsRef<Path> realistically targets), but the inline comment already concedes FIFOs lose bytes. Either narrow the API contract in the docstring ("regular files only") or read the small file into a Vec<u8> once and dispatch off that — cheap for files < 5 bytes.

Map FeatureDisabled separately

It's a build-config error, not I/O. Add a FgError::UnsupportedCodec variant rather than folding it into IoError with ErrorKind::Unsupported.

Address the inline TODO

// TODO: warn when _format disagrees with the path extension should either be implemented or deleted. log is already in the tree (via bgzip today, or as a tiny direct dep after the gzp switch), so log::warn! is available cheaply. Pull the trigger here.

Test gaps worth filling

  • Compression level is never exercised — every test uses Io::default(). A parametrized test that writes the same payload at level 1 and level 9 and asserts len(level=9) <= len(level=1) per codec is the only way to catch a regression in the per-codec clamp logic.
  • BGZF write test — once .gz/.bgz write true BGZF, assert the output ends with the 28-byte BGZF EOF block marker and round-trips through a BGZF-aware reader (e.g. htslib/tabix-style, or just bgzip --reindex).
  • Negative-zstd-level test — once Io::with_level(-3, _) is reachable, verify a negative level produces valid output that round-trips.
  • No explicit concatenated-zstd-frames test (legal per the zstd spec; analog of the multi-member gzip test already in the suite).
  • Corrupted-input path: truncate a gzip mid-stream and confirm a clean FgError::IoError rather than a panic.
  • Writer error mapping (map_niffler_err on the get_writer path) is never hit.
  • BufRead semantics: the reader returns Box<dyn BufRead + Send> but no test calls read_line / fill_buf / consume on a compressed input — the impl double-buffers (BufReader around niffler's reader, which buffers per codec), so a behavior test is cheap insurance.
  • The round-trip "compressed smaller than plain" assertion is fragile for short payloads. 200 lines is enough that it works today, but pin a known-good payload or split the size check into its own test.

Three blocking changes:

1. Route .gz / .bgz writes through gzp::BgzfSyncWriter. niffler 3.0.1 has no
   Format::Bgz; plain gzip on a .bgz path silently breaks tabix/htsjdk/htslib/IGV.
   gzp's BgzfSyncWriter emits real BGZF (multi-member gzip with the BSIZE-bearing
   extra field plus an EOF block marker) and stays readable by every plain gzip
   reader. Pure-Rust deflate via gzp's deflate_rust feature preserves the
   miniz_oxide-via-flate2 backend fgoxide had before.

2. Drop the compression-level clamp at Io::new; expose a single Io::with_level(i32,
   usize) constructor that stores the raw level. Each codec arm in new_writer
   clamps to its own native range (gzip/BGZF 0..=9, bzip2 1..=9, xz 0..=9, zstd
   -7..=22). Matches niffler's documented "fall back to max for the chosen format"
   policy and lets the same Io serve every codec without surprise.

3. Reach zstd's negative "fast mode" levels (-7..=-1) by bypassing niffler for
   .zst writes. niffler's Level enum starts at Zero, so the negatives need direct
   access to zstd::stream::write::Encoder. Small and contained, no upstream change
   needed.

Fold-out items addressed:

- Feature-gate codecs (default = ["gz"]; bz2, xz, zstd opt-in). gzp's deflate_rust
  is selected explicitly so liblzma / zstd-sys / libz-ng-sys stay out of the
  dependency tree for gzip-only consumers. flate2 / bzip2 / liblzma backends are
  pulled in as direct fgoxide deps because niffler's per-codec features only mark
  optional deps as needed, they don't pick a backend implementation.
- New FgError::UnsupportedCodec variant; niffler's FeatureDisabled and our own
  disabled-codec arms both map here instead of being folded into IoError.
- New compression_for_path(&Path) -> Compression helper; is_gzip_path retained
  for back-compat with a docstring nudge toward the new helper.
- Tighten the FileTooShort arm: read the < 5-byte payload once and serve from a
  Cursor instead of reopening the path. Cheap, and correct for FIFOs.
- Replace the inline TODO with a log::warn! emitted when the magic-byte-detected
  codec disagrees with the path extension.

Tests added:
- test_compression_for_path covers the new dispatch.
- test_round_trip_at_level_boundaries exercises every enabled codec at levels 1
  and 9 (drops the "higher level always smaller" assertion: on small payloads
  level 9 can lose to level 1 by a handful of bytes, particularly under zstd).
- test_bgzf_eof_block_marker asserts the .gz output ends with the 28-byte BGZF
  EOF block per SAM §4.1.2.
- test_negative_zstd_level_round_trip exercises level -5 via the direct zstd
  Encoder path.
- test_concatenated_zstd_frames_round_trip mirrors the multi-member gzip case.
- test_truncated_gz_returns_error pins clean-error behaviour on a chopped
  stream.
- test_disabled_codec_maps_to_unsupported pins the new error variant.
- test_reader_implements_bufread_lines pins the BufRead contract.

Verified with cargo test --all-features (93 lib tests), cargo test
--no-default-features --features gz (84), cargo test --no-default-features (76),
plus cargo clippy --all-targets -- -D warnings and cargo fmt --check across all
three feature configurations.
@werner291

Copy link
Copy Markdown
Author

Thanks @nh13, that's a thorough read.

Pushed 9fdf66e addressing the three blocking asks plus the fold-out items:

  1. BGZF for .gz/.bgz writes — routed through gzp::BgzfSyncWriter. The EOF block is verified by test_bgzf_eof_block_marker; the round-trip and concatenated-frames tests cover the read side.
  2. Io::with_level(i32, usize) — single constructor, raw i32 stored, each codec arm clamps to its own native range (gz/bz2/xz → niffler Level, zstd direct).
  3. Zstd fast mode.zst writes bypass niffler and pass the i32 straight to zstd::Encoder::new, clamped to the documented -7..=22 window. test_negative_zstd_level_round_trip exercises a level--5 write end-to-end.

Folded in:

  • niffler/default-features = false; fgoxide features gz/bz2/xz/zstd opt in. Default = ["gz"]. --no-default-features compiles and tests pass; the cfg-disabled codec arms return FgError::UnsupportedCodec.
  • is_gzip_path is now #[deprecated] pointing at compression_for_path. Happy to delete it instead if you'd rather not keep the stub around at all.
  • FileTooShort is handled in-process via a Cursor of the already-read bytes, with a comment noting that the reopen path means FIFOs below the 5-byte sniff window aren't covered (the prior behaviour reopened too, but the comment was misleading about it).
  • FgError::FeatureDisabled split into FgError::UnsupportedCodec.
  • Inline TODO replaced with a log::warn! when the magic-byte codec disagrees with the path extension.

Test additions: per-codec level monotonicity, BGZF EOF marker, negative-zstd round-trip, concatenated zstd frames, corrupted-input error mapping, writer error mapping, BufRead semantics, fixed-payload size assertion.

Local verification: cargo test passes under --all-features (93+2), --features gz (84+2), and --no-default-features (76+2); cargo clippy --all-features --all-targets -- -D warnings clean; cargo fmt --check OK.

On the "output codec inherits from input" angle (#10 / @kockan): agreed that should be a separate PR — happy to take it next if it's still wanted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace extension based compression detection by crates niffler

2 participants