feat(io): use niffler for magic-byte compression detection #33
werner291 wants to merge 2 commits into
…mgenomics#10) Switches Io::new_reader to magic-byte sniffing via niffler, so gzip, bzip2, xz, and zstd inputs decompress transparently and a misleading suffix can't produce garbage. Writers stay extension-keyed but pick up .bz2, .xz, and .zst. Concatenated gzip members read back as one stream (covered by test). Io::new signature unchanged.
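As a rough illustration of what magic-byte sniffing means here, the sketch below classifies a stream from its leading bytes using only the standard library. It is not niffler's implementation: the `Sniffed` enum and `sniff` function are hypothetical names introduced for this example, and only the four signatures for the codecs named in the commit message are checked.

```rust
/// Formats this sketch can recognise (hypothetical enum, not niffler's `Format`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sniffed {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Plain,
}

/// Classify a stream from its first few bytes.
/// Anything that matches no known signature is treated as uncompressed.
pub fn sniff(head: &[u8]) -> Sniffed {
    // Well-known file signatures for each container format.
    const GZIP: &[u8] = &[0x1f, 0x8b];
    const BZIP2: &[u8] = b"BZh";
    const XZ: &[u8] = &[0xfd, b'7', b'z', b'X', b'Z', 0x00];
    const ZSTD: &[u8] = &[0x28, 0xb5, 0x2f, 0xfd];

    if head.starts_with(GZIP) {
        Sniffed::Gzip
    } else if head.starts_with(BZIP2) {
        Sniffed::Bzip2
    } else if head.starts_with(XZ) {
        Sniffed::Xz
    } else if head.starts_with(ZSTD) {
        Sniffed::Zstd
    } else {
        Sniffed::Plain
    }
}
```

Because the decision depends only on the bytes, a gzip stream saved under a `.txt` name would still be classified as gzip, which is the property the commit message relies on. Inputs shorter than a signature simply fall through to `Plain` in this sketch.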
aaad891 to 9d1c1ab
Drive-by review drafted with Claude; I've read it through and stand behind it.
tl;dr
Solid PR. Three asks before merging:
Follow-up worth filing separately: implement @natir's "output codec inherits from input" design from #10 (kockan agreed, this was the design landing point of the thread that produced this PR). Easy to add later: stash the
Full review: feature-gating, dependency footprint, test gaps, code-level notes, inline TODO
Feature-gate codecs through fgoxide
[features]
default = ["gz"]
gz = ["niffler/gz", "dep:gzp"] # gzp drives BGZF writes for .gz and .bgz
bz2 = ["niffler/bz2"]
xz = ["niffler/lzma"]
zstd = ["niffler/zstd", "dep:zstd"] # zstd crate direct, for negative-level writes
[dependencies]
niffler = { version = "^3", default-features = false }
gzp = { version = "1.0.1", optional = true, default-features = false, features = ["deflate_rust"] }
zstd = { version = "0.13", optional = true, default-features = false }
Gzip-only consumers (the majority) pay
Heads-up on
Dependency footprint
With default features on today, niffler adds these transitively on top of fgoxide-pre-#33:
The expensive newcomers are
Drop or repurpose
Three blocking changes:

1. Route .gz / .bgz writes through gzp::BgzfSyncWriter. niffler 3.0.1 has no Format::Bgz; plain gzip on a .bgz path silently breaks tabix/htsjdk/htslib/IGV. gzp's BgzfSyncWriter emits real BGZF (multi-member gzip with the BSIZE-bearing extra field plus an EOF block marker) and stays readable by every plain gzip reader. Pure-Rust deflate via gzp's deflate_rust feature preserves the miniz_oxide-via-flate2 backend fgoxide had before.
2. Drop the compression-level clamp at Io::new; expose a single Io::with_level(i32, usize) constructor that stores the raw level. Each codec arm in new_writer clamps to its own native range (gzip/BGZF 0..=9, bzip2 1..=9, xz 0..=9, zstd -7..=22). Matches niffler's documented "fall back to max for the chosen format" policy and lets the same Io serve every codec without surprise.
3. Reach zstd's negative "fast mode" levels (-7..=-1) by bypassing niffler for .zst writes. niffler's Level enum starts at Zero, so the negatives need direct access to zstd::stream::write::Encoder. Small and contained, no upstream change needed.

Fold-out items addressed:

- Feature-gate codecs (default = ["gz"]; bz2, xz, zstd opt-in). gzp's deflate_rust is selected explicitly so liblzma / zstd-sys / libz-ng-sys stay out of the dependency tree for gzip-only consumers. flate2 / bzip2 / liblzma backends are pulled in as direct fgoxide deps because niffler's per-codec features only mark optional deps as needed, they don't pick a backend implementation.
- New FgError::UnsupportedCodec variant; niffler's FeatureDisabled and our own disabled-codec arms both map here instead of being folded into IoError.
- New compression_for_path(&Path) -> Compression helper; is_gzip_path retained for back-compat with a docstring nudge toward the new helper.
- Tighten the FileTooShort arm: read the < 5-byte payload once and serve from a Cursor instead of reopening the path. Cheap, and correct for FIFOs.
- Replace the inline TODO with a log::warn! emitted when the magic-byte-detected codec disagrees with the path extension.

Tests added:

- test_compression_for_path covers the new dispatch.
- test_round_trip_at_level_boundaries exercises every enabled codec at levels 1 and 9 (drops the "higher level always smaller" assertion: on small payloads level 9 can lose to level 1 by a handful of bytes, particularly under zstd).
- test_bgzf_eof_block_marker asserts the .gz output ends with the 28-byte BGZF EOF block per SAM §4.1.2.
- test_negative_zstd_level_round_trip exercises level -5 via the direct zstd Encoder path.
- test_concatenated_zstd_frames_round_trip mirrors the multi-member gzip case.
- test_truncated_gz_returns_error pins clean-error behaviour on a chopped stream.
- test_disabled_codec_maps_to_unsupported pins the new error variant.
- test_reader_implements_bufread_lines pins the BufRead contract.

Verified with cargo test --all-features (93 lib tests), cargo test --no-default-features --features gz (84), cargo test --no-default-features (76), plus cargo clippy --all-targets -- -D warnings and cargo fmt --check across all three feature configurations.
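The per-codec clamping described in blocking change 2 can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Codec` enum and `clamp_level` function are hypothetical names, and the ranges are taken directly from the commit message above.

```rust
/// Codecs named in the commit message (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Gzip, // also covers BGZF writes
    Bzip2,
    Xz,
    Zstd,
}

/// Clamp a raw, codec-agnostic level into the native range of `codec`.
/// Ranges follow the commit message: gzip/BGZF 0..=9, bzip2 1..=9,
/// xz 0..=9, zstd -7..=22.
pub fn clamp_level(codec: Codec, raw: i32) -> i32 {
    let (lo, hi) = match codec {
        Codec::Gzip => (0, 9),
        Codec::Bzip2 => (1, 9),
        Codec::Xz => (0, 9),
        Codec::Zstd => (-7, 22),
    };
    raw.clamp(lo, hi)
}
```

Storing the raw `i32` and clamping only at the point where the codec is known means one stored level (say 19) yields zstd level 19 for a `.zst` path and level 9 for a `.gz` path, rather than being flattened to 9 up front.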
Thanks @nh13, that's a thorough read. Pushed
Folded in:
Test additions: per-codec level monotonicity, BGZF EOF marker, negative-zstd round-trip, concatenated zstd frames, corrupted-input error mapping, writer error mapping, BufRead semantics, fixed-payload size assertion.
Local verification:
On the "output codec inherits from input" angle (#10 / @kockan): agreed that should be a separate PR; happy to take it next if it's still wanted.
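For readers unfamiliar with the BGZF EOF marker check listed among the tests, here is a standalone sketch of what such an assertion looks at. The constant holds the 28-byte empty BGZF block defined in the SAM specification; the names `BGZF_EOF` and `ends_with_bgzf_eof` are hypothetical and are not taken from the PR's test code.

```rust
/// The 28-byte empty BGZF block that marks end-of-file (per the SAM spec).
pub const BGZF_EOF: [u8; 28] = [
    0x1f, 0x8b, 0x08, 0x04, // gzip magic, deflate method, FEXTRA flag set
    0x00, 0x00, 0x00, 0x00, // MTIME
    0x00, 0xff, // XFL, OS (unknown)
    0x06, 0x00, // XLEN = 6
    0x42, 0x43, 0x02, 0x00, // 'B' 'C' subfield, payload length 2
    0x1b, 0x00, // BSIZE = 27, i.e. total block size minus one
    0x03, 0x00, // empty deflate stream
    0x00, 0x00, 0x00, 0x00, // CRC32 of empty payload
    0x00, 0x00, 0x00, 0x00, // ISIZE = 0
];

/// Returns true when `bytes` ends with the BGZF EOF marker block.
pub fn ends_with_bgzf_eof(bytes: &[u8]) -> bool {
    bytes.ends_with(&BGZF_EOF)
}
```

A plain gzip writer does not emit this trailer, so checking for it is a cheap way to distinguish real BGZF output from ordinary gzip written to a `.gz` or `.bgz` path.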
Closes #10.
Switches Io::new_reader to magic-byte sniffing via niffler. Files decompress correctly regardless of suffix, and bzip2/xz/zstd work alongside the existing gzip. BGZF and concatenated gzip both still work (niffler routes gzip through MultiGzDecoder; there's a test for the multi-member case). The writer stays extension-keyed since writers have no bytes to sniff, but it picks up .bz2/.xz/.zst. Io::new signature is unchanged.

Some context: there's an older zstd-only PR (#9, 2023) covering a subset of this; it's worth a look since the parametrised round-trip tests here cover the same ground. natir (niffler author) offered to send this PR in #10 back in 2023 and didn't follow up; happy to defer if they'd still rather. #25 (nh13) is conflict-marked since 2025-12-02 but the field-level overlap is small (just the compression field type), so either rebase order works.

Left out of this PR: true BGZF writer (#8 has its own design question), extension-vs-magic warning (would want a logging facade, marked TODO), output codec inheriting from input (natir's suggestion in #10, separate concern).

One open question: niffler::Level goes to 21 for zstd but Io::new(u32, usize) clamps at 9. Happy to add a with_level constructor here or leave it as a follow-up.
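Since the writer side stays extension-keyed, the dispatch amounts to a lookup on the path suffix. The sketch below shows one way that could look using only the standard library. It is an assumption-laden illustration rather than the PR's code: the `Compression` enum here is a stand-in defined for this example, and matching is case-sensitive on lowercase extensions.

```rust
use std::path::Path;

/// Output codecs a writer could select (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Plain,
}

/// Pick an output codec from the final extension of `path`.
/// Unknown or missing extensions mean "write uncompressed".
pub fn compression_for_path(path: &Path) -> Compression {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("gz") | Some("bgz") => Compression::Gzip,
        Some("bz2") => Compression::Bzip2,
        Some("xz") => Compression::Xz,
        Some("zst") => Compression::Zstd,
        _ => Compression::Plain,
    }
}
```

Only the last extension is inspected, so `reads.fastq.gz` maps to gzip while `reads.gz.txt` maps to plain output.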