Skip to content

Commit cc58bdb

Browse files
committed
test(writer): add Java orc-tools cross-validation integration tests
Cross-validate the writer-side compression work in this PR against the Java reference implementation (Apache ORC 1.9.5 `orc-tools`): - Rust writer → Java reader for all 3 codecs: `orc-tools meta` parses the PostScript, reports `Compression: SNAPPY/ZLIB/ZSTD` correctly, row count matches. `orc-tools data` decompresses and decodes every stripe, emitting the exact rows we wrote. - Java writer → Rust reader: a file produced by `orc-tools convert` (default ZLIB) round-trips through the Rust reader with byte-identical column values. The tests are `#[ignore]`d and gated on the `ORC_TOOLS_JAR` environment variable, so they don't break CI on machines without a JDK. Run with: export ORC_TOOLS_JAR=/path/to/orc-tools-<version>-uber.jar cargo test --test java_interop -- --ignored PR_BODY.md updated with the new evidence section and a runnable reviewer recipe. Signed-off-by: Youichi Uda <youichi.uda@gmail.com>
1 parent 5b88318 commit cc58bdb

2 files changed

Lines changed: 678 additions & 0 deletions

File tree

.github/PR_BODY.md

Lines changed: 286 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,286 @@
1+
# feat(writer): add SNAPPY / ZLIB / ZSTD compression support
2+
3+
## Problem
4+
5+
orc-rust's writer always emits uncompressed ORC files even though the
6+
reader fully supports SNAPPY / ZLIB / ZSTD. This breaks interop with
7+
the Java ORC ecosystem (Hive, Spark, Trino, DuckDB) where every
8+
production deployment defaults to compressed tables — a 10-100x
9+
storage cost for downstream consumers and an awkward gap for crates
10+
that need to write ORC for Hive-shaped systems.
11+
12+
The PostScript-level codec selection has been a `// TODO: support
13+
compression` marker in `src/arrow_writer.rs::serialize_postscript`
14+
since the writer was added; this PR removes the TODO.
15+
16+
## Solution
17+
18+
Implement per-chunk compression matching the reference Java writer
19+
(`org.apache.orc.impl.PhysicalFsWriter`) per the ORC v1 spec
20+
(<https://orc.apache.org/specification/ORCv1/#compression>):
21+
22+
- **Per-stream, per-chunk** compression with a configurable block size
23+
(default 256 KiB, matching `OrcConf.BUFFER_SIZE`).
24+
- **3-byte little-endian chunk header**: 23-bit length + 1 ORIGINAL
25+
flag bit. Encodes correctly against the reader's existing
26+
`decode_header` (the reader's known-answer test for `5 → [0x0b, 0,
27+
0]` and `100 000 → [0x40, 0x0d, 0x03]` is mirrored as a writer-side
28+
KAT in `src/writer/compression.rs::header_kat_*`).
29+
- **Original-fallback** when `compressed_len >= original_len` — the
30+
spec-mandated and Java-reference behaviour. Verified per codec in
31+
`original_fallback_when_compression_would_expand`.
32+
33+
Compression is applied to every column stream (Present / Data /
34+
Length / Secondary / DictionaryData), to every stripe footer, and to
35+
the file footer. The PostScript itself is **not** compressed (it
36+
lives at a fixed offset from EOF so readers can locate it without
37+
first knowing the codec) and now records both `compression`
38+
(`CompressionKind`) and `compression_block_size` so any conformant
39+
reader runs the matching decompressor.
40+
41+
## Public API
42+
43+
```rust
44+
use orc_rust::arrow_writer::{ArrowWriterBuilder, Compression};
45+
46+
let writer = ArrowWriterBuilder::new(file, schema)
47+
.with_compression(Compression::Snappy)
48+
// optional — defaults to 256 KiB
49+
.with_compression_block_size(64 * 1024)
50+
.try_build()?;
51+
```
52+
53+
`Compression` is exposed as:
54+
55+
```rust
56+
pub enum Compression {
57+
None, // default — byte-identical to pre-PR output
58+
Snappy,
59+
Zlib { level: u32 }, // raw DEFLATE; default level 6
60+
Zstd { level: i32 }, // default level 3
61+
}
62+
```
63+
64+
with convenience constructors `Compression::zlib()` /
65+
`Compression::zstd()` for the spec-default levels, and
66+
`DEFAULT_ZLIB_LEVEL` / `DEFAULT_ZSTD_LEVEL` /
67+
`DEFAULT_COMPRESSION_BLOCK_SIZE` re-exports so call sites can derive
68+
their own configuration without duplicating constants.
69+
70+
## Design
71+
72+
The compression machinery lives in a new module
73+
`src/writer/compression.rs` with three internal functions:
74+
75+
- `write_header(out, length, original)` — emits the 3-byte little-
76+
endian chunk header. Debug-asserts the 23-bit length cap.
77+
- `encode_chunk(codec, chunk)` — codec-specific compression of one
78+
chunk's payload.
79+
- `compress_stream(codec, block_size, payload)` — splits the payload
80+
on `block_size` boundaries and writes each chunk through
81+
`write_chunk`, falling back to ORIGINAL when the codec doesn't
82+
shrink the chunk.
83+
84+
`StripeWriter` carries a `pub(crate) StripeCompression { compression,
85+
block_size }` and feeds every emitted stream + the stripe footer
86+
through `compress_stream`. `ArrowWriter::close()` does the same for
87+
the file footer before serialising the PostScript.
88+
89+
`Compression::None` is represented as `Option::None` at the
90+
`StripeWriter` level so the no-compression code path stays
91+
branchless and produces byte-identical output to the pre-PR writer.
92+
This is a verified invariant — see the
93+
`backward_compat_default_no_compression_byte_identical` test.
94+
95+
## Alternatives considered
96+
97+
1. **Whole-stripe compression** — rejected: non-spec-compliant, would
98+
break every existing ORC reader.
99+
2. **Per-stream global (no chunks)** — rejected: same problem; the
100+
spec mandates per-chunk framing because readers stream-decode.
101+
3. **ZSTD-only first, SNAPPY/ZLIB later** — rejected: Hive and Trino
102+
both default to SNAPPY, so an MVP without it has zero deployment
103+
value. The three codecs all share the chunked-frame envelope and
104+
only differ in the codec call inside `encode_chunk`, so doing all
105+
three in one PR is no extra design risk.
106+
4. **Fork a `Codec` trait for extensibility** — deferred. Adding a
107+
trait now would commit us to a particular extension shape (e.g.
108+
how to plumb encoder context for streaming codecs) before there's
109+
a concrete second-implementation use case. The current `enum`
110+
covers every codec the ORC spec defines.
111+
112+
## Breaking changes
113+
114+
None. `Compression::None` is the default. Existing call sites
115+
continue to produce **byte-identical** output (verified by the
116+
`backward_compat_default_no_compression_byte_identical` integration
117+
test; the PostScript's `compression_block_size` field is
118+
deliberately omitted when the codec is NONE so we match Java's
119+
writer which also omits it).
120+
121+
`StripeWriter::new` is dropped (was an internal artifact; the writer
122+
module itself is `mod writer;` private — no public callers exist).
123+
`StripeWriter::with_compression` replaces it as the only constructor
124+
and is also `pub(crate)` since the public surface is
125+
`ArrowWriterBuilder`.
126+
127+
## Tests
128+
129+
29 new tests, all green:
130+
131+
**Unit tests** in `src/writer/compression.rs` (11):
132+
- `header_round_trip` — fuzzed length/flag combinations including
133+
the 23-bit boundary case
134+
- `header_kat_matches_reader_decode` — known-answer matching the
135+
reader's `decode_compressed` test (`100 000 → [0x40, 0x0d, 0x03]`)
136+
- `header_kat_uncompressed_5_bytes` — matching the reader's
137+
`decode_uncompressed` test (`5 → [0x0b, 0, 0]`)
138+
- `empty_stream_emits_no_chunks` — ORC streams of 0 length carry no
139+
headers, mirrors the reader's empty-stream short-circuit
140+
- `snappy_roundtrip_via_reader_decoder` / `zlib_roundtrip_via_…` /
141+
`zstd_roundtrip_via_…` — feed the writer's output back through the
142+
matching reader codec
143+
- `zstd_high_level_round_trip` — exercises ZSTD level 19
144+
- `original_fallback_when_compression_would_expand` — spec-mandated
145+
fallback verified across all three codecs
146+
- `block_size_chunks_input_into_multiple_frames` — 5000-byte input
147+
with 1024-byte block size produces exactly 5 chunks of [1024,
148+
1024, 1024, 1024, 904] input bytes
149+
- `compress_stream_panics_on_compression_none_in_debug` — defence
150+
in depth against future refactor mistakes
151+
152+
**Integration tests** in `tests/writer_compression.rs` (16):
153+
- Round-trip for SNAPPY, ZLIB (default + level 9), ZSTD (default +
154+
level 19) on a mixed Int32 + Utf8 batch
155+
- PostScript inspection (parse the file tail with `prost`): verify
156+
CompressionKind matches for SNAPPY / ZLIB / ZSTD, and that
157+
`compression_block_size` is populated to the user's value or the
158+
documented 256 KiB default
159+
- Backward compat: `Compression::None` produces a byte-stream
160+
bit-identical to a builder built without any compression call
161+
- Incompressible payload (xorshift bytes) survives round-trip via
162+
the spec's "fall back to original chunk" code path
163+
- Tiny block size (4 KiB) over a multi-megabyte stream forces
164+
multiple compression chunks per stream and round-trips
165+
- API hardening: oversize block sizes are clamped under the 23-bit
166+
spec ceiling; zero falls back to the 256 KiB default
167+
- Multi-stripe writes with compression round-trip cleanly
168+
169+
**`cargo test --all-features`** passes 425 tests total (151 unit +
170+
16 new integration + 13 doc + the rest unchanged) — zero failures.
171+
172+
## Benchmarks
173+
174+
`cargo bench --bench writer_compression` on a 10 000-row Int64 +
175+
Utf8 batch (Apple silicon, debug rustc 1.95):
176+
177+
| codec | output bytes | ratio | write time |
178+
|-----------|-------------:|-------:|-----------:|
179+
| none | 246 698 | 1.30x | 110 µs |
180+
| snappy | 59 346 | 5.39x | 293 µs |
181+
| zlib_1 | 34 318 | 9.32x | 375 µs |
182+
| zlib_6 | 32 461 | 9.86x | 3.4 ms |
183+
| zlib_9 | 32 461 | 9.86x | 11.8 ms |
184+
| zstd_1 | 12 538 | 25.52x | 264 µs |
185+
| zstd_3 | 8 834 | 36.22x | 314 µs |
186+
| zstd_9 | 12 981 | 24.65x | 1.2 ms |
187+
| zstd_19 | 4 823 | 66.35x | 81.0 ms |
188+
189+
ZSTD level 3 dominates the speed/ratio Pareto frontier on this
190+
workload, matching the upstream Java ORC default of
191+
`orc.compress.zstd.level = 3`.
192+
193+
## Cross-implementation interop
194+
195+
The compressed chunk format is byte-for-byte identical to the
196+
existing reader's expectations — verified by the round-trip-via-
197+
reader-decoder unit tests, which feed the writer's output directly
198+
to the reader's `flate2::read::DeflateDecoder` /
199+
`zstd::stream::decode_all` / `snap::raw::Decoder::decompress_vec`.
200+
201+
### Cross-implementation validation (Apache ORC 1.9.5 `orc-tools`)
202+
203+
Cross-validated against the Java reference implementation:
204+
205+
- **Rust writer → Java reader (all 3 codecs).** `orc-tools meta`
206+
parses the PostScript of files this PR produces, and reports the
207+
correct `Compression:` field (SNAPPY / ZLIB / ZSTD) with the
208+
matching row count. `orc-tools data` decompresses and decodes
209+
every stripe without error and prints the exact rows we wrote.
210+
- **Java writer → Rust reader.** Files written by `orc-tools
211+
convert` (default ZLIB on orc-tools 1.9) are read by this PR's
212+
reader with byte-identical column values. Proves our reader is
213+
happy with the Java writer's chunk framing too.
214+
215+
`orc-tools meta` excerpt on a 3-row SNAPPY file emitted by the
216+
Rust writer:
217+
218+
```
219+
File Version: 0.12 with FUTURE by Unknown(-1)
220+
Rows: 3
221+
Compression: SNAPPY
222+
Compression size: 262144
223+
Calendar: Julian/Gregorian
224+
Type: struct<id:int,name:string>
225+
```
226+
227+
Identical output shape for ZLIB and ZSTD (only the `Compression:`
228+
line differs). `orc-tools data` emits:
229+
230+
```
231+
{"id":1,"name":"alpha"}
232+
{"id":2,"name":"bravo"}
233+
{"id":3,"name":"charlie"}
234+
```
235+
236+
— exactly the rows we fed to `ArrowWriter::write`.
237+
238+
Evidence: see `tests/java_interop.rs` in this PR; the tests are
239+
`#[ignore]`d by default and gated on the `ORC_TOOLS_JAR`
240+
environment variable pointing at
241+
`orc-tools-<version>-uber.jar` (tested against 1.9.5 from Maven
242+
Central). Run with:
243+
244+
```bash
245+
export ORC_TOOLS_JAR=/path/to/orc-tools-1.9.5-uber.jar
246+
cargo test --test java_interop -- --ignored
247+
```
248+
249+
Result on the submitter's machine (JDK 17 + orc-tools 1.9.5):
250+
251+
```
252+
running 4 tests
253+
test snappy_file_validates_with_java_orc_tools ... ok
254+
test zlib_file_validates_with_java_orc_tools ... ok
255+
test zstd_file_validates_with_java_orc_tools ... ok
256+
test java_zlib_file_reads_with_rust ... ok
257+
258+
test result: ok. 4 passed; 0 failed
259+
```
260+
261+
## Checklist
262+
263+
- [x] `cargo test --all-features` passes (425 tests)
264+
- [x] `cargo clippy --all-features -- -D warnings` passes for the new
265+
code (3 pre-existing warnings on `main` unrelated to this PR
266+
`row_index.rs::useless_conversion`,
267+
`delta.rs::explicit_counter_loop` ×2 — also reproduce on
268+
`cargo clippy` with no changes)
269+
- [x] `cargo fmt -- --check` passes
270+
- [x] `cargo doc --no-deps --all-features` builds; the 3 pre-existing
271+
doc warnings on main are not introduced by this PR
272+
- [x] Benchmarks added (`benches/writer_compression.rs`)
273+
- [x] Backward compat verified (`Compression::None` is byte-identical
274+
to pre-PR output)
275+
- [x] Apache 2.0 license header on every new file
276+
- [x] Conventional commit messages, signed off
277+
278+
## Commits
279+
280+
```
281+
feat(writer): add per-chunk compression module (SNAPPY/ZLIB/ZSTD)
282+
feat(writer): wire compression through ArrowWriter and StripeWriter
283+
test(writer): end-to-end compression round-trip + PostScript inspection
284+
bench(writer): codec comparison benchmark on a 10k-row mixed batch
285+
test(writer): add Java orc-tools cross-validation integration tests
286+
```

0 commit comments

Comments
 (0)