Apache Arrow & Parquet on IBM s390x (Big-Endian) Architectures

This note captures the background, design choices, and concrete code changes that were made to ensure Arrow’s Parquet implementation behaves correctly on IBM s390x (aka IBM Z) CPUs, which are big-endian by default. It is intentionally verbose so future contributors can quickly understand why these changes exist and what assumptions they rely on.

1. Why Endianness Matters

  • Little-endian by design: Both Arrow’s in-memory layout and the Parquet file format specify that primitive values are serialized in little-endian byte order. This assumption holds implicitly on mainstream architectures such as x86_64 and arm64, so the codebase historically relied on raw memcpy and direct pointer casts for IO.
  • s390x is big-endian: IBM’s s390x architecture stores multi-byte words most-significant byte first. Simply copying host-order bytes into a Parquet page (or reading them back) results in flipped values. For example, the 32-bit integer 0x01020304 would be written as 01 02 03 04 on s390x but Parquet readers expect 04 03 02 01.
  • Mixed deployments: Arrow / Parquet data frequently crosses machine boundaries. Pages written on s390x must remain interoperable with the overwhelmingly little-endian ecosystem. Likewise, s390x readers must accept metadata and page buffers produced elsewhere.

2. High-Level Strategy

  1. Build-time opt-in: Rather than performing byte swapping on every platform, we added a CMake option that is automatically enabled on s390x. Other architectures can enable it manually for testing.
  2. Centralized helpers: A new header, parquet/endian_internal.h, provides reusable utilities (LoadLittleEndianScalar, PrepareLittleEndianBuffer, ConvertLittleEndianInPlace, etc.) that encapsulate the conversion logic. This keeps encoder/decoder code readable and ensures all primitives (including Int96) follow the same rules.
  3. Touch every IO path: We audited Parquet’s primitive encoders and decoders (plain, dictionary, byte-stream split, boolean RLE, delta binary packed) and wrapped each memory copy with the helpers so data is always stored as little-endian on disk and converted to native endianness in memory.

3. File-by-File Changes

3.1 cpp/cmake_modules/DefineOptions.cmake

  • Added ARROW_ENSURE_S390X_ENDIANNESS (default ON when CMAKE_SYSTEM_PROCESSOR matches s390x, OFF otherwise). This flag controls whether the Parquet build injects the new compile definition.
  • Rationale: Allows big-endian safety to be enabled automatically for IBM Z builds while remaining opt-in (zero-overhead) for the common little-endian targets.

3.2 cpp/src/parquet/CMakeLists.txt

  • When the above option is enabled we now pass -DPARQUET_INTERNAL_ENSURE_LITTLE_ENDIAN_IO to every Parquet target via target_compile_definitions.
  • Rationale: Keeps the implementation contained inside Parquet sources without modifying the rest of Arrow. The macro acts as the gate for the helpers in the new header.

3.3 cpp/src/parquet/endian_internal.h (new file)

  • Introduces parquet::internal helpers (backed by arrow/util/endian.h for reliable host-endian detection):
    • NeedsEndianConversion<T> traits that turn on swapping for all integral/floating types (except bool) while explicitly skipping Parquet wrapper types like Int96 and FixedLenByteArray that manage their own layout.
    • ToLittleEndianValue / FromLittleEndianValue for scalar conversions.
    • LoadLittleEndianScalar, DecodeValues, PrepareLittleEndianBuffer, and ConvertLittleEndianInPlace to cover write-path preparation and read-path fixups.
  • Rationale: Provides a single, well-tested place to reason about endianness rather than sprinkling #ifdef __s390x__ logic throughout the code. The helpers also make it trivial to cover newly introduced encodings in the future. The conversions are now always enabled on big-endian hosts (even if the explicit s390x flag isn’t passed) so reading/writing is canonical by default.
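
A minimal sketch of what such a header can look like is shown below. The helper names are taken from the description above, but the exact signatures, the PARQUET_INTERNAL_ENSURE_LITTLE_ENDIAN_IO gating, and the treatment of wrapper types in the real endian_internal.h may differ; the sketch assumes ARROW_LITTLE_ENDIAN (from arrow/util/endian.h) expands to 1 on little-endian hosts and 0 otherwise.

```cpp
// Hedged sketch only -- not the actual endian_internal.h.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <type_traits>

#include "arrow/util/endian.h"  // ARROW_LITTLE_ENDIAN, arrow::bit_util helpers

namespace parquet::internal {

// Swap only plain integral/floating primitives wider than one byte; bool and
// wrapper types such as Int96 / FixedLenByteArray manage their own layout.
template <typename T>
struct NeedsEndianConversion
    : std::integral_constant<bool, std::is_arithmetic_v<T> &&
                                   !std::is_same_v<T, bool> && (sizeof(T) > 1)> {};

template <typename T>
T ToLittleEndianValue(T value) {
#if ARROW_LITTLE_ENDIAN
  return value;  // no-op on little-endian hosts
#else
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  std::reverse(bytes, bytes + sizeof(T));
  std::memcpy(&value, bytes, sizeof(T));
  return value;
#endif
}

template <typename T>
T FromLittleEndianValue(T value) {
  return ToLittleEndianValue(value);  // a byte swap is its own inverse
}

// Read one primitive from a raw (possibly unaligned) little-endian page buffer.
template <typename T>
T LoadLittleEndianScalar(const uint8_t* src) {
  T value;
  std::memcpy(&value, src, sizeof(T));
  return FromLittleEndianValue(value);
}

// Write path: return a pointer to num_values little-endian values, copying
// through `scratch` only when a swap is actually required.
template <typename T>
const T* PrepareLittleEndianBuffer(const T* src, int64_t num_values, T* scratch) {
#if ARROW_LITTLE_ENDIAN
  return src;  // already canonical; no copy, no branch at runtime
#else
  if constexpr (NeedsEndianConversion<T>::value) {
    for (int64_t i = 0; i < num_values; ++i) scratch[i] = ToLittleEndianValue(src[i]);
    return scratch;
  } else {
    return src;  // bool / wrapper types keep their own layout
  }
#endif
}

// Read path: rewrite a decoded little-endian buffer into host order in place.
template <typename T>
void ConvertLittleEndianInPlace(T* values, int64_t num_values) {
#if !ARROW_LITTLE_ENDIAN
  if constexpr (NeedsEndianConversion<T>::value) {
    for (int64_t i = 0; i < num_values; ++i) values[i] = FromLittleEndianValue(values[i]);
  }
#endif
}

}  // namespace parquet::internal
```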

3.4 cpp/src/parquet/encoder.cc

Key updates (all gated by the helper trait so little-endian builds remain unchanged):

  1. Plain encoder paths
    • Each PlainEncoder<T> instance carries a scratch std::vector<T> used to hold little-endian copies before appending to the BufferBuilder.
    • The array-based DirectPutImpl now uses PrepareLittleEndianBuffer so dictionary and spaced writes also emit the correct order.
  2. Byte-array helpers
    • Length prefixes are serialized with ToLittleEndianValue before being copied alongside their payloads. This ensures uint32 metadata is interoperable.
  3. Byte Stream Split encoder
    • Uses the same scratch-buffer approach so the column-wise interleaving logic operates on explicitly little-endian sequences.
  4. Dictionary encoder
    • After memo_table_.CopyValues, the scalar dictionary buffer is rewritten in little-endian order prior to spilling to disk.
    • Binary dictionaries (variable-length) now store lengths as little-endian integers.

Rationale: Parquet pages are written exactly once per flush, so doing the swap nearest to the final append avoids touching the rest of the pipeline and keeps hot loops free of branching on little-endian hosts.
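
For illustration, here is a hedged sketch of the write-path pattern. It is not the actual PlainEncoder<T>::Put from encoder.cc; the free-function shape and helper names are assumptions based on the description above, reusing the helpers sketched in 3.3.

```cpp
#include <cstdint>
#include <vector>

#include "arrow/buffer_builder.h"
#include "parquet/exception.h"
// plus the PrepareLittleEndianBuffer sketch from section 3.3

template <typename T>
void PlainEncoderPutSketch(const T* values, int64_t num_values,
                           ::arrow::BufferBuilder* sink, std::vector<T>* scratch) {
  scratch->resize(static_cast<size_t>(num_values));
  const T* src = parquet::internal::PrepareLittleEndianBuffer(values, num_values,
                                                              scratch->data());
  // On little-endian builds src == values and no copy happened above; on s390x
  // src points at the little-endian scratch copy that is flushed to the page.
  PARQUET_THROW_NOT_OK(sink->Append(src, num_values * static_cast<int64_t>(sizeof(T))));
}
```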

3.5 cpp/src/parquet/decoder.cc

Mirrors the encoder fixes for the read path:

  1. Plain decoder
    • Adds a temporary vector used when NeedsEndianConversion<T> is true. Data is read from the page buffer with DecodeValues / LoadLittleEndianScalar before being appended to Arrow builders so the consumer always sees native-endian values.
    • DecodePlain<T> leverages the helper to rewrite the destination buffer directly.
  2. ByteArray decoding
    • Length checks now read the prefix via LoadLittleEndianScalar<int32_t> so the existing range validation still applies on big-endian machines.
  3. Dictionary decoder
    • Scalar dictionaries decode indices as before, but values fetched from the dictionary buffer are converted via the helper before being pushed into Arrow builders.
  4. ByteStreamSplit decoder
    • After interleaving bytes, ConvertLittleEndianInPlace runs over the decoded buffer (both for Arrow builders and for direct Decode calls), making the results uniform with other encodings. This is critical now that the byte stream encoder/decoder explicitly operate on canonical little-endian payloads.
  5. Boolean and byte-array paths
    • Boolean RLE length prefixes already used FromLittleEndian; the new helper keeps the behavior consistent.

Rationale: Readers must accept data written on any architecture. Performing the conversion immediately after extracting bytes from a Parquet page keeps Arrow’s downstream algorithms agnostic to machine endianness.
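
A matching read-path sketch follows (hypothetical helper; the real decoder.cc plumbing is more involved and goes through DecodeValues / LoadLittleEndianScalar as described above).

```cpp
#include <cstdint>
#include <cstring>

#include "parquet/exception.h"
// plus the ConvertLittleEndianInPlace sketch from section 3.3

template <typename T>
int DecodePlainSketch(const uint8_t* data, int64_t data_size, int num_values, T* out) {
  const int64_t bytes_needed = static_cast<int64_t>(num_values) * sizeof(T);
  if (bytes_needed > data_size) {
    throw parquet::ParquetException("Not enough bytes in the data page");
  }
  std::memcpy(out, data, static_cast<size_t>(bytes_needed));
  // Page bytes are canonical little-endian; rewrite them as native values so
  // downstream Arrow builders always see host-order data.
  parquet::internal::ConvertLittleEndianInPlace(out, num_values);
  return num_values;
}
```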

3.6 cpp/src/arrow/util/bit_stream_utils_internal.h

  • GetBatch() SIMD guard: The ByteStreamSplit/Delta machinery relies on Arrow’s generic bit unpacker. All vectorized unpack32 / unpack64 implementations assume their input words are little-endian. We now wrap those calls in #if ARROW_LITTLE_ENDIAN and fall back to the portable scalar path on big-endian hosts. This prevents the unpackers from producing byte-swapped deltas (the root cause of the final INT64 delta failure).
  • GetVlqInt() cache normalization: When enough bytes are buffered we parse VLQs straight from the cached 64-bit word. That cache is stored in host byte order to keep bit twiddling fast, so on s390x the VLQ parser previously read reversed bytes. We now convert the cached word with bit_util::ToLittleEndian before exposing it to the parser, ensuring header fields—such as delta block sizes—are decoded identically on every architecture.
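
To make the second point concrete, here is a hypothetical, self-contained version of the cache normalization (the real GetVlqInt() works on the reader's internal buffered state and has more bookkeeping). The assumption is that `cached` holds the next stream bytes memcpy'd into a uint64_t, so its in-memory bytes are still in stream order; bit_util::ToLittleEndian then makes the low-order byte of the value correspond to stream byte 0 on every host.

```cpp
#include <cstdint>

#include "arrow/util/endian.h"

// Decode one ULEB128 (VLQ) uint32 from up to five buffered stream bytes.
inline uint32_t ParseVlqFromCachedWord(uint64_t cached, int* bytes_consumed) {
  const uint64_t word = ::arrow::bit_util::ToLittleEndian(cached);
  uint32_t result = 0;
  int consumed = 0;
  for (int i = 0; i < 5; ++i) {  // a uint32 VLQ spans at most 5 bytes
    const uint8_t byte = static_cast<uint8_t>(word >> (8 * i));
    result |= static_cast<uint32_t>(byte & 0x7F) << (7 * i);
    ++consumed;
    if ((byte & 0x80) == 0) break;
  }
  *bytes_consumed = consumed;
  return result;
}
```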

3.7 ByteStreamSplit canonicalization (cpp/src/parquet/encoder.cc, decoder.cc)

  • Encoder: Every INT32/INT64/FLOAT/DOUBLE value goes through parquet::internal::PrepareLittleEndianBuffer before ::arrow::util::internal::ByteStreamSplitEncode is called. This produces the canonical little-endian byte streams mandated by the Parquet spec (apache/parquet-format#192), fixing the CheckOnlyEncode failures that compared host-endian output against golden data.
  • Decoder: Immediately after ByteStreamSplitDecode, we invoke parquet::internal::ConvertLittleEndianInPlace so the reconstructed little-endian payloads are reinterpreted as native values on s390x. Little-endian builds continue to run the fast path with zero overhead.
  • Tests: TestByteStreamSplitEncoding::CheckEncode now feeds native-order input values (rather than pre-swapped ToLittleEndian() buffers) so the encode-only tests exercise the same conversion logic used in production. The encoder under test is responsible for writing little-endian bytes per the spec. Likewise, CheckDecode expects native-order scalars so the decoder must convert the canonical byte stream back into platform representation. These adjustments avoid double-swapping on big-endian hosts now that the implementation itself is endian-safe.
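
A standalone sketch of the canonical ordering is shown below. The production code calls ::arrow::util::internal::ByteStreamSplitEncode on a buffer prepared by PrepareLittleEndianBuffer rather than this hand-rolled loop; this is only meant to illustrate why canonicalizing before splitting matters.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

#include "arrow/util/endian.h"  // ARROW_LITTLE_ENDIAN

// Split byte b of every value into stream b, after canonicalizing each value
// to little-endian, so the streams match the Parquet spec on any host.
template <typename T>
std::vector<uint8_t> ByteStreamSplitEncodeLE(const T* values, int64_t n) {
  std::vector<uint8_t> out(static_cast<size_t>(n) * sizeof(T));
  for (int64_t i = 0; i < n; ++i) {
    uint8_t bytes[sizeof(T)];
    std::memcpy(bytes, &values[i], sizeof(T));
#if !ARROW_LITTLE_ENDIAN
    std::reverse(bytes, bytes + sizeof(T));  // host (big-endian) -> canonical LE
#endif
    for (size_t b = 0; b < sizeof(T); ++b) {
      out[b * static_cast<size_t>(n) + static_cast<size_t>(i)] = bytes[b];
    }
  }
  return out;
}
```

Decoding mirrors this: interleave the streams back into per-value byte groups and then run ConvertLittleEndianInPlace over the result.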

3.8 cpp/src/parquet/bloom_filter.cc

  • BlockSplitBloomFilter bitsets: Parquet’s block Bloom filters store their 32-bit words in little-endian order on disk (matching parquet-mr). The C++ implementation was previously reinterpreting those words directly, which works on x86 but results in byte-swapped lookups and serialized output on s390x. We now load each 32-bit word with arrow::bit_util::FromLittleEndian before testing/setting bits, and write the updated value back with ToLittleEndian. This makes FindHash succeed against parquet-mr bloom filters (e.g., bloom_filter.xxhash.bin) and ensures bloom filters we write byte-for-byte match the Java reference implementation.
  • Round-trip note: BasicRoundTrip/RoundTripSingleElement/RoundTripSpace already passed on s390x because encode+decode both used host order. The explicit little-endian conversions keep those tests green while also matching the canonical test vectors for encode-only and decode-only scenarios.
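
A sketch of the word-access pattern described above (a hypothetical free function, not the actual BlockSplitBloomFilter member):

```cpp
#include <cstdint>
#include <cstring>

#include "arrow/util/endian.h"

// Test a bit in a bitset whose 32-bit words are serialized little-endian, and
// optionally set it, writing the updated word back in little-endian order.
inline bool TestAndMaybeSetBit(uint8_t* bitset, uint32_t word_index, uint32_t mask,
                               bool set) {
  uint32_t word;
  std::memcpy(&word, bitset + word_index * sizeof(uint32_t), sizeof(uint32_t));
  word = ::arrow::bit_util::FromLittleEndian(word);  // interpret the on-disk word
  const bool present = (word & mask) != 0;
  if (set && !present) {
    const uint32_t updated = ::arrow::bit_util::ToLittleEndian(word | mask);
    std::memcpy(bitset + word_index * sizeof(uint32_t), &updated, sizeof(uint32_t));
  }
  return present;
}
```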

3.9 cpp/src/parquet/level_conversion_inc.h

  • Validity bitmaps: GreaterThanBitmap produces canonical little-endian words. Before writing those bits into the Arrow validity buffer we convert them back to host order so FirstTimeBitmapWriter (which already performs the little-endian normalization) doesn’t undo the swap on big-endian machines. Without this fix, the validity bitmap and the reported null_count could disagree, causing SpacedExpandRightward assertions during optional column decoding.
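
A sketch of the hand-off, assuming a bitmap-writer type with an AppendWord(word, bit_count) method as described above; the surrounding definition-level processing is omitted.

```cpp
#include <cstdint>

#include "arrow/util/endian.h"

// GreaterThanBitmap yields canonical little-endian bitmap words; the writer
// performs its own little-endian normalization, so hand it a host-order word
// to avoid swapping twice on big-endian machines.
template <typename BitmapWriter>
void AppendValidityWord(BitmapWriter* writer, uint64_t greater_than_bitmap_le,
                        int64_t bit_count) {
  writer->AppendWord(::arrow::bit_util::FromLittleEndian(greater_than_bitmap_le),
                     bit_count);
}
```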

3.10 cpp/src/parquet/column_writer.cc

  • Decimal serialization: Arrow stores Decimal32/64/128/256 values in little-endian two’s complement form regardless of host architecture. The Parquet writer reinterpreted those buffers as uint64_t words and fed them directly into ToBigEndian, which only works on little-endian CPUs. On s390x the words were read with reversed significance, so CDC-written files reconstructed different decimal values. We now guard the conversion: little-endian builds keep the FromLittleEndianToBigEndian pipeline, while big-endian builds copy the native bytes directly because Arrow’s in-memory layout already matches Parquet’s big-endian requirement. This avoids double-swapping on s390x and keeps decimal pages identical across all architectures.
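
A hedged sketch of the guarded conversion for the 16-byte Decimal128 case (a hypothetical free function; the real writer works on Arrow array buffers and also handles the 4-, 8-, and 32-byte widths):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

#include "arrow/util/endian.h"

// Convert one Decimal128 value to the big-endian two's-complement byte string
// Parquet expects for FIXED_LEN_BYTE_ARRAY decimals.
inline std::array<uint8_t, 16> DecimalToParquetBytes(const uint8_t* arrow_value_bytes) {
  std::array<uint8_t, 16> out;
#if ARROW_LITTLE_ENDIAN
  uint64_t lo, hi;
  std::memcpy(&lo, arrow_value_bytes, 8);      // little-endian low word
  std::memcpy(&hi, arrow_value_bytes + 8, 8);  // little-endian high word
  const uint64_t be_hi = ::arrow::bit_util::ToBigEndian(hi);
  const uint64_t be_lo = ::arrow::bit_util::ToBigEndian(lo);
  std::memcpy(out.data(), &be_hi, 8);          // most-significant word first
  std::memcpy(out.data() + 8, &be_lo, 8);
#else
  // Big-endian hosts: per the description above, the in-memory bytes already
  // match Parquet's big-endian requirement, so copy them through unchanged.
  std::memcpy(out.data(), arrow_value_bytes, 16);
#endif
  return out;
}
```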

3.11 cpp/src/parquet/column_reader.cc

  • Level decoder length prefix: The RLE/bit-packed level stream begins with a 4-byte little-endian length. LevelDecoder::SetData previously read this prefix in host byte order, so s390x builds interpreted values like 02 00 00 00 as 0x02000000, blowing past data_size and throwing “Received invalid number of bytes (corrupt data page?)” when opening parquet-testing decimal files. The length is now loaded as a uint32_t and passed through bit_util::FromLittleEndian, restoring canonical decoding on big-endian hosts without affecting little-endian behavior.
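
For reference, a hypothetical helper showing the corrected prefix handling (the real logic lives inside LevelDecoder::SetData):

```cpp
#include <cstdint>
#include <cstring>

#include "arrow/util/endian.h"
#include "parquet/exception.h"

// Read the 4-byte little-endian length that precedes the RLE level run and
// validate it against the bytes actually available in the page.
inline uint32_t ReadLevelRunLength(const uint8_t* data, int64_t data_size) {
  if (data_size < 4) {
    throw parquet::ParquetException("Received invalid levels (corrupt data page?)");
  }
  uint32_t num_bytes;
  std::memcpy(&num_bytes, data, sizeof(num_bytes));
  num_bytes = ::arrow::bit_util::FromLittleEndian(num_bytes);
  if (static_cast<int64_t>(num_bytes) > data_size - 4) {
    throw parquet::ParquetException("Received invalid number of bytes (corrupt data page?)");
  }
  return num_bytes;
}
```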

3.12 cpp/src/parquet/column_writer.cc (level RLE headers)

  • Little-endian RLE prefixes: Data page V1 stores definition/repetition levels by prefixing the RLE body with a 4-byte little-endian length. Our writer simply wrote len() via reinterpret_cast<int32_t*>, which emitted host-endian bytes. After fixing the reader we saw CDC regressions because self-generated pages still carried big-endian prefixes on s390x. RleEncodeLevels now uses bit_util::ToLittleEndian plus SafeStore to serialize the prefix so writer and reader follow the same spec-mandated layout.
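
The writer side is the mirror image; a minimal sketch (with memcpy standing in for the SafeStore call used in column_writer.cc):

```cpp
#include <cstdint>
#include <cstring>

#include "arrow/util/endian.h"

// Store the RLE body length as a 4-byte little-endian prefix ahead of the levels.
inline void WriteLevelRunLength(uint8_t* dest, int32_t rle_body_len) {
  const int32_t le_len = ::arrow::bit_util::ToLittleEndian(rle_body_len);
  std::memcpy(dest, &le_len, sizeof(le_len));
}
```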

3.13 Tests & benchmarks (column_writer_test.cc, column_reader_benchmark.cc, column_io_benchmark.cc)

  • Spec-compliant helpers: The unit tests and benchmarks that hand-roll encoded level buffers also wrote raw host-endian prefixes. These helpers now store the prefix with bit_util::ToLittleEndian (and memcpy) so they continue to validate and benchmark the production code paths on any architecture.

3.14 Statistics helpers (statistics_test.cc, types.cc)

  • Canonical statistic bytes: The gtests compare EncodeValue(value) against stats->EncodeMin()/EncodeMax(). The helper used to memcpy host-order bytes, so every integer/float assertion failed on big-endian systems even though the statistics encoder already emitted little-endian payloads. EncodeValue now calls bit_util::ToLittleEndian, guaranteeing the test harness produces the same on-wire representation the writer emits.
  • Sort-order fixtures respect the LE spec: TestStatisticsSortOrder.MinMax used reinterpret_cast to copy native ints/floats into stats_, which silently relied on host byte order. On s390x the reference min/max blobs therefore disagreed with EncodeMin/EncodeMax() even though the production code was correct. The test now routes the fixtures through a small EncodeAsLittleEndianBytes helper (backed by bit_util::ToLittleEndian) so the expectations always describe canonical Parquet encodings (a sketch of such a helper follows this list).
  • Runtime stats reuse EncodeValue logic: The encode/decode helpers now reuse the exact same bit packing as EncodeValue in statistics_test, so INT32/64 and FLOAT/DOUBLE min/max payloads are byte-for-byte identical to the canonical expectations on both LE and BE hosts.
  • Integer/float stats encode canonical bytes: The runtime stats were still emitting host-order min/max buffers by piping values through the plain encoder. On s390x that left the Int32/Int64Extrema and FloatStatistics.* suites comparing LE fixture bytes against BE stats payloads. The stat specialisations now bypass the generic encoder and copy the raw INT32/INT64/FLOAT/DOUBLE bits into a string after running them through bit_util::ToLittleEndian. Likewise, PlainDecode uses FromLittleEndian so negative zeros and NaNs preserve their IEEE754 ordering regardless of host endianness.
  • Decoding for printing: FormatNumericValue, FormatDecimalValue, and the INT96 branch of FormatStatValue previously interpreted the raw byte blobs in host endianness. On s390x that swapped min/max strings and even reordered INT96 components (“2048 1024 4096”). These helpers now load each 32/64-bit word in little-endian order via bit_util::FromLittleEndian, so statistics printed in parquet-internals-test (and the generic type printer) match the canonical representation on every architecture.
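
A sketch of the kind of helper described above; the actual EncodeValue / EncodeAsLittleEndianBytes helpers in statistics_test.cc may differ in detail. It serializes a primitive statistic to the canonical little-endian byte string the writer emits.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

#include "arrow/util/endian.h"  // ARROW_LITTLE_ENDIAN

// Encode one INT32/INT64/FLOAT/DOUBLE statistic as canonical little-endian bytes.
template <typename T>
std::string EncodeStatAsLittleEndianBytes(T value) {
  static_assert(std::is_arithmetic_v<T>, "primitive statistics only");
  uint8_t bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
#if !ARROW_LITTLE_ENDIAN
  std::reverse(bytes, bytes + sizeof(T));  // canonicalize to the on-wire order
#endif
  return std::string(reinterpret_cast<const char*>(bytes), sizeof(T));
}
```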

3.15 Statistics smoke tests (types_test.cc)

  • Host-order literals vs. little-endian encodings: TypePrinter.StatisticsTypes creates raw strings that mimic encoded statistics and feeds them into FormatStatValue. The test had been writing those strings by reinterpreting native ints / floats / decimal integers / INT96 structs in place, which only matched the Parquet encoding on little-endian hosts. On s390x the same bytes describe different numeric values, so the assertions failed even though FormatStatValue behaved correctly. The test now uses ::arrow::bit_util::ToLittleEndian to serialize the INT32/INT64 (including their decimal variants), FLOAT/DOUBLE bit patterns, and each 32-bit lane of the INT96 fixtures before passing them to FormatStatValue, making the test data match the spec and preventing future regressions when inspecting stats produced on big-endian machines. This change documents the intent: we are validating the formatting code against canonical little-endian stat payloads, not the host’s in-memory layout.

  • Page index round-trip expectations: The page-index tests use helper lambdas such as encode_int64 / encode_double to build the expected byte strings for column-index min/max values. Those helpers previously reinterpreted the host representation (again, little-endian by accident on x86). They now serialize through ::arrow::bit_util::ToLittleEndian, mirroring how the writer stores min/max inside the page index, so the round-trip assertions compare canonical Parquet payloads rather than architecture-specific layouts.

3.16 Dataset statistics expressions (file_parquet_test.cc)

  • Fragment stats fixtures encode canonical bytes: TestParquetStatistics.NoNullCount fabricates EncodedStatistics blobs by copying host-order int32_t values straight into the metadata shim. Once we taught the runtime to expect little-endian statistics, those fixtures started decoding as 16777216 / 1677721600 on s390x, even though the dataset predicate builder had nothing to do with the regression. The helper now serializes all INT32 stats through ::arrow::bit_util::ToLittleEndian before passing them to parquet::Statistics::Make, ensuring the test mimics real Parquet metadata on both LE and BE hosts. As a result, ParquetFileFragment::EvaluateStatisticsAsExpression sees the same [1, 100] range everywhere and the dataset-level filter rewrites stay valid.

3.17 INT96 unit tests (statistics_test.cc, types_test.cc)

  • Comparator fixtures use canonical words: The Comparison.SignedInt96 and Comparison.UnsignedInt96 suites built Int96 values with brace-initializers such as {{1, 41, 14}}, which only describe the intended [lo32, hi32, day] words if the host is little-endian. On s390x those aggregates left the comparator seeing byte-swapped values and the unsigned ordering tests failed. The runtime helpers (Int96SetNanoSeconds, Int96SetJulianDay, DecodeInt96Timestamp) now always write/read little-endian words, the comparator decodes each 32-bit lane via bit_util::FromLittleEndian, a tiny MakeCanonicalInt96 helper runs each fixture lane through bit_util::ToLittleEndian, and the timestamp checks set the day via Int96SetJulianDay, so both signed and unsigned comparator tests target the canonical layout.
  • DecodeInt96Timestamp inputs mirror production layout: TestInt96Timestamp.Decoding previously stuffed host-order words directly into Int96::value[0..2]. After the runtime switched to canonical little-endian storage this fed Int96GetNanoSeconds garbage on s390x (e.g., reporting 5976838244624236544 instead of the expected -9223286400000000000). The test now uses Int96SetJulianDay together with a little-endian memcpy of the 64-bit nanoseconds-of-day field, preserving the original logical expectations (including the 0xffffffffffffffff wraparound case) on every host.
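
For concreteness, here is a sketch of the canonical Int96 handling implied above, assuming the usual Parquet layout of value[0] = low 32 bits of nanoseconds-of-day, value[1] = high 32 bits, value[2] = Julian day, with each lane stored little-endian. These are hypothetical stand-ins for the Int96Set*/DecodeInt96Timestamp helpers, not their exact signatures.

```cpp
#include <cstdint>

#include "arrow/util/endian.h"
#include "parquet/types.h"  // parquet::Int96

// Write the three 32-bit lanes in canonical little-endian order.
inline void SetCanonicalInt96(parquet::Int96* i96, int32_t julian_day,
                              int64_t nanos_of_day) {
  const auto lo = static_cast<uint32_t>(static_cast<uint64_t>(nanos_of_day) & 0xFFFFFFFFu);
  const auto hi = static_cast<uint32_t>(static_cast<uint64_t>(nanos_of_day) >> 32);
  i96->value[0] = ::arrow::bit_util::ToLittleEndian(lo);
  i96->value[1] = ::arrow::bit_util::ToLittleEndian(hi);
  i96->value[2] = ::arrow::bit_util::ToLittleEndian(static_cast<uint32_t>(julian_day));
}

// Reassemble nanoseconds-of-day from the explicit low/high lanes.
inline int64_t GetCanonicalInt96Nanos(const parquet::Int96& i96) {
  const uint64_t lo = ::arrow::bit_util::FromLittleEndian(i96.value[0]);
  const uint64_t hi = ::arrow::bit_util::FromLittleEndian(i96.value[1]);
  return static_cast<int64_t>((hi << 32) | lo);
}
```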

3.18 Reader test level builders (test_util.h)

  • Level RLE headers are canonical: The mock DataPageBuilder that feeds the record reader tests wrote the 4-byte RLE length prefix for definition/repetition levels by memcpy’ing a host-order int32_t. On s390x that produced 0x0D000000-style lengths, so LevelDecoder::SetData rejected the page as “corrupt”. The builder now runs the length through bit_util::ToLittleEndian before writing it, matching the spec and keeping the parquet-reader tests architecture-independent.

  • Impala Int96 conversion: NanosecondsToImpalaTimestamp copied the 64-bit time-of-day chunk directly into the first two Int96::value lanes. On big-endian hosts this swapped the order of those 32-bit words relative to the Impala layout, breaking TestImpalaConversion.ArrowTimestampToImpalaTimestamp. The helper now explicitly splits the raw nanosecond bit pattern into low/high 32-bit words and writes them into value[0]/value[1], preserving the correct [lo32, hi32, julian_day] layout on every architecture. The Int96 helper functions (including a dedicated Int96SetJulianDay) now always write and read canonical little-endian words, and the column writer stores both the Julian day word and the nanoseconds-of-day words via bit_util::ToLittleEndian, so on-disk INT96 payloads match the little-endian layout expected by other implementations while DecodeInt96Timestamp decodes them symmetrically with FromLittleEndian. We also teach the new endian helpers that Int96 requires conversion by byte-swapping each 32-bit lane when crossing the Parquet IO boundary, so deprecated INT96 reads/writes (and their Arrow compatibility tests) once again observe canonical timestamps instead of host-endian garbage. To keep the hand-written tests and the Arrow reader in sync, Int96SetNanoSeconds now mirrors the writer layout and DecodeInt96Timestamp reconstructs the nanosecond payload by combining the explicit low/high words instead of assuming a raw 64-bit memcpy, which restores the historical semantics checked by the Arrow UseDeprecatedInt96, downsampling, and nested round-trip suites.

  • Half-float reads: The bridge previously read Parquet HALF_FLOAT payloads as fixed-size binary chunks and then called ChunkedArray::View to interpret them as Arrow HalfFloat arrays, but it never converted the 2-byte little-endian words into the host representation. On s390x every half value showed up byte-swapped (e.g., 0x3C00 read back as 0x003C). The transfer path now swaps the bytes for each chunk when !ARROW_LITTLE_ENDIAN before invoking View, so Arrow’s float16::bits() matches the on-disk encoding.

  • Half-float writes: Likewise, writing HalfFloat columns used to forward the host-order 16-bit payloads directly into Parquet’s FLBA encoder. On x86 the bytes were already little-endian, but on s390x the writer emitted reversed payloads that the (now fixed) reader interpreted as garbage. The HalfFloat serializer now copies each value into a scratch buffer after running it through bit_util::ToLittleEndian, so the on-disk bytes are canonical regardless of host architecture.
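
As a small illustration of the write side, a hypothetical staging helper (the real serializer operates on Arrow HalfFloatArray chunks and feeds the FLBA encoder):

```cpp
#include <cstdint>
#include <vector>

#include "arrow/util/endian.h"

// Stage HALF_FLOAT bit patterns as canonical 2-byte little-endian payloads.
inline std::vector<uint16_t> PrepareHalfFloatPayloads(const uint16_t* bits, int64_t n) {
  std::vector<uint16_t> scratch(static_cast<size_t>(n));
  for (int64_t i = 0; i < n; ++i) {
    scratch[i] = ::arrow::bit_util::ToLittleEndian(bits[i]);  // no-op on LE hosts
  }
  return scratch;
}
```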

3.19 Geospatial WKB helpers (test_util.cc, geospatial/util_internal_test.cc)

  • Canonical WKB fixtures: test::MakeWKBPoint emitted the geometry type and coordinate doubles using kWkbNativeEndianness, so s390x runs produced big-endian payloads even though the Parquet geospatial reader only accepts ISO WKB encoded as little-endian (the format mandated by the spec for parquet.geo). This made every TestGeometryUtil/WKBTestFixture.TestWKBBounderBounds/* case fail with swapped geometry type IDs such as 0x01000000. The helper now always writes the byte-order flag as 0x01 and runs each integer/double through bit_util::ToLittleEndian before copying into the buffer, guaranteeing that the tests feed spec-compliant WKB on every architecture. The inverse utility (GetWKBPointCoordinateXY) mirrors this by validating the little-endian flag and decoding the geometry type/coordinates with bit_util::FromLittleEndian, so the round-trip helpers agree on a single canonical encoding.
  • Bounder respects the byte-order flag: WKBGeometryBounder previously used #if defined(ARROW_LITTLE_ENDIAN) when deciding whether to swap the geometry type and coordinate fields. On big-endian builds the macro is defined (to 0), so the “little endian” branch executed unconditionally and little-endian WKB payloads were never byte-swapped. Every geometry type therefore decoded to values like 16777216, and the geospatial statistics stayed empty (is_valid() == false). The bounder now checks the value of ARROW_LITTLE_ENDIAN instead of just its presence, so we only skip byteswap on true little-endian hosts; s390x correctly swaps the geometry header and the stats tests pass.
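
To show what “spec-compliant WKB” means here, a hypothetical version of the point builder (mirroring what test::MakeWKBPoint now does; field widths per ISO WKB):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

#include "arrow/util/endian.h"  // ARROW_LITTLE_ENDIAN

// Build a little-endian ISO WKB point: order flag 0x01, uint32 geometry type 1,
// then the X and Y doubles, every multi-byte field serialized little-endian.
inline std::string MakeLittleEndianWKBPoint(double x, double y) {
  auto append_le = [](std::string* out, const void* src, size_t size) {
    uint8_t bytes[8];
    std::memcpy(bytes, src, size);
#if !ARROW_LITTLE_ENDIAN
    std::reverse(bytes, bytes + size);  // canonicalize multi-byte fields
#endif
    out->append(reinterpret_cast<const char*>(bytes), size);
  };
  std::string wkb;
  wkb.push_back('\x01');             // byte-order flag: little-endian
  const uint32_t geometry_type = 1;  // WKB Point
  append_le(&wkb, &geometry_type, sizeof(geometry_type));
  append_le(&wkb, &x, sizeof(x));
  append_le(&wkb, &y, sizeof(y));
  return wkb;
}
```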

4. Practical Guidance

  • Building on s390x: Simply configure as usual. The CMake logic detects CMAKE_SYSTEM_PROCESSOR=s390x, flips ARROW_ENSURE_S390X_ENDIANNESS to ON, and compiles Parquet with the necessary flag.
  • Testing locally (emulation / CI): You can force the conversions on any platform by passing -DARROW_ENSURE_S390X_ENDIANNESS=ON to cmake. This is useful for CI jobs that want to exercise the code paths without native hardware.
  • Performance considerations: On little-endian CPUs the new helpers compile away to the previous behavior (no copies, no extra branches). On s390x the added cost is unavoidable but only affects the IO boundaries, not core compute kernels.
  • Future work: Any new Parquet encoding/decoding logic should:
    1. Include parquet/endian_internal.h.
    2. Use PrepareLittleEndianBuffer before writing primitive buffers (only when NeedsEndianConversion<T>::value is true, via if constexpr).
    3. Use DecodeValues / LoadLittleEndianScalar immediately after reading page data, or the ByteStreamSplit helpers plus ConvertLittleEndianInPlace when dealing with byte-stream encoding.
    4. Follow existing patterns for wrapper types (Int96, FixedLenByteArray, etc.) and avoid embedding architecture-specific #ifdefs elsewhere in the code.
