This note captures the background, design choices, and concrete code changes that were made to ensure Arrow’s Parquet implementation behaves correctly on IBM s390x (aka IBM Z) CPUs, which are big-endian by default. It is intentionally verbose so future contributors can quickly understand why these changes exist and what assumptions they rely on.
- Little-endian by design: Both Arrow's default in-memory layout and the Parquet file format specify that primitive values are serialized in little-endian byte order. This assumption holds implicitly on mainstream architectures such as `x86_64` and `arm64`, so the codebase historically relied on raw `memcpy` and direct pointer casts for IO.
- s390x is big-endian: IBM's s390x architecture stores multi-byte words most-significant byte first. Simply copying host-order bytes into a Parquet page (or reading them back) results in flipped values. For example, the 32-bit integer `0x01020304` would be written as `01 02 03 04` on s390x, but Parquet readers expect `04 03 02 01` (illustrated in the sketch after this list).
- Mixed deployments: Arrow / Parquet data frequently crosses machine boundaries. Pages written on s390x must remain interoperable with the overwhelmingly little-endian ecosystem. Likewise, s390x readers must accept metadata and page buffers produced elsewhere.
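To make the byte-order problem concrete, here is a small standalone C++ illustration (not Arrow code) of how the same 32-bit value is laid out in host memory versus the little-endian order Parquet expects on disk:

```cpp
// Standalone illustration: native layout of 0x01020304 versus the
// little-endian bytes Parquet stores on disk.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  const uint32_t value = 0x01020304;

  unsigned char native[4];
  std::memcpy(native, &value, sizeof(value));  // host byte order: 01 02 03 04 on s390x

  const unsigned char little_endian[4] = {
      static_cast<unsigned char>(value & 0xFF),          // 04
      static_cast<unsigned char>((value >> 8) & 0xFF),   // 03
      static_cast<unsigned char>((value >> 16) & 0xFF),  // 02
      static_cast<unsigned char>((value >> 24) & 0xFF),  // 01  <- what Parquet stores
  };

  std::printf("native       : %02x %02x %02x %02x\n", native[0], native[1], native[2], native[3]);
  std::printf("parquet (LE) : %02x %02x %02x %02x\n", little_endian[0], little_endian[1],
              little_endian[2], little_endian[3]);
  return 0;
}
```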
Useful background material:
- Arrow Columnar Specification – Endianness
- Parquet Format Encoding (all primitive encodings assume little-endian order)
- IBM developer docs on s390x architecture and its big-endian memory model
- Build-time opt-in: Rather than performing byte swapping on every platform, we added a CMake option that is automatically enabled on s390x. Other architectures can enable it manually for testing.
- Centralized helpers: A new header, `parquet/endian_internal.h`, provides reusable utilities (`LoadLittleEndianScalar`, `PrepareLittleEndianBuffer`, `ConvertLittleEndianInPlace`, etc.) that encapsulate the conversion logic. This keeps encoder/decoder code readable and ensures all primitives (including `Int96`) follow the same rules.
- Touch every IO path: We audited Parquet's primitive encoders and decoders (plain, dictionary, byte-stream split, boolean RLE, delta binary packed) and wrapped each memory copy with the helpers so data is always stored as little-endian on disk and converted to native endianness in memory.
- Added `ARROW_ENSURE_S390X_ENDIANNESS` (default `ON` when `CMAKE_SYSTEM_PROCESSOR` matches `s390x`, `OFF` otherwise). This flag controls whether the Parquet build injects the new compile definition.
  - Rationale: Allows big-endian safety to be enabled automatically for IBM Z builds while remaining opt-in (zero overhead) for the common little-endian targets.
- When the above option is enabled we now pass `-DPARQUET_INTERNAL_ENSURE_LITTLE_ENDIAN_IO` to every Parquet target via `target_compile_definitions`.
  - Rationale: Keeps the implementation contained inside Parquet sources without modifying the rest of Arrow. The macro acts as the gate for the helpers in the new header.
- Introduces `parquet::internal` helpers (backed by `arrow/util/endian.h` for reliable host-endian detection; sketched after this list):
  - `NeedsEndianConversion<T>` traits that turn on swapping for all integral/floating types (except `bool`) while explicitly skipping Parquet wrapper types like `Int96` and `FixedLenByteArray` that manage their own layout.
  - `ToLittleEndianValue` / `FromLittleEndianValue` for scalar conversions.
  - `LoadLittleEndianScalar`, `DecodeValues`, `PrepareLittleEndianBuffer`, and `ConvertLittleEndianInPlace` to cover write-path preparation and read-path fixups.
- Rationale: Provides a single, well-tested place to reason about endianness rather than sprinkling `#ifdef __s390x__` logic throughout the code. The helpers also make it trivial to cover newly introduced encodings in the future. The conversions are now always enabled on big-endian hosts (even if the explicit s390x flag isn't passed) so reading/writing is canonical by default.
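The following is a minimal, self-contained sketch of the helper shapes described above. The names (`NeedsEndianConversion`, `PrepareLittleEndianBuffer`, and so on) are taken from this note, but the body is illustrative: it uses C++20 `std::endian` and a plain byte reversal in place of `arrow/util/endian.h`, so the real `parquet/endian_internal.h` may differ in detail.

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

namespace parquet_endian_sketch {

// Only plain integral/floating types (except bool) need byte swapping; class
// wrappers such as Int96 / FixedLenByteArray fall outside is_arithmetic.
template <typename T>
struct NeedsEndianConversion
    : std::bool_constant<std::is_arithmetic_v<T> && !std::is_same_v<T, bool>> {};

// Portable byte reversal; stands in for the arrow/util/endian.h machinery.
template <typename T>
T ByteSwapped(T value) {
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  std::reverse(bytes, bytes + sizeof(T));
  std::memcpy(&value, bytes, sizeof(T));
  return value;
}

// Native -> canonical little-endian (a no-op on little-endian hosts).
template <typename T>
T ToLittleEndianValue(T value) {
  if constexpr (std::endian::native == std::endian::big) {
    return ByteSwapped(value);
  } else {
    return value;
  }
}

// Canonical little-endian -> native (symmetric with the above).
template <typename T>
T FromLittleEndianValue(T value) {
  return ToLittleEndianValue(value);
}

// Read one little-endian scalar from raw page bytes into native order.
template <typename T>
T LoadLittleEndianScalar(const uint8_t* src) {
  T value;
  std::memcpy(&value, src, sizeof(T));
  return FromLittleEndianValue(value);
}

// Write path: return a pointer to little-endian bytes, using `scratch` for the
// swapped copies when the host is big-endian and T needs conversion.
template <typename T>
const T* PrepareLittleEndianBuffer(const T* values, int64_t n, std::vector<T>* scratch) {
  if constexpr (NeedsEndianConversion<T>::value && std::endian::native == std::endian::big) {
    scratch->resize(static_cast<size_t>(n));
    for (int64_t i = 0; i < n; ++i) (*scratch)[i] = ByteSwapped(values[i]);
    return scratch->data();
  } else {
    return values;
  }
}

// Read path: rewrite a decoded little-endian buffer into native order.
template <typename T>
void ConvertLittleEndianInPlace(T* values, int64_t n) {
  if constexpr (NeedsEndianConversion<T>::value && std::endian::native == std::endian::big) {
    for (int64_t i = 0; i < n; ++i) values[i] = ByteSwapped(values[i]);
  }
}

}  // namespace parquet_endian_sketch
```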
Key updates (all gated by the helper trait so little-endian builds remain unchanged):
- Plain encoder paths
  - Each `PlainEncoder<T>` instance carries a scratch `std::vector<T>` used to hold little-endian copies before appending to the `BufferBuilder` (see the sketch after this list).
  - The array-based `DirectPutImpl` now uses `PrepareLittleEndianBuffer` so dictionary and spaced writes also emit the correct order.
- Byte-array helpers
  - Length prefixes are serialized with `ToLittleEndianValue` before being copied alongside their payloads. This ensures `uint32` metadata is interoperable.
- Byte Stream Split encoder
  - Uses the same scratch-buffer approach so the column-wise interleaving logic operates on explicitly little-endian sequences.
- Dictionary encoder
  - After `memo_table_.CopyValues`, the scalar dictionary buffer is rewritten in little-endian order prior to spilling to disk.
  - Binary dictionaries (variable-length) now store lengths as little-endian integers.
Rationale: Parquet pages are written exactly once per flush, so doing the swap nearest to the final append avoids touching the rest of the pipeline and keeps hot loops free of branching on little-endian hosts.
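A write-path sketch of the scratch-buffer pattern, reusing the `parquet_endian_sketch` helpers from the previous snippet. `PlainEncodePut` and the raw byte vector standing in for the page are hypothetical; the real `PlainEncoder<T>` appends to a `BufferBuilder` instead:

```cpp
#include <cstdint>
#include <vector>

template <typename T>
void PlainEncodePut(const T* src, int64_t num_values, std::vector<uint8_t>* page,
                    std::vector<T>* le_scratch) {
  // On big-endian hosts this fills le_scratch with byte-swapped copies and
  // returns its data; on little-endian hosts it returns src unchanged.
  const T* canonical =
      parquet_endian_sketch::PrepareLittleEndianBuffer(src, num_values, le_scratch);
  const auto* bytes = reinterpret_cast<const uint8_t*>(canonical);
  page->insert(page->end(), bytes, bytes + num_values * static_cast<int64_t>(sizeof(T)));
}
```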
Mirrors the encoder fixes for the read path:
- Plain decoder
  - Adds a temporary vector used when `NeedsEndianConversion<T>` is true. Data is read from the page buffer with `DecodeValues` / `LoadLittleEndianScalar` before being appended to Arrow builders so the consumer always sees native-endian values. `DecodePlain<T>` leverages the helper to rewrite the destination buffer directly (see the sketch after this list).
- ByteArray decoding
  - Length checks now read the prefix via `LoadLittleEndianScalar<int32_t>` so the existing range validation still applies on big-endian machines.
- Dictionary decoder
  - Scalar dictionaries decode indices as before, but values fetched from the dictionary buffer are converted via the helper before being pushed into Arrow builders.
- ByteStreamSplit decoder
  - After interleaving bytes, `ConvertLittleEndianInPlace` runs over the decoded buffer (both for Arrow builders and for direct `Decode` calls), making the results uniform with other encodings. This is critical now that the byte stream encoder/decoder explicitly operate on canonical little-endian payloads.
- Boolean and byte-array paths
  - Boolean RLE length prefixes already used `FromLittleEndian`; the new helper keeps the behavior consistent.
Rationale: Readers must accept data written on any architecture. Performing the conversion immediately after extracting bytes from a Parquet page keeps Arrow’s downstream algorithms agnostic to machine endianness.
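The read-path counterpart, again built on the sketch helpers above. `DecodePlainSketch` is illustrative rather than the actual `DecodePlain<T>` implementation, but it shows the order of operations: copy raw page bytes, then convert in place to native order.

```cpp
#include <cstdint>
#include <cstring>

template <typename T>
int64_t DecodePlainSketch(const uint8_t* page, int64_t page_len, T* out, int64_t num_values) {
  const int64_t bytes_needed = num_values * static_cast<int64_t>(sizeof(T));
  if (bytes_needed > page_len) return 0;  // the real code raises a Parquet exception here
  std::memcpy(out, page, static_cast<size_t>(bytes_needed));
  // No-op on little-endian hosts; byte-swaps every value on s390x.
  parquet_endian_sketch::ConvertLittleEndianInPlace(out, num_values);
  return bytes_needed;
}
```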
- `GetBatch()` SIMD guard: The ByteStreamSplit/Delta machinery relies on Arrow's generic bit unpacker. All vectorized `unpack32` / `unpack64` implementations assume their input words are little-endian. We now wrap those calls in `#if ARROW_LITTLE_ENDIAN` and fall back to the portable scalar path on big-endian hosts. This prevents the unpackers from producing byte-swapped deltas (the root cause of the final INT64 delta failure).
- `GetVlqInt()` cache normalization: When enough bytes are buffered we parse VLQs straight from the cached 64-bit word. That cache is stored in host byte order to keep bit twiddling fast, so on s390x the VLQ parser previously read reversed bytes. We now convert the cached word with `bit_util::ToLittleEndian` before exposing it to the parser, ensuring header fields (such as delta block sizes) are decoded identically on every architecture.
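The shape of both guards, sketched with stand-ins: the unpack functions below are declarations only, not Arrow APIs; `ARROW_LITTLE_ENDIAN` normally comes from `arrow/util/endian.h`; and the cache normalization reuses the sketch helpers above in place of `bit_util::ToLittleEndian`.

```cpp
#include <cstdint>

// Declaration-only stand-ins for Arrow's internal unpackers.
int Unpack32Vectorized(const uint32_t* in, uint32_t* out, int batch, int bit_width);  // SIMD path
int Unpack32Scalar(const uint32_t* in, uint32_t* out, int batch, int bit_width);      // portable path

inline int Unpack32Guarded(const uint32_t* in, uint32_t* out, int batch, int bit_width) {
#if ARROW_LITTLE_ENDIAN
  // The vectorized unpackers assume little-endian input words.
  return Unpack32Vectorized(in, out, batch, bit_width);
#else
  // Big-endian hosts (s390x) fall back to the portable scalar implementation.
  return Unpack32Scalar(in, out, batch, bit_width);
#endif
}

// GetVlqInt() cache normalization: the 64-bit cache stays in host order for
// fast bit twiddling, but the byte-oriented VLQ parser must see canonical
// little-endian bytes (the production code uses bit_util::ToLittleEndian).
inline uint64_t NormalizedVlqCache(uint64_t host_order_cache) {
  return parquet_endian_sketch::ToLittleEndianValue(host_order_cache);
}
```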
- Encoder: Every INT32/INT64/FLOAT/DOUBLE value goes through `parquet::internal::PrepareLittleEndianBuffer` before `::arrow::util::internal::ByteStreamSplitEncode` is called. This produces the canonical little-endian byte streams mandated by the Parquet spec (apache/parquet-format#192), fixing the `CheckOnlyEncode` failures that compared host-endian output against golden data (see the sketch after this list).
- Decoder: Immediately after `ByteStreamSplitDecode`, we invoke `parquet::internal::ConvertLittleEndianInPlace` so the reconstructed little-endian payloads are reinterpreted as native values on s390x. Little-endian builds continue to run the fast path with zero overhead.
- Tests: `TestByteStreamSplitEncoding::CheckEncode` now feeds native-order input values (rather than pre-swapped `ToLittleEndian()` buffers) so the encode-only tests exercise the same conversion logic used in production. The encoder under test is responsible for writing little-endian bytes per the spec. Likewise, `CheckDecode` expects native-order scalars so the decoder must convert the canonical byte stream back into platform representation. These adjustments avoid double-swapping on big-endian hosts now that the implementation itself is endian-safe.
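A tiny standalone BYTE_STREAM_SPLIT sketch for FLOAT values showing the ordering of the two conversions: the canonical stream is built from little-endian value bytes before interleaving, and the decode side converts back to native order after de-interleaving. Both functions are illustrative and reuse the sketch helpers above:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

inline std::vector<uint8_t> ByteStreamSplitEncodeSketch(const float* values, size_t n) {
  std::vector<float> scratch;
  const float* le = parquet_endian_sketch::PrepareLittleEndianBuffer(
      values, static_cast<int64_t>(n), &scratch);
  const auto* bytes = reinterpret_cast<const uint8_t*>(le);
  std::vector<uint8_t> stream(n * sizeof(float));
  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < sizeof(float); ++b) {
      stream[b * n + i] = bytes[i * sizeof(float) + b];  // column-wise interleaving
    }
  }
  return stream;
}

inline void ByteStreamSplitDecodeSketch(const uint8_t* stream, size_t n, float* out) {
  auto* bytes = reinterpret_cast<uint8_t*>(out);
  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < sizeof(float); ++b) {
      bytes[i * sizeof(float) + b] = stream[b * n + i];  // reassemble LE values
    }
  }
  // Reinterpret the little-endian payload as native values (no-op on LE hosts).
  parquet_endian_sketch::ConvertLittleEndianInPlace(out, static_cast<int64_t>(n));
}
```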
- BlockSplitBloomFilter bitsets: Parquet's block Bloom filters store their 32-bit words in little-endian order on disk (matching parquet-mr). The C++ implementation was previously reinterpreting those words directly, which works on x86 but results in byte-swapped lookups and serialized output on s390x. We now load each 32-bit word with `arrow::bit_util::FromLittleEndian` before testing/setting bits, and write the updated value back with `ToLittleEndian`. This makes `FindHash` succeed against parquet-mr bloom filters (e.g., `bloom_filter.xxhash.bin`) and ensures bloom filters we write match the Java reference implementation byte for byte (see the sketch after this list).
- Round-trip note: `BasicRoundTrip` / `RoundTripSingleElement` / `RoundTripSpace` already passed on s390x because encode+decode both used host order. The explicit little-endian conversions keep those tests green while also matching the canonical test vectors for encode-only and decode-only scenarios.
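A simplified sketch of the per-word handling described above. The real `BlockSplitBloomFilter` operates on eight-word blocks with per-word masks; `TestAndSetBitSketch` is a hypothetical stand-in that only shows the load/convert/store pattern, built on the sketch helpers above:

```cpp
#include <cstdint>
#include <cstring>

inline bool TestAndSetBitSketch(uint8_t* bitset, size_t word_index, uint32_t mask) {
  // Load the on-disk little-endian word and convert it to native order.
  uint32_t word_le;
  std::memcpy(&word_le, bitset + word_index * sizeof(uint32_t), sizeof(uint32_t));
  uint32_t word = parquet_endian_sketch::FromLittleEndianValue(word_le);

  const bool was_set = (word & mask) == mask;
  word |= mask;

  // Store the updated word back in canonical little-endian order.
  const uint32_t out_le = parquet_endian_sketch::ToLittleEndianValue(word);
  std::memcpy(bitset + word_index * sizeof(uint32_t), &out_le, sizeof(uint32_t));
  return was_set;
}
```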
- Validity bitmaps: `GreaterThanBitmap` produces canonical little-endian words. Before writing those bits into the Arrow validity buffer we convert them back to host order so `FirstTimeBitmapWriter` (which already performs the little-endian normalization) doesn't undo the swap on big-endian machines. Without this fix, the validity bitmap and the reported `null_count` could disagree, causing `SpacedExpandRightward` assertions during optional column decoding.
- Decimal serialization: Arrow stores Decimal32/64/128/256 values as two's-complement words in the host's native byte order. The Parquet writer reinterpreted those buffers as `uint64_t` words and fed them directly into `ToBigEndian`, which only works on little-endian CPUs. On s390x the words were read with reversed significance, so CDC-written files reconstructed different decimal values. We now guard the conversion: little-endian builds keep the `FromLittleEndian` → `ToBigEndian` pipeline, while big-endian builds copy the native bytes directly because Arrow's in-memory layout already matches Parquet's big-endian requirement. This avoids double-swapping on s390x and keeps decimal pages identical across all architectures.
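A minimal sketch of the guarded conversion, treating the decimal storage as a single native-endian 128-bit two's-complement integer. The production guard keys off `ARROW_LITTLE_ENDIAN`; this standalone version checks C++20 `std::endian`, and it omits FLBA width handling and sign-byte trimming:

```cpp
#include <algorithm>
#include <array>
#include <bit>
#include <cstdint>
#include <cstring>

inline std::array<uint8_t, 16> DecimalToBigEndianSketch(const uint8_t* native_bytes) {
  std::array<uint8_t, 16> out;
  std::memcpy(out.data(), native_bytes, out.size());
  if constexpr (std::endian::native == std::endian::little) {
    // Little-endian host: reverse to obtain the big-endian on-disk bytes
    // (equivalent to the FromLittleEndian -> ToBigEndian word pipeline).
    std::reverse(out.begin(), out.end());
  }
  // Big-endian host (s390x): the in-memory bytes already match Parquet's
  // big-endian requirement, so the plain copy above is already correct.
  return out;
}
```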
- Level decoder length prefix: The RLE/bit-packed level stream begins with a 4-byte little-endian length. `LevelDecoder::SetData` previously read this prefix in host byte order, so s390x builds interpreted values like `02 00 00 00` as `0x02000000`, blowing past `data_size` and throwing "Received invalid number of bytes (corrupt data page?)" when opening parquet-testing decimal files. The length is now loaded as a `uint32_t` and passed through `bit_util::FromLittleEndian`, restoring canonical decoding on big-endian hosts without affecting little-endian behavior.
- Little-endian RLE prefixes: Data page V1 stores definition/repetition levels by prefixing the RLE body with a 4-byte little-endian length. Our writer simply wrote `len()` via `reinterpret_cast<int32_t*>`, which emitted host-endian bytes. After fixing the reader we saw CDC regressions because self-generated pages still carried big-endian prefixes on s390x. `RleEncodeLevels` now uses `bit_util::ToLittleEndian` plus `SafeStore` to serialize the prefix so writer and reader follow the same spec-mandated layout.
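Both sides of the 4-byte level-length prefix, sketched with the helpers above. `WriteLevelLengthPrefix` / `ReadLevelLengthPrefix` are hypothetical names; the production code lives in `RleEncodeLevels` and `LevelDecoder::SetData` and uses `bit_util` plus `SafeStore`:

```cpp
#include <cstdint>
#include <cstring>

inline void WriteLevelLengthPrefix(uint8_t* dest, int32_t rle_len) {
  const int32_t le = parquet_endian_sketch::ToLittleEndianValue(rle_len);
  std::memcpy(dest, &le, sizeof(le));  // canonical little-endian prefix
}

inline int32_t ReadLevelLengthPrefix(const uint8_t* src) {
  uint32_t le;
  std::memcpy(&le, src, sizeof(le));
  // 02 00 00 00 must decode as 2, not 0x02000000, on big-endian hosts.
  return static_cast<int32_t>(parquet_endian_sketch::FromLittleEndianValue(le));
}
```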
- Spec-compliant helpers (e.g., `column_io_benchmark.cc`): The unit tests and benchmarks that hand-roll encoded level buffers also wrote raw host-endian prefixes. These helpers now store the prefix with `bit_util::ToLittleEndian` (and `memcpy`) so they continue to validate and benchmark the production code paths on any architecture.
- Canonical statistic bytes: The gtests compare `EncodeValue(value)` against `stats->EncodeMin()` / `EncodeMax()`. The helper used to `memcpy` host-order bytes, so every integer/float assertion failed on big-endian systems even though the statistics encoder already emitted little-endian payloads. `EncodeValue` now calls `bit_util::ToLittleEndian`, guaranteeing the test harness produces the same on-wire representation the writer emits.
- Sort-order fixtures respect the LE spec: `TestStatisticsSortOrder.MinMax` used `reinterpret_cast` to copy native ints/floats into `stats_`, which silently relied on host byte order. On s390x the reference min/max blobs therefore disagreed with `EncodeMin()` / `EncodeMax()` even though the production code was correct. The test now routes the fixtures through a small `EncodeAsLittleEndianBytes` helper (backed by `bit_util::ToLittleEndian`) so the expectations always describe canonical Parquet payloads.
- Runtime stats reuse EncodeValue logic: The encode/decode helpers now reuse the exact same bit packing as `EncodeValue` in statistics_test, so INT32/64 and FLOAT/DOUBLE min/max payloads are byte-for-byte identical to the canonical expectations on both LE and BE hosts.
- Integer/float stats encode canonical bytes: The runtime stats were still emitting host-order min/max buffers by piping values through the plain encoder. On s390x that left the `Int32/Int64Extrema` and `FloatStatistics.*` suites comparing LE fixture bytes against BE stats payloads. The stat specialisations now bypass the generic encoder and copy the raw INT32/INT64/FLOAT/DOUBLE bits into a string after running them through `bit_util::ToLittleEndian`. Likewise, `PlainDecode` uses `FromLittleEndian` so negative zeros and NaNs preserve their IEEE 754 ordering regardless of host endianness.
- Decoding for printing: `FormatNumericValue`, `FormatDecimalValue`, and the INT96 branch of `FormatStatValue` previously interpreted the raw byte blobs in host endianness. On s390x that swapped min/max strings and even reordered INT96 components ("2048 1024 4096"). These helpers now load each 32/64-bit word in little-endian order via `bit_util::FromLittleEndian`, so statistics printed in `parquet-internals-test` (and the generic type printer) match the canonical representation on every architecture.
- Host-order literals vs. little-endian encodings: `TypePrinter.StatisticsTypes` creates raw strings that mimic encoded statistics and feeds them into `FormatStatValue`. The test had been writing those strings by reinterpreting native ints / floats / decimal integers / INT96 structs in place, which only matched the Parquet encoding on little-endian hosts. On s390x the same bytes describe different numeric values, so the assertions failed even though `FormatStatValue` behaved correctly. The test now uses `::arrow::bit_util::ToLittleEndian` to serialize the INT32/INT64 values (including their decimal variants), the FLOAT/DOUBLE bit patterns, and each 32-bit lane of the INT96 fixtures before passing them to `FormatStatValue`, making the test data match the spec and preventing future regressions when inspecting stats produced on big-endian machines. This change documents the intent: we are validating the formatting code against canonical little-endian stat payloads, not the host's in-memory layout.
- Page index round-trip expectations: The page-index tests use helper lambdas such as `encode_int64` / `encode_double` to build the expected byte strings for column-index min/max values. Those helpers previously reinterpreted the host representation (again, little-endian by accident on x86). They now serialize through `::arrow::bit_util::ToLittleEndian`, mirroring how the writer stores min/max inside the page index, so the round-trip assertions compare canonical Parquet payloads rather than architecture-specific layouts.
- Fragment stats fixtures encode canonical bytes: `TestParquetStatistics.NoNullCount` fabricates `EncodedStatistics` blobs by copying host-order `int32_t` values straight into the metadata shim. Once we taught the runtime to expect little-endian statistics, those fixtures started decoding as `16777216` / `1677721600` on s390x, even though the dataset predicate builder had nothing to do with the regression. The helper now serializes all INT32 stats through `::arrow::bit_util::ToLittleEndian` before passing them to `parquet::Statistics::Make`, ensuring the test mimics real Parquet metadata on both LE and BE hosts. As a result, `ParquetFileFragment::EvaluateStatisticsAsExpression` sees the same `[1, 100]` range everywhere and the dataset-level filter rewrites stay valid.
- Comparator fixtures use canonical words: The `Comparison.SignedInt96` and `Comparison.UnsignedInt96` suites built `Int96` values with brace-initializers such as `{{1, 41, 14}}`, which only describe the intended `[lo32, hi32, day]` words if the host is little-endian. On s390x those aggregates left the comparator seeing byte-swapped values and the unsigned ordering tests failed. The runtime helpers (`Int96SetNanoSeconds`, `Int96SetJulianDay`, `DecodeInt96Timestamp`) now always write/read little-endian words, the comparator decodes each 32-bit lane via `bit_util::FromLittleEndian`, a tiny `MakeCanonicalInt96` helper runs each fixture lane through `bit_util::ToLittleEndian`, and the timestamp checks set the day via `Int96SetJulianDay`, so both signed and unsigned comparator tests target the canonical layout.
- DecodeInt96Timestamp inputs mirror production layout: `TestInt96Timestamp.Decoding` previously stuffed host-order words directly into `Int96::value[0..2]`. After the runtime switched to canonical little-endian storage this fed `Int96GetNanoSeconds` garbage on s390x (e.g., reporting `5976838244624236544` instead of the expected `-9223286400000000000`). The test now uses `Int96SetJulianDay` together with a little-endian memcpy of the 64-bit nanoseconds-of-day field, preserving the original logical expectations (including the `0xffffffffffffffff` wraparound case) on every host.
- Level RLE headers are canonical: The mock `DataPageBuilder` that feeds the record reader tests wrote the 4-byte RLE length prefix for definition/repetition levels by memcpy'ing a host-order `int32_t`. On s390x that produced `0x0D000000`-style lengths, so `LevelDecoder::SetData` rejected the page as "corrupt". The builder now runs the length through `bit_util::ToLittleEndian` before writing it, matching the spec and keeping the parquet-reader tests architecture-independent.
- Impala Int96 conversion: `NanosecondsToImpalaTimestamp` copied the 64-bit time-of-day chunk directly into the first two `Int96::value` lanes. On big-endian hosts this swapped the order of those 32-bit words relative to the Impala layout, breaking `TestImpalaConversion.ArrowTimestampToImpalaTimestamp`. The helper now explicitly splits the raw nanosecond bit pattern into low/high 32-bit words and writes them into `value[0]` / `value[1]`, preserving the correct `[lo32, hi32, julian_day]` layout on every architecture. The Int96 helper functions now always write and read canonical little-endian words (including a dedicated `Int96SetJulianDay`), and the column writer stores both the Julian day word and the nanoseconds-of-day words via `bit_util::ToLittleEndian`, so the on-disk 12-byte payload matches Impala and other implementations on every host while the reader keeps decoding symmetrically via `FromLittleEndian`. We also teach the new endian helpers that `Int96` requires conversion by byte-swapping each 32-bit lane when crossing the Parquet IO boundary, so deprecated INT96 reads/writes (and their Arrow compatibility tests) once again observe canonical timestamps instead of host-endian garbage. To keep the hand-written tests and the Arrow reader in sync, `Int96SetNanoSeconds` now mirrors the writer layout and `DecodeInt96Timestamp` reconstructs the nanosecond payload by combining the explicit low/high words instead of assuming a raw 64-bit `memcpy`, which restores the historical semantics checked by the `ArrowUseDeprecatedInt96`, downsampling, and nested round-trip suites.
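A sketch of the canonical `[lo32, hi32, julian_day]` layout with each 32-bit lane stored little-endian. `Int96Sketch` and its accessors are illustrative stand-ins for the real `Int96` helpers (`Int96SetNanoSeconds`, `Int96SetJulianDay`, `Int96GetNanoSeconds`), built on the sketch helpers above:

```cpp
#include <cstdint>

struct Int96Sketch {
  uint32_t value[3];  // on-disk lanes, each stored little-endian
};

inline void SetNanoSecondsSketch(Int96Sketch* i96, int64_t nanos_of_day) {
  const auto bits = static_cast<uint64_t>(nanos_of_day);
  i96->value[0] =
      parquet_endian_sketch::ToLittleEndianValue(static_cast<uint32_t>(bits & 0xFFFFFFFFu));
  i96->value[1] =
      parquet_endian_sketch::ToLittleEndianValue(static_cast<uint32_t>(bits >> 32));
}

inline void SetJulianDaySketch(Int96Sketch* i96, uint32_t julian_day) {
  i96->value[2] = parquet_endian_sketch::ToLittleEndianValue(julian_day);
}

inline int64_t GetNanoSecondsSketch(const Int96Sketch& i96) {
  // Recombine the explicit low/high words instead of a raw 64-bit memcpy.
  const uint64_t lo = parquet_endian_sketch::FromLittleEndianValue(i96.value[0]);
  const uint64_t hi = parquet_endian_sketch::FromLittleEndianValue(i96.value[1]);
  return static_cast<int64_t>(lo | (hi << 32));
}
```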
Half-float reads: The bridge previously read Parquet HALF_FLOAT payloads as fixed-size binary chunks and then called
ChunkedArray::Viewto interpret them as Arrow HalfFloat arrays, but it never converted the 2-byte little-endian words into the host representation. On s390x every half value showed up byte-swapped (e.g.,0x3C00→0x003C). The transfer path now swaps the bytes for each chunk when!ARROW_LITTLE_ENDIANbefore invokingView, so Arrow’sfloat16::bits()matches the on-disk encoding. -
Half-float writes: Likewise, writing HalfFloat columns used to forward the host-order 16-bit payloads directly into Parquet’s FLBA encoder. On x86 the bytes were already little-endian, but on s390x the writer emitted reversed payloads that the (now fixed) reader interpreted as garbage. The HalfFloat serializer now copies each value into a scratch buffer after running it through
bit_util::ToLittleEndian, so the on-disk bytes are canonical regardless of host architecture.
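A sketch of the 2-byte swap applied on both the read and write paths. The production code keys off `!ARROW_LITTLE_ENDIAN`; this standalone version uses C++20 `std::endian` and a hypothetical `SwapHalfFloatsInPlace` helper. On s390x it turns `0x003C` back into `0x3C00`:

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>

inline void SwapHalfFloatsInPlace(uint16_t* values, size_t n) {
  if constexpr (std::endian::native == std::endian::big) {
    for (size_t i = 0; i < n; ++i) {
      values[i] = static_cast<uint16_t>((values[i] << 8) | (values[i] >> 8));
    }
  } else {
    // No-op on little-endian hosts: the in-memory bits already match disk.
    (void)values;
    (void)n;
  }
}
```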
- Canonical WKB fixtures: `test::MakeWKBPoint` emitted the geometry type and coordinate doubles using `kWkbNativeEndianness`, so s390x runs produced big-endian payloads even though the Parquet geospatial reader only accepts ISO WKB encoded as little-endian (the format mandated by the spec for parquet.geo). This made every `TestGeometryUtil` / `WKBTestFixture.TestWKBBounderBounds/*` case fail with swapped geometry type IDs such as `0x01000000`. The helper now always writes the byte-order flag as `0x01` and runs each integer/double through `bit_util::ToLittleEndian` before copying into the buffer, guaranteeing that the tests feed spec-compliant WKB on every architecture. The inverse utility (`GetWKBPointCoordinateXY`) mirrors this by validating the little-endian flag and decoding the geometry type/coordinates with `bit_util::FromLittleEndian`, so the round-trip helpers agree on a single canonical encoding (see the sketch after this list).
- Bounder respects the byte-order flag: `WKBGeometryBounder` previously used `#if defined(ARROW_LITTLE_ENDIAN)` when deciding whether to swap the geometry type and coordinate fields. On big-endian builds the macro is defined (to `0`), so the "little endian" branch executed unconditionally and little-endian WKB payloads were never byte-swapped. Every geometry type therefore decoded to values like `16777216`, and the geospatial statistics stayed empty (`is_valid() == false`). The bounder now checks the value of `ARROW_LITTLE_ENDIAN` instead of just its presence, so we only skip the byteswap on true little-endian hosts; s390x correctly swaps the geometry header and the stats tests pass.
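A sketch of the canonical little-endian WKB point encoding the fixtures now produce. `MakeWKBPointSketch` is a hypothetical stand-in for `test::MakeWKBPoint`, reusing the sketch helpers above; it writes the `0x01` byte-order flag, then the geometry type and coordinates in little-endian order:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

inline std::vector<uint8_t> MakeWKBPointSketch(double x, double y) {
  std::vector<uint8_t> wkb;
  wkb.push_back(0x01);  // ISO WKB little-endian byte-order flag

  // Geometry type 1 = Point, serialized as a little-endian uint32.
  const uint32_t geom_type_le = parquet_endian_sketch::ToLittleEndianValue(uint32_t{1});
  const auto* t = reinterpret_cast<const uint8_t*>(&geom_type_le);
  wkb.insert(wkb.end(), t, t + sizeof(geom_type_le));

  // Coordinates serialized as little-endian IEEE 754 doubles.
  const double coords[2] = {x, y};
  for (double coord : coords) {
    const double le = parquet_endian_sketch::ToLittleEndianValue(coord);
    const auto* c = reinterpret_cast<const uint8_t*>(&le);
    wkb.insert(wkb.end(), c, c + sizeof(le));
  }
  return wkb;
}
```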
- Building on s390x: Simply configure as usual. The CMake logic detects `CMAKE_SYSTEM_PROCESSOR=s390x`, flips `ARROW_ENSURE_S390X_ENDIANNESS` to `ON`, and compiles Parquet with the necessary flag.
- Testing locally (emulation / CI): You can force the conversions on any platform by passing `-DARROW_ENSURE_S390X_ENDIANNESS=ON` to `cmake ..`. This is useful for CI jobs that want to exercise the code paths without native hardware.
- Performance considerations: On little-endian CPUs the new helpers compile away to the previous behavior (no copies, no extra branches). On s390x the added cost is unavoidable but only affects the IO boundaries, not core compute kernels.
- Future work: Any new Parquet encoding/decoding logic should:
  - Include `parquet/endian_internal.h`.
  - Use `PrepareLittleEndianBuffer` before writing primitive buffers (only when `NeedsEndianConversion<T>::value` is `true`, via `if constexpr`).
  - Use `DecodeValues` / `LoadLittleEndianScalar` immediately after reading page data, or the ByteStreamSplit helpers plus `ConvertLittleEndianInPlace` when dealing with byte-stream encoding.
  - Follow existing patterns for wrapper types (`Int96`, `FixedLenByteArray`, etc.) and avoid embedding architecture-specific `#ifdef`s elsewhere in the code.
- IBM LinuxONE / s390x overview (good background on the memory model): https://www.ibm.com/docs/en/linux-on-systems?topic=systems-linux-one
- C++ background reading on endianness and `std::byteswap` (C++23): https://en.cppreference.com/w/cpp/numeric/byteswap
- Detailed explanation of Parquet's physical types (useful to understand why `Int96` / `FixedLenByteArray` are handled specially): https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Feel free to extend this document as we cover more subsystems (e.g., Arrow IPC, CSV reader) with explicit big-endian handling. The more thorough we are in documenting rationale now, the easier future reviews and refactors will be.