Skip to content

refactor: Trivial cleanup in FieldReader, FieldWriter, StreamUtil (#16485)#1727

Open
prestodb-ci wants to merge 62 commits intooss-baselinefrom
staging-4e7d8194f-pr
Open

refactor: Trivial cleanup in FieldReader, FieldWriter, StreamUtil (#16485)#1727
prestodb-ci wants to merge 62 commits intooss-baselinefrom
staging-4e7d8194f-pr

Conversation

@prestodb-ci
Copy link
Collaborator

Test PR for branch staging-4e7d8194f-pr with head 4e7d819

xiaoxmeng and others added 30 commits February 13, 2026 10:54
Summary:
Pull Request resolved: facebookincubator#16360

X-link: facebookincubator/nimble#480

This diff implements stripe-level batched index read support for Nimble files in Velox. The key motivation is to improve index lookup performance by processing multiple lookup requests in batches at the stripe level, rather than processing each request individually.

**New `SelectiveNimbleIndexReader` class**: A format-specific index reader that handles:
   - Encoding index bounds into Nimble-specific encoded keys
   - Looking up stripes and row ranges using the tablet index
   - Managing stripe iteration and data reading with batched processing
   - Returning results in request order via an iterator pattern (`startLookup`/`hasNext`/`next`)

**Batched stripe processing**: Instead of loading stripes per-request, the reader:
   - Maps all lookup requests to their matching stripes upfront
   - Merges overlapping row ranges within stripes for efficient reading (without filters)
   - Splits overlapping ranges into non-overlapping segments to preserve filter semantics (with filters)
   - Tracks output references with ref-counting to share read data across requests

**HiveIndexReader refactoring**: Simplified to focus on index bounds creation and result assembly, delegating format-specific control logic to `SelectiveNimbleIndexReader`.

**KeyEncoder enhancements**: Added support for encoding index bounds with constant values for efficient multi-row range queries, and extended test coverage for edge cases.

**New runtime stats**: Added metrics for tracking index lookup performance:
   - `kNumIndexLookupRequests`: Total lookup requests processed
   - `kNumIndexLookupStripes`: Number of stripes accessed
   - `kNumIndexLookupReadSegments`: Number of read segments processed

Reviewed By: HuamengJiang, tanjialiang

Differential Revision: D92848948

fbshipit-source-id: d6f5c78ef23dca1d4d8faa2d27acf85f77612805
)

Summary:
Fixes commit message correctness for multi-file writes by updating Iceberg writer rotation handling and tracking per-file row counts and emitting one commit task per rotated file.
Fix parquet writer does not honor smaller values of `max-target-file-size` config.

These are the adjustments required after facebookincubator#16077.

Pull Request resolved: facebookincubator#16365

Reviewed By: Yuhta

Differential Revision: D93157234

Pulled By: jainxrohit

fbshipit-source-id: f09cd1a01cb2ee63172f233d9fc8234688037fb6
Summary:
## Description

Supersedes facebookincubator#16223, facebookincubator#16036, and facebookincubator#16160

This PR enables Velox to use updated (chunked) hybrid scan APIs as well as do smarter IO when possible.

## Checklist
- [x] All pre-requisite PRs merged
- [x] Build succeeds
- [x] Velox-cuDF tests pass
- [x] I am familiar with the contribution guide

Pull Request resolved: facebookincubator#16226

Reviewed By: kevinwilfong

Differential Revision: D92444137

Pulled By: peterenescu

fbshipit-source-id: 113d6dc120f8f09db04cbb0d5f881ce1d09064c1
…ebookincubator#16385)

Summary:
VectorHasher has two tryMapToRange template specializations:
`tryMapToRange<bool>` and `tryMapToRange<StringView>`.  The
`tryMapToRange<StringView>` template specialization was declared
twice. This patch removes the redundant declaration.

No functional changes.

Pull Request resolved: facebookincubator#16385

Reviewed By: kKPulla

Differential Revision: D93285257

Pulled By: jainxrohit

fbshipit-source-id: ce314691513cae6bfef8c86736c547e24d376e04
…ebookincubator#16293)

Summary:
Root cause: On the scalar (non-AVX2) path, StringDictionaryColumnVisitor::process() was
 calling filterPassed(index) in the "no dictionary filter" case. That passed raw dictionary
 indices (int32/64) into ValueHook, which for TestingHook<StringView> in E2EFilterTestBase
 triggers an INVALID_STATE because it expects string values, not integers. This path is only
 exercised on non-AVX2 architectures (e.g. ARM), so x86 with AVX2 was masked by the SIMD fast path.

 Fix: In StringDictionaryColumnVisitor::process(), when !DictSuper::hasFilter() and kHasHook is true,
 call values_.addValue(rowIndex + numValuesBias_, valueInDictionary(index)) so the hook receives
 decoded StringView values. For the no-hook case we keep the original behavior (filterPassed(index)),
 and the dictionary-filtering branch is left unchanged.

 Rationale: This aligns the scalar path with the existing SIMD bulk path (processRun),
 which already decodes indices and feeds StringView values to ValueHook. As a result,
 both DWRF and Parquet string/varbinary dictionary readers now deliver decoded values to hooks on
 ARM, fixing the failures in velox_dwrf_e2e_filter_test (stringDictionary) and velox_parquet_e2e_filter_test
 (stringDictionary, dedictionarize, varbinaryDictionary) without changing behavior for non-hooked scans or
 x86 fast-path behavior.

fix facebookincubator#12954

Pull Request resolved: facebookincubator#16293

Reviewed By: Yuhta

Differential Revision: D92783998

Pulled By: jainxrohit

fbshipit-source-id: 589b5a9b02835a66df803c99b88e89084a9ed2c2
…ching source vector (facebookincubator#16243)

Summary:
Pull Request resolved: facebookincubator#16243

Adds function `BaseVector::createLike` to BaseVector's vector creation suite to allow empty vector creation matching a source vector. This will help in creating template vectors preserving flat map encoding and will adds capability for other encoding preserve template vectors.

Reviewed By: pedroerp

Differential Revision: D92333447

fbshipit-source-id: 1a444b48d175df788722cff454262ec31f6c90c7
…incubator#16390)

Summary:
Pull Request resolved: facebookincubator#16390

Allow nested flat maps to copy and preserve encodings. Prior to this change, copy of `MAP<type1, MAP<type2, type3>>` would always generate an inner map of type `MAP`, which may break copying into a `FLAT_MAP`.

Reviewed By: pedroerp

Differential Revision: D93186699

fbshipit-source-id: 20e6080787f37d8f751ef83ac9fcc33f87faf196
…facebookincubator#16328)

Summary:
Add a new QueryConfig option `spark.collect_list.ignore_nulls` (default: `true`) to control null handling in the Spark `collect_list` aggregate function.

When set to `false` (RESPECT NULLS), null values are included in the output array instead of being skipped. This aligns with Spark 4.2+ RESPECT NULLS / IGNORE NULLS syntax ([SPARK-55256](https://issues.apache.org/jira/browse/SPARK-55256)).

## Changes

- **`velox/core/QueryConfig.h`**: Add `kSparkCollectListIgnoreNulls` config constant and `sparkCollectListIgnoreNulls()` accessor
- **`velox/exec/SimpleAggregateAdapter.h`**: Add public `fn()` accessor to allow factory lambdas to configure FUNC members after construction (follows the pattern used by Presto's array_agg)
- **`velox/functions/sparksql/aggregates/CollectListAggregate.cpp`**:
  - Add `ignoreNulls_` member to `CollectListAggregate`
  - Pass config to `AccumulatorType` via the fn pointer
  - Remove static `toIntermediate()` (cannot access runtime config; see note below)
  - Factory lambda reads config and sets `ignoreNulls_` via `fn()` accessor
- **`velox/functions/sparksql/aggregates/tests/CollectListAggregateTest.cpp`**: Add 3 new tests:
  - `respectNulls`: global aggregation with nulls included
  - `respectNullsGroupBy`: group-by aggregation with nulls included
  - `respectNullsAllNulls`: all-null input produces array of nulls (not empty array)
- **`velox/docs/functions/spark/aggregate.rst`**: Updated documentation

## Note on `toIntermediate` removal

The original `toIntermediate()` was a static method that always skipped nulls. Since it's called as `FUNC::toIntermediate()` via SFINAE detection in `SimpleAggregateAdapter`, it cannot access instance-level `ignoreNulls_` config. Removing it ensures correctness for RESPECT NULLS at the cost of using the standard accumulator path for partial aggregation.

Pull Request resolved: facebookincubator#16328

Test Plan:
All 7 tests pass (4 existing + 3 new):
```
[  PASSED  ] 7 tests.
  CollectListAggregateTest.groupBy
  CollectListAggregateTest.global
  CollectListAggregateTest.ignoreNulls
  CollectListAggregateTest.allNullsInput
  CollectListAggregateTest.respectNulls
  CollectListAggregateTest.respectNullsGroupBy
  CollectListAggregateTest.respectNullsAllNulls
```

Reviewed By: kevinwilfong

Differential Revision: D93285699

Pulled By: jainxrohit

fbshipit-source-id: b1aaba25494aa8c31d37b43dc3c3501793172051
Summary:
Implement key_sampling_percent as part of facebookincubator#16041

Pull Request resolved: facebookincubator#16113

Reviewed By: kgpai

Differential Revision: D93118512

Pulled By: jainxrohit

fbshipit-source-id: b3b2731b559ea1e48412ba06c172e969592e1de8
Summary:
Pull Request resolved: facebookincubator#16398

X-link: facebookincubator/nimble#490

This diff adds a `maxRowsPerIndexRequest` configuration option that limits the number of rows returned per index lookup request. This is useful for controlling memory usage and query performance when index lookups could potentially return very large result sets.

The implementation works differently based on whether filters are applied:
- **Without filters**: Stripes can be pruned upfront based on pre-filter row counts since input rows == output rows. Middle stripes are considered for truncation (first/last stripes may be partial with unknown row counts until `lookupRowRanges()`).
- **With filters**: Truncation is applied during output tracking in `trackStripeSegmentOutputRefs()` since we don't know output rows ahead of time.

- Add `maxRowsPerIndexRequest` config in `HiveConfig` (both session and config-level)
- Extend `IndexReader::Options` to pass `maxRowsPerRequest` to the reader
- Implement stripe pruning and row range truncation in `SelectiveNimbleIndexReader`
- Add `removeRequestFromSubsequentStripes()` helper for cleaning up requests that have reached their limit
- Add comprehensive unit tests for both `HiveIndexReader` and `HiveIndexSource` APIs

Reviewed By: HuamengJiang

Differential Revision: D93313029

fbshipit-source-id: fff9d58bb5bfaed0b4f413c873387608c0064405
Differential Revision:
D93313029

Original commit changeset: fff9d58bb5bf

Original Phabricator Diff: D93313029

fbshipit-source-id: 5846b9ddaf6ca5a6ac1ce0b7463f8b537225820e
Summary:
Pull Request resolved: facebookincubator#16382

X-link: facebookincubator/nimble#481

Fix a class of issues on validation service: https://fburl.com/scuba/dwrf_reader_checksum/ku7kyj88

When the full range is null, the batch reader path in checksum stack swallows top level nulls and returns const null children instead. e.g. for schema ROW<c0:int, c1:float>, the correct behavior would be to return null for the whole row, instead of {null, null}.

Turns out the issue came from fb_velox NimbleReader calling projectColumns, which lost the top level nulls.

Reviewed By: xiaoxmeng

Differential Revision: D93098105

fbshipit-source-id: ac04e29c3df2a57a1854756be2f128390f886d67
…bator#16400)

Summary:
Pull Request resolved: facebookincubator#16400

X-link: facebookincubator/nimble#492

Original commit changeset: 5846b9ddaf6c

Original Phabricator Diff: D93313029

Reviewed By: HuamengJiang

Differential Revision: D93348953

fbshipit-source-id: a70b687917429542b3564993834cb1e61518eb7f
…arness/container/scheduler.cpp +1

Summary:
This diff removes a variable that was set, but which was not used.

LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused but set variables often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

Removing this variable will not change how your code works, but the unused variable may indicate your code isn't working the way you thought it was. If you feel the diff needs changes before landing, **please commandeer** and make appropriate changes: there are hundreds of these and responding to them individually is challenging.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Differential Revision: D93430794

fbshipit-source-id: 7ccdd92c6b688251b29e7303c9e0d83c87c6384a
…g use case (facebookincubator#16251)

Summary:
This PR introduces multi-threaded execution support for Task Barrier processing.
Previously, barrier execution in Velox was strictly limited to serial execution
mode (enforced in `Task::ensureBarrierSupport`).
This limitation prevents leveraging parallelism when tasks require barrier synchronization.
This PR lifts the restriction by broadcasting barrier splits to all drivers associated with a
leaf node and introducing driver-aware split tracking.

Pull Request resolved: facebookincubator#16251

Reviewed By: srsuryadev

Differential Revision: D92622777

Pulled By: xiaoxmeng

fbshipit-source-id: bd124e0700ea40661341af675939c6379064834a
… mode (facebookincubator#16401)

Summary:
Pull Request resolved: facebookincubator#16401

This diff fixes data races detected by ThreadSanitizer (TSAN) in the barrier processing code under multi-threaded execution mode.

**Race condition #1**: Between `Driver::startBarrier()` and `Driver::hasBarrier()`
- Write: `startBarrier()` setting `barrier_` state
- Read: `hasBarrier()` (via `isDraining()`) checking barrier state
- These accesses happen concurrently from different driver threads.

**Race condition #2**: Between `Driver::dropInput()` and `Driver::shouldDropOutput()`
- Write: `dropInput()` modifying `barrier_.dropInputOpId` (called from a different driver's thread via `Task::dropInputLocked()`)
- Read: `shouldDropOutput()` reading `barrier_.dropInputOpId` (called from this driver's own thread)

**Fix approach:**
1. Added atomic flag `hasBarrier_` to track whether barrier processing is active, with `memory_order_acquire` on reads and `memory_order_release` on writes.
2. Changed `dropInputOpId` from `std::optional<int32_t>` to `std::atomic_int32_t` with sentinel value `kNoDropInput = -1` for thread-safe cross-driver access.
3. Added `BarrierState::reset()` method to cleanly reset barrier state.
4. Note that `barrier_` state is only meaningful when `hasBarrier_` is true.
5. Added `waitForAllTasksToBeDeleted()` in `barrierAfterNoMoreSplits` and `MergeJoinTest.barrier` tests to ensure all driver threads complete before test iterations end.

The acquire-release memory ordering ensures proper synchronization: any thread that reads `hasBarrier_` as `true` is guaranteed to see the fully initialized `barrier_` state.

Reviewed By: kunigami, srsuryadev

Differential Revision: D93355327

fbshipit-source-id: 5d7d3c636bef62f58daaa036089f41ea01572d3d
…ubator#15727)

Summary:
This pull request implements `toBitMask()` for Arm NEON and optimizes how xsimd boolean operators and `toBitMask()` are combined. For details, see the commit messages.

Pull Request resolved: facebookincubator#15727

Reviewed By: Yuhta, peterenescu

Differential Revision: D93287759

Pulled By: jainxrohit

fbshipit-source-id: bbfc8a900aa73eb5e8e52aa9ab7d5ca8b4c8ac61
Summary:
closes facebookincubator#15933 with implementation as seen from presto docs: https://prestodb.io/docs/current/functions/binary.html#murmur3_x64_128-binary-varbinary

Pull Request resolved: facebookincubator#15951

Reviewed By: HeidiHan0000

Differential Revision: D93286559

Pulled By: jainxrohit

fbshipit-source-id: 82268a837ebe79221d9ce25fc37b8f12de096df6
…or#16409)

Summary:
Pull Request resolved: facebookincubator#16409

Currently we are using 10000 as the threshold for the fast path, and in
case of nested data structure (ARRAY and MAP), we often see the number of values
in a batch can be up to 1000000.  Improve this by increase the threshold
accordingly.  This is a per-process singleton and will not increase the overall
memory usage in any significant way.

Reviewed By: xiaoxmeng

Differential Revision: D93424384

fbshipit-source-id: 830b59074feff5e63f1947a8921a1558c3190f18
Summary:
ST_Equals for empty geometries should return true even if they are different geometry types.
closes: prestodb/presto#26253

Pull Request resolved: facebookincubator#15702

Reviewed By: pedroerp

Differential Revision: D93044051

Pulled By: jainxrohit

fbshipit-source-id: 22f26429c11fedaf9c2d41aece93951e9adf2da9
…ator#16388)

Summary:
Pull Request resolved: facebookincubator#16388

Replace hardcoded operator type string literals scattered across operator constructors and comparison sites with centralized `static constexpr std::string_view` constants in a new `OperatorType.h` header.

Test-only operators (e.g., "Pauser", "Throw") are intentionally excluded and can be addressed in a follow-up.

Reviewed By: kKPulla

Differential Revision: D93187082

fbshipit-source-id: 0421c8bb6cd75cd66ee803c9a6ca64f05b4931f3
Summary:
Pull Request resolved: facebookincubator#16412

Will be used in Axiom: facebookincubator/axiom#903

Reviewed By: srsuryadev

Differential Revision: D93465947

fbshipit-source-id: c2b9b7aae443e52f5170660da393e53c80342266
Summary:
Pull Request resolved: facebookincubator#16397

When running task debugger cursor with breakpoints installed, adding
an optional parameter to moveStep() to enable user to move the cursor to the
next breakpoint matching a give planId. If planId is not specified, move to the
next breakpoint.

Adding python bindings and test as appropriate.

Reviewed By: kKPulla

Differential Revision: D93314252

fbshipit-source-id: 7ddab2673b78e550dd1c4d4a25a3bcdf15383008
…cro-benchmarking (facebookincubator#16367)

Summary:
Pull Request resolved: facebookincubator#16367

Timer utility (rtdsc based) implementation to support micro-benchmarking use-cases.

Tracking the breakdown of time spent on each context with better precision. Moved this as a separate timer implementation to the directory velox/common/time/.

Reviewed By: HuamengJiang

Differential Revision: D92240619

fbshipit-source-id: 1b32676509670cfd2041bb9e38e2b1353fae2526
…6426)

Summary:
Pull Request resolved: facebookincubator#16426

X-link: facebookincubator/nimble#494

This diff introduces checked cast APIs (`asChecked`, `checkedContext`) across Nimble's type system to provide safer dynamic casting with clear error messages when casts fail. It also includes several code quality improvements and test file renames for consistency.

**Key changes:**

1. **Checked cast APIs:**
   - `StatisticsCollector::asChecked<T>()` - Casts with `NIMBLE_DCHECK_NOT_NULL` assertion
   - `StreamDescriptorBuilder::checkedContext<T>()` - Safe context retrieval
   - `TypeBuilder::checkedContext<T>()` - Safe context retrieval

2. **API rename:** `isShared()` → `shared()` for consistency with naming conventions

3. **Code cleanup:**
   - Replace `as<>()` with `asChecked<>()` where cast is expected to succeed
   - Use `asChecked<velox::FlatMapVector>()` instead of manual null check
   - Add `const` qualifiers where appropriate
   - Replace `VELOX_CHECK` with `NIMBLE_CHECK` for consistency
   - Flatten nested if-else in `initializeEncodingLayouts` with early return for FlatMap
   - Rename macro `_SET_STREAM_CONTEXT` to `SET_STREAM_CONTEXT` (remove leading underscore)

4. **Test file renames:** Standardize test file naming from `*Tests.cpp` to `*Test.cpp`

Reviewed By: Yuhta

Differential Revision: D93427282

fbshipit-source-id: 67b7c3014597281c7092af6d1d4d7ccac2ebec29
…acebookincubator#16434)

Summary:
Pull Request resolved: facebookincubator#16434

Refactor the Driver's barrier processing state management by consolidating the standalone `hasBarrier_` atomic flag into the `BarrierState` struct as `active`. This provides better encapsulation and cleaner state management.

Key changes:
- Move `hasBarrier_` into `BarrierState` and rename to `active`
- Add `start()` method to activate the barrier with precondition check
- Add `reset()` method to deactivate and clear all barrier state
- Remove redundant `barrier_.reset()` call from `startBarrier()` since the precondition check guarantees clean state
- Add VELOX_CHECK assertions in `start()`/`reset()` to enforce state transition invariants

Reviewed By: srsuryadev

Differential Revision: D93574459

fbshipit-source-id: 1e71fc04981a34a60b7faff72c7eb73133b11da9
…al SplitsStore subclass (facebookincubator#16442)

Summary: Pull Request resolved: facebookincubator#16442

Reviewed By: xiaoxmeng

Differential Revision: D93608665

fbshipit-source-id: cf15ba5db9258f4a4e39cc7183848afc6f26ea05
Summary:
Pull Request resolved: facebookincubator#16445

The test can be failed as
```
Exceeded memory allocator limit when allocating 1151 new pages for total allocation of 1151 pages, the memory allocator capacity is 16384 pages, the allocated pages is 17512, Source: RUNTIME, ErrorCode: MEM_ALLOC_ERROR
```
cap the Alloc to 2mb

Reviewed By: xiaoxmeng

Differential Revision: D93629537

fbshipit-source-id: a89f1899255076302ffc28d2174919a7f39fddf4
Summary:
Pull Request resolved: facebookincubator#16443

The test may fail when multiple HashBuild and HashProbe operators are in non-reclaimable state

```
W0218 07:23:24.665766 341086 HashBuild.cpp:1310] Can't reclaim from hash build operator, state_[RUNNING], nonReclaimableSection_[1], op.4.1.0.HashBuild, usage: 22.50MB
W0218 07:23:24.665939 341086 HashProbe.cpp:1894] Can't reclaim from hash probe operator, state_[WAIT_FOR_BUILD], nonReclaimableSection_[0], inputSpiller_[nullptr], table_[nullptr], table_ numDistinct[nullptr], op.4.0.3.HashProbe, usage: 0B, node pool reservation: 128.00MB
```

Eventually, the process fails with the following error:

```
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
what(): Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: MEM_CAP_EXCEEDED
```

Reviewed By: xiaoxmeng

Differential Revision: D93619111

fbshipit-source-id: d333f53248cf4f3a7857f6bb920a275cea3e5009
…e of the encoded Vector in clustered mode (facebookincubator#16430)

Summary:
Pull Request resolved: facebookincubator#16430

Currently, when ArbitraryAggregate is used in clustered mode (used in streaming
aggregations) for each group it stores the Vector the first non-null value came from and its
position. As part of doing this it decodes the Vector to see if the value is null.

When we produce the output from ArbitraryAggregate we either copy the values for each
group, wrap the Vector in a dictionary if all results in a batch come from the same Vector, or
return a slice of the Vector directly if the results in a batch appear sequentially in the same
Vector. This means we either need to decode the Vector again to do the copy, or will likely
decode it at some point downstream when we further process the results.

Given we've already done the work of decoding it, we can store the base Vector in the
ArbitraryAggregate's state, along with the decoded index, so that we don't need to decode it
again. This saves us a little bit of additional work.

The down side is that if the results appear sequentially in the encoded Vector we may now
dictionary encode instead of taking a slice, but on the other hand if the results appear
sequentially in the base Vector we will now take a slice instead of dictionary encoding.

Reviewed By: xiaoxmeng

Differential Revision: D93531319

fbshipit-source-id: 0169adc586367a1dd3434f7b36c40ca7610c5ca2
jiahaol-work and others added 14 commits February 19, 2026 04:41
…ubator#16432)

Summary:
Pull Request resolved: facebookincubator#16432

make state_ atomic to be used in SST writer parallel encode

Reviewed By: xiaoxmeng

Differential Revision: D93561815

fbshipit-source-id: 2593c4588a9eee839f8006f74ca88b1171462b29
…or#16410)

Summary:
Pull Request resolved: facebookincubator#16410

Add `indexEnabled` config property to HiveConfig that controls cluster index-based row pruning. The property is wired into RowReaderOptions via configureRowReaderOptions().

Reviewed By: xiaoxmeng

Differential Revision: D93435795

fbshipit-source-id: b1b752f724f6b2dbc2694963319a3908268fa75d
…facebookincubator#16408)

Summary:
Pull Request resolved: facebookincubator#16408

In the current code when we do the dense loop on `rows` in `rowLoop`,
we stops earlier than necessary.  This results in forwarding most of the work to
the slow `sparseN` calls in case `begin` is not 0, and degrades the performance.
Fix this by considering `begin` when doing dense loop.

Reviewed By: xiaoxmeng

Differential Revision: D93424377

fbshipit-source-id: ef5c1cb5d8266176407af4c62a8096f10dc3bdff
…kincubator#16447)

Summary:
Pull Request resolved: facebookincubator#16447

ExprToSubfieldFilterParser::makeOrFilter currently only supports merging BigintRange and floating-point range filters, but not BytesValues (VARCHAR equality filters).

This causes OR expressions on VARCHAR columns like `n_name = 'FRANCE' OR n_name = 'GERMANY'` to remain as remaining filters instead of being pushed down as BytesValues('FRANCE', 'GERMANY').

This change adds a tryMergeBytesValues helper (similar to tryMergeBigintRanges) that:
1. Checks if all disjuncts are BytesValues filters
2. Collects all values from the disjuncts
3. Returns a single BytesValues filter containing the union of all values

This diff is to fix facebookincubator#16440

Reviewed By: Yuhta

Differential Revision: D93635127

fbshipit-source-id: 9dddb26cd4ed5a45a459d2018f9eb5421619edd4
…acebookincubator#16420)

Summary:
Pull Request resolved: facebookincubator#16420

Addresses problems like this:

```
void* buf = /*...*/;
uint16_t word = /*...*/;
storeUnaligned(buf, word & 0xffff); // oops, 32-bit store due to implicit integer promotion
```

Reviewed By: Yuhta

Differential Revision: D93482910

fbshipit-source-id: 3e393a68874231ca75c2d35827a0b10aef654f51
…ework (facebookincubator#16256)

Summary:
Pull Request resolved: facebookincubator#16256

Follow-up to D87932003. Remove the backward compatibility initialize() overload that was added for Koski batch functions.

Now that Koski batch functions pass MemoryPool to initialize() (D92427523), we can remove the overload without MemoryPool parameter that was kept for backward compatibility.

Reviewed By: Yuhta

Differential Revision: D92427521

fbshipit-source-id: 2ab806b85f8cd4f3104639b1907861e3c0b30c34
…tor (facebookincubator#16459)

Summary:
Pull Request resolved: facebookincubator#16459

The isDate / isDecimal type checks need dynamic cast checking.

Doing this for every element of every array in castToJson operator shows severe performance
penalty, to the point where 50% of the operator cost is just dynamic_cast ops.

 {F1985476080}

Arrays are homogenous, we can do this check once for the whole array and re-use it.
This diff implements this change.

After this change, for the same operator run, we see that the dynamic_cast cost is eliminated
 {F1985476152}

This should make the FilterProject operator ~2x cpu efficient for these cases

Reviewed By: xiaoxmeng

Differential Revision: D92987919

fbshipit-source-id: 0e1862507f5a7f00eb80e92205aefa985a14548e
Summary:
```C++
Undefined symbols for architecture arm64:
  "facebook::velox::serializer::IndexBounds::set(facebook::velox::serializer::IndexBound, facebook::velox::serializer::IndexBound)", referenced from:

Reviewed By: duxiao1212

Differential Revision: D93986758

Pulled By: xiaoxmeng

fbshipit-source-id: 4dfe47f48c33b1ff7db64bd7eb7986ff4bb10fc7
Summary:
Pull Request resolved: facebookincubator#16471

X-link: facebookincubator/nimble#502

This change adds a new static method `BaseVector::reusable()` that recursively checks if a vector and all its nested children are reusable (have `use_count == 1`).

**Problems:**
1. The existing `verifyVectorState()` check in nimble FieldReader only verified the top-level vector's `use_count`, but complex types like `ARRAY<ROW<BIGINT, REAL>>` can have nested children that are shared across different parent vectors. When `prepareForReuse()` is called on a vector whose nested children are still referenced elsewhere, it corrupts those shared children by resetting them to size 0.

2. In `BatchMutationManager::getChunkedMutationStatus()`, the same input vector check only compared pointers, missing the case where memory is reused but the vector contains different data (e.g., due to vector replacement from `recursivelyReusable` checks).

3. The NIMBLE row numbers reader was caching the `rowNumbers` child vector, which prevented vector reuse in the nimble field reader underneath.

**Solution:**
- Add `BaseVector::reusable()` which recursively checks `use_count` for ROW children, ARRAY elements, and MAP keys/values
- Update `verifyVectorState()` in nimble FieldReader to use the new recursive check
- Fix `BatchMutationManager` to compare both pointer AND first row number value to detect vector replacement
- Fix NIMBLE row numbers reader to not cache the rowNumbers child vector
- Add unit tests for the new method

**Additional cleanups:**
- Refactor test fixtures to follow Google Test naming conventions (`RowNumberTest`, `MergeSplitReaderTest`, `BatchMutationManagerTest`)
- Use constructor initializer lists for memory pool setup in tests
- Remove deprecated `deprecatedAddDefaultLeafMemoryPool` usage
- Add parameterized test name generators for better test output

Reviewed By: Yuhta

Differential Revision: D93790972

fbshipit-source-id: 23f353ae988adb89545ff54d6f5ad546485fed73
…tor#16480)

Summary:
Pull Request resolved: facebookincubator#16480

Add `resolveWindowResultType` and `resolveWindowResultTypeWithCoercions`
to the window function registry, mirroring the existing aggregate
function APIs (`resolveResultType`, `resolveResultTypeWithCoercions`).

Both APIs try window function signatures first, then fall back to
aggregate function signatures since all aggregate functions can be used
as window functions.

Reviewed By: amitkdutta

Differential Revision: D93990936

fbshipit-source-id: a2687c49249769650682c7eefc40a8f30a8d1461
…ebookincubator#16411)

Summary:
Pull Request resolved: facebookincubator#16411

Replace hardcoded runtime stat string literals in HiveDataSource::getRuntimeStats() with centralized static inline const std::string constants on the HiveDataSource class. Also updated CudfHiveDataSource to use HiveDataSource::kTotalScanTime and Connector::kTotalRemainingFilterTime instead of hardcoded strings.

Constants added: kNumPrefetch, kPrefetchBytes, kTotalScanTime, kOverreadBytes, kStorageReadBytes, kNumLocalRead, kLocalReadBytes, kNumRamRead, kRamReadBytes, kNumBucketConversion.

Reviewed By: xiaoxmeng

Differential Revision: D93369914

fbshipit-source-id: 9bcdd72a7313e5f4ba31c16a7457b7455e092e38
…ebookincubator#16450)

Summary:
Pull Request resolved: facebookincubator#16450

Replace hardcoded operator type string literals ("HashBuild", "HashProbe", "TableScan", "Window", "Values", "Aggregation") with OperatorType::k* constants from OperatorType.h in test and benchmark files. Also replace hardcoded runtime stat key strings ("maxSpillLevel", "abandonBuildNoDupHash") with their corresponding class constants where the files were already being modified.

Reviewed By: xiaoxmeng

Differential Revision: D93369889

fbshipit-source-id: 16e21e1f3d77e8c8d02796dc5a84f3982c72ad08
…er (facebookincubator#16481)

Summary:
Pull Request resolved: facebookincubator#16481

Move shared types (OrderByClause, AggregateExpr, WindowType, BoundType,
WindowFrame, WindowExpr) from DuckParser.h to SqlExpressionsParser.h as
single source of truth.

Add parseAggregateExpr, parseWindowExpr, and parseScalarOrWindowExpr
pure virtual methods to SqlExpressionsParser. Implement in
DuckSqlExpressionsParser.

Rename AggregateExpr::maskExpr to filter for clarity.

bypass-github-export-checks

Reviewed By: xiaoxmeng

Differential Revision: D93997347

fbshipit-source-id: f1c99a9a3ffd69b9a4d7fcbb4dab73ffb09afac2
…cebookincubator#16485)

Summary:
Pull Request resolved: facebookincubator#16485

X-link: facebookincubator/nimble#505

Minor style improvements:
- Use const auto where applicable
- Brace initialization
- Explicit null checks (if (ptr != nullptr) instead of if (ptr))
- Use asChecked<> instead of as<> with separate null check
- NIMBLE_DCHECK_EQ instead of NIMBLE_DCHECK for equality checks

Reviewed By: duxiao1212

Differential Revision: D94037387

fbshipit-source-id: 25ad778a89ff4fca731e5aa17c599d742b043ca2
rui-mo and others added 11 commits February 23, 2026 14:00
Alchemy-item: (ID = 1043) [OAP] Support struct schema evolution matching by name commit 1/1 - 5c132f1
Alchemy-item: (ID = 883) [OAP] [13620] Allow reading integers into smaller-range types  commit 1/1 - 4cae2f5
…ter join

Signed-off-by: Yuan <yuanzhou@apache.org>

Alchemy-item: (ID = 1095) [OAP] [11771] Fix smj result mismatch issue commit 1/1 - 791678d
Alchemy-item: (ID = 1094) Iceberg staging hub commit 1/4 - e186144
The function toValues removes duplicated values from the vector and
return them in a std::vector. It was used to build an InPredicate. It
will be needed for building NOT IN filters for Iceberg equality delete
read as well, therefore moving it from velox/functions/prestosql/InPred
icate.cpp to velox/type/Filter.h. This commit also renames it to
deDuplicateValues to make it easier to understand.

feat(connector): Support reading Iceberg split with equality deletes

This commit introduces EqualityDeleteFileReader, which is used to read
Iceberg splits with equality delete files. The equality delete files
are read to construct domain filters or filter functions, which then
would be evaluated in the base file readers.

When there is only one equality delete field, and when that field is
an Iceberg identifier field, i.e. non-floating point primitive types,
the values would be converted to a list as a NOT IN domain filter,
with the NULL treated separately. This domain filter would then be
pushed to the ColumnReaders to filter our unwanted rows before they
are read into Velox vectors. When the equality delete column is a
nested column, e.g. a sub-column in a struct, or the key in a map,
such column may not be in the base file ScanSpec. We need to add/remove
these subfields to/from the SchemaWithId and ScanSpec recursively if
they were not in the ScanSpec already. A test is also added for such
case.

If there are more than one equality delete field, or the field is not
an Iceberg identifier field, the values would be converted to a typed
expression in the conjunct of disconjunts form. This expression would
be evaluated as the remaining filter function after the rows are read
into the Velox vectors. Note that this only works for Presto now as
the "neq" function is not registered by Spark. See https://github.com/
facebookincubator/issues/12667

Note that this commit only supports integral types. VARCHAR and
VARBINARY need to be supported in future commits (see
facebookincubator#12664).

Co-authored-by: Naveen Kumar Mahadevuni <Naveen.Mahadevuni@ibm.com>

Alchemy-item: (ID = 1094) Iceberg staging hub commit 2/4 - a2e40be
Add iceberg partition transforms.

Co-authored-by: Chengcheng Jin <Chengcheng.Jin@ibm.com>

Add NaN statistics to parquet writer.

Collect Iceberg data file statistics in dwio.

Integrate Iceberg data file statistics and adding unit test.

Support write field_id to parquet metadata SchemaElement.

Implement iceberg sort order

Add clustered Iceberg writer mode.

Fix parquet writer ut

Add IcebergConnector

Fix unittest error

Alchemy-item: (ID = 1094) Iceberg staging hub commit 3/4 - f552f4c
Alchemy-item: (ID = 1094) Iceberg staging hub commit 4/4 - addac8b
Signed-off-by: Yuan <yuanzhou@apache.org>

Alchemy-item: (ID = 906) fix: Adding daily tests commit 1/2 - e2eb2c6
we can cache ccache on every build even on failure, since ibm/velox is
always incremental build

Alchemy-item: (ID = 906) fix: Adding daily tests commit 2/2 - 0899ddc
1
Alchemy-item: (ID = 988) Add fileNameGenerator to the constructor of IcebergInsertTableHandle commit 1/1 - a5f7e46
Alchemy-item: (ID = 1046) Remove website folder commit 1/1 - 1787292
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.