Skip to content

fix: Ignore string column statistics for parquet-mr versions before 1.8.2#16744

Open
lifulong wants to merge 1 commit intofacebookincubator:mainfrom
lifulong:fix_parquet_181_undefined_min_max
Open

fix: Ignore string column statistics for parquet-mr versions before 1.8.2#16744
lifulong wants to merge 1 commit intofacebookincubator:mainfrom
lifulong:fix_parquet_181_undefined_min_max

Conversation

@lifulong
Copy link
Contributor

@lifulong lifulong commented Mar 12, 2026

Parquet files written by parquet-mr versions before 1.8.2 contain a bug where string (binary) column statistics min and max are computed using signed byte ordering rather than the standard unsigned lexicographic (memcmp) ordering.
Velox compares row group statistics against filter values using memcmp, which is an unsigned byte comparison. When reading files produced by parquet-mr < 1.8.2, this mismatch causes incorrect row group filtering: row groups that should be scanned are skipped, leading to wrong query results.

fix #16743

@lifulong lifulong requested a review from majetideepak as a code owner March 12, 2026 11:15
@netlify
Copy link

netlify bot commented Mar 12, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 507370c
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69bb6564deef0c0008549b82

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 12, 2026
@lifulong lifulong changed the title Fix: compatible with parquet-1.8.1 min/max not defined Fix: fix error equal filter result with parquet-1.8.1 and column meta min/max not defined Mar 12, 2026
@lifulong lifulong changed the title Fix: fix error equal filter result with parquet-1.8.1 and column meta min/max not defined fix: Fix error equal filter result with parquet-1.8.1 and column meta min/max not defined Mar 12, 2026
@mbasmanova mbasmanova requested a review from PingLiuPing March 12, 2026 11:26
@lifulong lifulong force-pushed the fix_parquet_181_undefined_min_max branch 3 times, most recently from 3865ecc to 1cf0ced Compare March 13, 2026 07:43
Copy link
Collaborator

@PingLiuPing PingLiuPing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check this comment #16743 (comment) ? What's your opinion?
If you agree then we should change the PR description and title to reflect that (i.e the root cause is not min/max missing). If not then the bug still hiding and we need more investigation.

Your parquet shows:
min = "三星应用商店"
max = "vivo预装"

And you are using "360手机助手" as filter.

With signed byte ordering, the above stats are correct. Meaning
三星应用商店 < 360手机助手 < vivo预装
But Velox compare the stats using memcmp which is a standard lexicographic byte comparison. So the order is:
360手机助手 < vivo预装 < 三星应用商店

See

int compare = memcmp(lhs, rhs.data(), size);

EXPECT_EQ(reader->numberOfRows(), 10ULL);
}

TEST_F(ParquetReaderTest, parquetMR181SkipsFilterRowGroupsByStringStats) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a case similar to TEST_F(ParquetReaderTest, varcharFilters) as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't gen simple data file to reproduce ths issue , is this test case ok, has check parquet version 1.8.1 and 1.8.2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this case to E2EFilterTest.cpp ?

// Reproduces the real-world scenario from the bug report. Parquet-mr 1.8.1
// computed binary column min/max using signed byte ordering, which differs from
// the unsigned lexicographic (memcmp) ordering Velox uses.
//
// With signed byte ordering:  三星应用商店 < 360手机助手 < vivo预装
// With memcmp byte ordering:  360手机助手  < vivo预装    < 三星应用商店
//
// A row group containing {"三星应用商店", "vivo预装"} has memcmp-based stats
// min="vivo预装", max="三星应用商店". A filter for "360手机助手" falls below the
// memcmp min, so the row group would be incorrectly skipped — even though it
// should match under the signed ordering that parquet-mr 1.8.1 used to write
// the stats.
TEST_F(E2EFilterTest, parquetMRVersionStringStatsRowGroupFiltering) {
  const std::string kSanXing = "三星应用商店";
  const std::string kVivo = "vivo预装";
  const std::string k360 = "360手机助手";

  auto rowType = ROW({"s"}, {VARCHAR()});

  auto writeAndGetStats =
      [&](const std::string& createdBy,
          RuntimeStatistics& stats) {
    options_.memoryPool = E2EFilterTestBase::rootPool_.get();
    options_.createdBy = createdBy;
    // Flush after every 5 rows to create separate row groups.
    options_.flushPolicyFactory = []() {
      return std::make_unique<LambdaFlushPolicy>(
          /*rowsInRowGroup=*/5,
          /*bytesInRowGroup=*/1'024 * 1'024,
          []() { return false; });
    };

    auto sink = std::make_unique<MemorySink>(
        200 * 1024 * 1024,
        FileSink::Options{.pool = leafPool_.get()});
    auto* sinkPtr = sink.get();
    auto writer = std::make_unique<parquet::Writer>(
        std::move(sink), options_, rowType);
    // Row group 1: contains the value we will filter for ("360手机助手").
    writer->write(makeRowVector(
        {"s"},
        {makeFlatVector<std::string>(
            {k360, kSanXing, kVivo, k360, kSanXing})}));
    // Row group 2: does not contain "360手机助手".
    writer->write(makeRowVector(
        {"s"},
        {makeFlatVector<std::string>(
            {kSanXing, kVivo, kSanXing, kVivo, kSanXing})}));
    writer->close();

    dwio::common::ReaderOptions readerOptions{leafPool_.get()};
    auto input = std::make_unique<BufferedInput>(
        std::make_shared<InMemoryReadFile>(
            std::string(sinkPtr->data(), sinkPtr->size())),
        readerOptions.memoryPool());
    auto reader = makeReader(readerOptions, std::move(input));
    auto& parquetReader = dynamic_cast<ParquetReader&>(*reader);
    EXPECT_EQ(parquetReader.fileMetaData().numRowGroups(), 2);

    auto scanSpec = std::make_shared<ScanSpec>("");
    scanSpec->addAllChildFields(*rowType);
    // Equality filter: s = "360手机助手".
    scanSpec->getOrCreateChild(Subfield("s"))
        ->setFilter(std::make_unique<BytesRange>(
            k360, false, false, k360, false, false, false));

    RowReaderOptions rowReaderOpts;
    rowReaderOpts.select(
        std::make_shared<ColumnSelector>(rowType, rowType->names()));
    rowReaderOpts.setScanSpec(scanSpec);
    auto rowReader = reader->createRowReader(rowReaderOpts);

    VectorPtr result = BaseVector::create(rowType, 1, leafPool_.get());
    uint64_t totalRows{0};
    while (rowReader->next(1'000, result)) {
      totalRows += result->size();
    }
    EXPECT_EQ(totalRows, 2);

    rowReader->updateRuntimeStats(stats);
  };

  // parquet-mr 1.8.2: stats are trusted. Under memcmp ordering, row group 1
  // has min="360手机助手" max="三星应用商店" which contains "360手机助手", so it
  // is read. Row group 2 has min="vivo预装" max="三星应用商店" which does not
  // contain "360手机助手" (it falls below memcmp min), so it is skipped.
  RuntimeStatistics stats182;
  writeAndGetStats("parquet-mr version 1.8.2", stats182);
  EXPECT_EQ(stats182.skippedStrides, 1);
  EXPECT_EQ(stats182.processedStrides, 1);

  // parquet-mr 1.8.1: stats are untrusted (signed byte ordering bug), so no
  // row groups are skipped. Both row groups are scanned.
  RuntimeStatistics stats181;
  writeAndGetStats("parquet-mr version 1.8.1", stats181);
  EXPECT_EQ(stats181.skippedStrides, 0);
  EXPECT_EQ(stats181.processedStrides, 2);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test is really professional. Learned a lot.

@PingLiuPing
Copy link
Collaborator

An example: https://godbolt.org/z/dj9acqzT9

@lifulong
Copy link
Contributor Author

@PingLiuPing thanks for your reply!
I tried to create a file that can reproduce this issue, but currently, we no longer have any engines using parquet-1.8.1 to generate data, so I wasn't able to successfully create a simple data file reproduce this issue. It's not convenient to directly provide the production data files from our company's production environment.
As you said, the problem is indeed with sorting, not with undefined min/max values. It should be an issue with the parquet-1.8.1 version we used in the past. The historical data from 2023 has this problem, while the latest data no longer has such inconsistent results. At the same time, I also found the records of relevant changes in the parquet community:
apache/parquet-java#362
apache/parquet-java#367

@PingLiuPing
Copy link
Collaborator

@lifulong Thanks. Let’s update the PR description to better reflect the root cause, and add a test case to verify that when toggling SemanticVersion, the row group filter is either applied or skipped accordingly.

@lifulong lifulong changed the title fix: Fix error equal filter result with parquet-1.8.1 and column meta min/max not defined fix: parquet-1.8.1 column meta min/max order use signed format, diff with new parquet version Mar 16, 2026
@lifulong
Copy link
Contributor Author

@lifulong Thanks. Let’s update the PR description to better reflect the root cause, and add a test case to verify that when toggling SemanticVersion, the row group filter is either applied or skipped accordingly.

ok

@lifulong lifulong force-pushed the fix_parquet_181_undefined_min_max branch from 1cf0ced to a33b383 Compare March 16, 2026 11:20
@lifulong lifulong changed the title fix: parquet-1.8.1 column meta min/max order use signed format, diff with new parquet version fix: Parquet-1.8.1 column meta min/max order use signed format, diff with new parquet version Mar 16, 2026
@lifulong
Copy link
Contributor Author

@lifulong Thanks. Let’s update the PR description to better reflect the root cause, and add a test case to verify that when toggling SemanticVersion, the row group filter is either applied or skipped accordingly.

@PingLiuPing hi, has update the test, can you review again?

EXPECT_EQ(reader->numberOfRows(), 10ULL);
}

TEST_F(ParquetReaderTest, parquetMR181SkipsFilterRowGroupsByStringStats) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldIgnoreStatsForParquetMRVersions

EXPECT_EQ(reader->numberOfRows(), 10ULL);
}

TEST_F(ParquetReaderTest, parquetMR181SkipsFilterRowGroupsByStringStats) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this case to E2EFilterTest.cpp ?

// Reproduces the real-world scenario from the bug report. Parquet-mr 1.8.1
// computed binary column min/max using signed byte ordering, which differs from
// the unsigned lexicographic (memcmp) ordering Velox uses.
//
// With signed byte ordering:  三星应用商店 < 360手机助手 < vivo预装
// With memcmp byte ordering:  360手机助手  < vivo预装    < 三星应用商店
//
// A row group containing {"三星应用商店", "vivo预装"} has memcmp-based stats
// min="vivo预装", max="三星应用商店". A filter for "360手机助手" falls below the
// memcmp min, so the row group would be incorrectly skipped — even though it
// should match under the signed ordering that parquet-mr 1.8.1 used to write
// the stats.
TEST_F(E2EFilterTest, parquetMRVersionStringStatsRowGroupFiltering) {
  const std::string kSanXing = "三星应用商店";
  const std::string kVivo = "vivo预装";
  const std::string k360 = "360手机助手";

  auto rowType = ROW({"s"}, {VARCHAR()});

  auto writeAndGetStats =
      [&](const std::string& createdBy,
          RuntimeStatistics& stats) {
    options_.memoryPool = E2EFilterTestBase::rootPool_.get();
    options_.createdBy = createdBy;
    // Flush after every 5 rows to create separate row groups.
    options_.flushPolicyFactory = []() {
      return std::make_unique<LambdaFlushPolicy>(
          /*rowsInRowGroup=*/5,
          /*bytesInRowGroup=*/1'024 * 1'024,
          []() { return false; });
    };

    auto sink = std::make_unique<MemorySink>(
        200 * 1024 * 1024,
        FileSink::Options{.pool = leafPool_.get()});
    auto* sinkPtr = sink.get();
    auto writer = std::make_unique<parquet::Writer>(
        std::move(sink), options_, rowType);
    // Row group 1: contains the value we will filter for ("360手机助手").
    writer->write(makeRowVector(
        {"s"},
        {makeFlatVector<std::string>(
            {k360, kSanXing, kVivo, k360, kSanXing})}));
    // Row group 2: does not contain "360手机助手".
    writer->write(makeRowVector(
        {"s"},
        {makeFlatVector<std::string>(
            {kSanXing, kVivo, kSanXing, kVivo, kSanXing})}));
    writer->close();

    dwio::common::ReaderOptions readerOptions{leafPool_.get()};
    auto input = std::make_unique<BufferedInput>(
        std::make_shared<InMemoryReadFile>(
            std::string(sinkPtr->data(), sinkPtr->size())),
        readerOptions.memoryPool());
    auto reader = makeReader(readerOptions, std::move(input));
    auto& parquetReader = dynamic_cast<ParquetReader&>(*reader);
    EXPECT_EQ(parquetReader.fileMetaData().numRowGroups(), 2);

    auto scanSpec = std::make_shared<ScanSpec>("");
    scanSpec->addAllChildFields(*rowType);
    // Equality filter: s = "360手机助手".
    scanSpec->getOrCreateChild(Subfield("s"))
        ->setFilter(std::make_unique<BytesRange>(
            k360, false, false, k360, false, false, false));

    RowReaderOptions rowReaderOpts;
    rowReaderOpts.select(
        std::make_shared<ColumnSelector>(rowType, rowType->names()));
    rowReaderOpts.setScanSpec(scanSpec);
    auto rowReader = reader->createRowReader(rowReaderOpts);

    VectorPtr result = BaseVector::create(rowType, 1, leafPool_.get());
    uint64_t totalRows{0};
    while (rowReader->next(1'000, result)) {
      totalRows += result->size();
    }
    EXPECT_EQ(totalRows, 2);

    rowReader->updateRuntimeStats(stats);
  };

  // parquet-mr 1.8.2: stats are trusted. Under memcmp ordering, row group 1
  // has min="360手机助手" max="三星应用商店" which contains "360手机助手", so it
  // is read. Row group 2 has min="vivo预装" max="三星应用商店" which does not
  // contain "360手机助手" (it falls below memcmp min), so it is skipped.
  RuntimeStatistics stats182;
  writeAndGetStats("parquet-mr version 1.8.2", stats182);
  EXPECT_EQ(stats182.skippedStrides, 1);
  EXPECT_EQ(stats182.processedStrides, 1);

  // parquet-mr 1.8.1: stats are untrusted (signed byte ordering bug), so no
  // row groups are skipped. Both row groups are scanned.
  RuntimeStatistics stats181;
  writeAndGetStats("parquet-mr version 1.8.1", stats181);
  EXPECT_EQ(stats181.skippedStrides, 0);
  EXPECT_EQ(stats181.processedStrides, 2);
}

@PingLiuPing PingLiuPing changed the title fix: Parquet-1.8.1 column meta min/max order use signed format, diff with new parquet version fix: Ignore string column statistics for parquet-mr versions before 1.8.2 Mar 18, 2026
@lifulong lifulong force-pushed the fix_parquet_181_undefined_min_max branch 2 times, most recently from fc0176d to a4a81b0 Compare March 19, 2026 02:44
@lifulong lifulong force-pushed the fix_parquet_181_undefined_min_max branch from a4a81b0 to 507370c Compare March 19, 2026 02:54
Copy link
Collaborator

@PingLiuPing PingLiuPing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@PingLiuPing
Copy link
Collaborator

@mbasmanova Could you please take a final look at your convenience? The root cause has been clearly identified, and the test coverage is now solid.

@mbasmanova
Copy link
Contributor

@PingLiuPing Would it be possible to update the PR description?

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Mar 21, 2026
@PingLiuPing
Copy link
Collaborator

@PingLiuPing Would it be possible to update the PR description?

Thank you @mbasmanova, PR description been updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet-1.8.1 min/max order error

3 participants