build: Use FBThrift instead of Apache Thrift (#16019)#16019
build: Use FBThrift instead of Apache Thrift (#16019)#16019peterenescu wants to merge 5 commits intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
@peterenescu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90723225. |
c2f1992 to
44e3c48
Compare
44e3c48 to
e3e5d96
Compare
e3e5d96 to
a117ed7
Compare
a117ed7 to
95858ee
Compare
95858ee to
00b6209
Compare
00b6209 to
6ec8219
Compare
dcc5f6a to
fec4ca4
Compare
fec4ca4 to
1d18404
Compare
czentgr
left a comment
There was a problem hiding this comment.
Can you please explain the rationale a bit more? Looks like we want Arrow to link to fbthrift instead of apache thrift. I suppose fbthrift is compatible with apache thrift for arrow usage?
| VELOX_BUILD_SHARED: "ON" | ||
| VELOX_ARROW_CMAKE_PATCH: ${{ github.workspace }}/velox/CMake/resolve_dependency_modules/arrow/cmake-compatibility.patch | ||
| run: | | ||
| if git diff --name-only HEAD^1 HEAD | grep -q "scripts/setup-"; then |
There was a problem hiding this comment.
Make sure this is only triggered on a pull_request.
Similar to #15896
| # https://github.com/facebook/folly/issues/1666 for it. | ||
| install_boost | ||
| install_fmt | ||
| install_folly |
There was a problem hiding this comment.
The setup script for macOS also installs fast_float just like for Linux. So you probably need that here too.
| | wangle | v2025.04.28.00 | No || | ||
| | mvfst | v2025.04.28.00 | No || | ||
| | fbthrift | v2025.04.28.00 | No || | ||
| | folly | v2025.09.15.00 | Yes || |
There was a problem hiding this comment.
I'm trying to upgrade folly to a later version (2026.01.05.00). We will have a conflict here. The only issue for now is the velox_remote_functions_client_test is failing. I'm investigating why. It shouldn't fail. See #15967
Why did you pick this version (or just because that was the latest at the time you started)?
1d18404 to
1bb6b17
Compare
|
Is this similar to #14942? |
02cb3d7 to
3d90ea7
Compare
3d90ea7 to
390849b
Compare
|
@czentgr @peterenescu I wasn’t able to reproduce this issue using a pure scan, but could always reproduce it when running TPC-DS q4. I wonder if you have a way to test the TPC-DS queries. I can upload my test data for reproduction if needed. I assume further tests will be needed to verify the functionality of this PR, as merging it without sufficient tests could be risky. |
|
@czentgr @peterenescu This issue doesn’t appear to be a corner case. When I tested the first 10 TPC-DS queries, Q4, Q5, and Q9 all failed with the same core dump, while the others haven’t been verified yet. I’m still working on finding a simpler query to reproduce the issue. It would also be very helpful if anyone could help verify the TPC-DS queries. Thanks. |
|
@rui-mo Thanks, it is interesting because again it is in the seeking page header logic. It must be hitting another edge case. |
|
Hi @czentgr @peterenescu, I’m finally able to reproduce this issue in Velox using a simple query plan. Would you like to take a try? BTW, it appears that removing |
| @@ -118,6 +117,7 @@ jobs: | |||
| mkdir /tmp/build | |||
| cd /tmp/build | |||
| source /opt/rh/gcc-toolset-12/enable | |||
| export VELOX_ARROW_CMAKE_PATCH="${GITHUB_WORKSPACE}/CMake/resolve_dependency_modules/arrow/arrow-testing-boost.patch ${GITHUB_WORKSPACE}/CMake/resolve_dependency_modules/arrow/cmake-compatibility.patch" | |||
There was a problem hiding this comment.
Arrow was updated from Arrow 15 to Arrow 18. I see some issues with dependencies of Arrow for testing and boost in the CI. Looks like this patch was supposedly fixing this?
There was a problem hiding this comment.
Oh, sorry. I missed the upgrade.
Yes. It's for fixing arrow-testing and Boost but it's incompleted yet.
I tried fixing it with the patch when I have time but I couldn't fix it entirely. I will retry it when I have time again but I don't have enough time for now. Sorry. If the patch has any trouble, you can remove the patch. I may re-add something when I retry.
|
@peterenescu @rui-mo I've been working on a fix for the issue with the Parquet reading part. I have a fix for this. I ran a Presto TPCDS SF1k workload and it passed successfully. |
|
@czentgr Thanks for verifying TPC-DS. I noticed this issue on a smaller dataset (SF500). |
|
@peterenescu has imported this pull request. If you are a Meta employee, you can view this in D90723225. |
|
@rui-mo Would you be able to do another run? I ran on the SF1k and had no problems. I will re-run with SF10k TPCDS. |
|
Thanks @czentgr for the fix! I’ll run another test to verify it. By the way, I noticed another potential incompatibility between FBThrift and Apache Thrift (see #13175 (comment)) and wonder if there might be other issues. |
|
@rui-mo There should not be any compatibility issues between fbthrift and thrift since both are expected to have the same on-the-wire byte order (big endian). If at all there is a difference, it's a bug which we can fix. Not moving to fbthrift is a blocker for multiple other Velox improvements. |
|
@czentgr I verified the previously failed queries as well as all TPC-DS queries (SF500) on my end using your fix, and they are now working correctly. Thank you again for your fix! |
Build Impact AnalysisSelective Build Targets (building these covers all 54 affected)Total affected: 54/555 targets
Affected targets (54)Directly changed (21)
Transitively affected (33)
Slow path • Graph generated from PR branch |
Build Impact AnalysisSelective Build Targets (building these covers all 54 affected)Total affected: 54/555 targets
Affected targets (54)Directly changed (21)
Transitively affected (33)
Slow path • Graph generated from PR branch |
2 similar comments
Build Impact AnalysisSelective Build Targets (building these covers all 54 affected)Total affected: 54/555 targets
Affected targets (54)Directly changed (21)
Transitively affected (33)
Slow path • Graph generated from PR branch |
Build Impact AnalysisSelective Build Targets (building these covers all 54 affected)Total affected: 54/555 targets
Affected targets (54)Directly changed (21)
Transitively affected (33)
Slow path • Graph generated from PR branch |
Build Impact AnalysisSelective Build Targets (building these covers all 54 affected)Total affected: 54/556 targets
Affected targets (54)Directly changed (20)
Transitively affected (34)
Slow path • Graph generated from PR branch |
…or#16456) Summary: In some cases the file_offset might be 0 if its relative to the end of PAR1. Tests: Added new test in ParquetReader.cpp, it tests both the changes in ParquetReader::filterRowGroup and ParquetData::getRowGroupRegion. Run all tests in velox/dwio Pull Request resolved: facebookincubator#16456 Reviewed By: kKPulla Differential Revision: D98980070 Pulled By: bikramSingh91 fbshipit-source-id: 58d8aaaad3cd248540d40ae103b4b02dbd969ff7
The original ThriftStreamingTransport used memcpy to copy data into contiguous buffers. The new CompactProtocolReaderWithRefill uses IOBuf chains which can be non-contiguous, causing: - Data corruption when Thrift reader crossed buffer boundaries - Incorrect byte counting for stream position tracking - Buffer management issues after reading page headers Solution Implemented 1. Buffer Coalescing (ParquetThrift.h) - Modified refiller to create a single contiguous buffer by copying: - Unconsumed bytes from current buffer - New data from stream - This mimics the original memcpy behavior, ensuring Thrift always reads from contiguous memory - Prevents data corruption when deserializing structures that span buffer boundaries 2. Correct Byte Counting (ParquetThrift.h) - Track totalBytesReadBeforeRefill (bytes consumed before refiller was called) - Save coalescedBufferStart and coalescedBufferSize before cloning - Calculate bytesConsumedFromCoalesced (position in coalesced buffer) - Total stream bytes = totalBytesReadBeforeRefill + bytesConsumedFromCoalesced - This gives accurate readBytes for calculating pageDataStart_ 3. Proper Buffer Positioning (PageReader.cpp) - After reading page header with refiller, position bufferStart_/bufferEnd_ to point to remaining data in the stream buffer - Calculate: bytes consumed from new stream = result.readBytes - - Set bufferEnd_ to end of stream data - Set bufferStart_ to skip consumed bytes in stream data - This allows readBytes() to continue reading page data without additional stream reads Key Differences from Original - Original: Used reference parameters automatically updated by memcpy - New: Uses IOBuf chains with manual coalescing and buffer management - Both: Ensure Thrift reads from contiguous memory, preventing corruption
Summary:
Removes external Thrift dependency primarily used in Parquet as noted by #13175.
Issues noted below for which we'd like to move to FBThrift from Apache Thrift.
FBThrift and Apache Thrift has many incompatible API changes:
XXX.__isset.YYY->XXX.YYY().has_value()(orbool(XXX.YYY())): The latter implicitbool(...)syntax is used in this PRXXX.__isset.YYY = true->XXX.YYY().ensure()XXX.YYY->XXX.YYY().value()(or*XXX.YYY()): The latter*...syntax is used in this PRXXX.__set_YYY(value)->XXX.YYY() = valuethrift::ThriftTransport/TCompactProcotolT->folly::IOBuf/CompactProtocolReaderoptionaland default value that is used byDataPageHeaderV2.is_compressed: We need to use the default value manually in our codeThere are many changes in this PR but there is no logic change and no refactoring. This just converts API usage for FBThrift.
(Copy of #14942 from kou due to some push permission issues.)
Differential Revision: D90723225