Chunk with multiple messages #251
base: branch-25.06
Signed-off-by: niranda perera <[email protected]>
If you're merging branch-25.08 into your PR, you need to retarget it to branch-25.08 as well. Was that intended, @nirandaperera?
…ti-packed-data-chunk
Signed-off-by: niranda perera <[email protected]>
fd59be4 to aba10c0
Thanks @pentschev. It was a mistake. I force-pushed the changes now.
/**
 * @brief Chunk with multiple messages.
 *
 * This class will have two buffers:
Suggested change:
- * This class will have two buffers:
+ * This class has two buffers:
 * - data_: The data buffer that contains the concatenated data of the messages in the
 * chunk.
 *
 * The metadata_ buffer will have the following format:
Suggested change:
- * The metadata_ buffer will have the following format:
+ * The metadata_ buffer has the following format:
 * - [psum_meta]: std::vector<uint32_t>, Prefix sums (excluding 0) of the metadata
 * sizes of the messages, size = n_elements
 * - [psum_data]: std::vector<uint64_t>, Prefix sums (excluding 0) of the data sizes of
Can we call these something like metadata_offsets and data_offsets, respectively?
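To make the "prefix sums (excluding 0)" wording in the doc comment concrete, here is a minimal sketch of how such end offsets could be built from per-message sizes, and how one message's size is recovered from them. The helper names make_offsets and message_size are hypothetical and are not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: inclusive prefix sums of the per-message sizes.
// The leading 0 is not stored, so offsets[i] is the END of message i and
// offsets.back() is the total size of the concatenated payload.
std::vector<std::uint64_t> make_offsets(std::vector<std::uint64_t> const& sizes) {
    std::vector<std::uint64_t> offsets;
    offsets.reserve(sizes.size());
    std::uint64_t running = 0;
    for (auto s : sizes) {
        running += s;
        offsets.push_back(running);
    }
    return offsets;
}

// Hypothetical helper: size of message i, recovered from the end offsets.
// Message 0 starts at the implicit offset 0.
std::uint64_t message_size(std::vector<std::uint64_t> const& offsets, std::size_t i) {
    assert(i < offsets.size());
    std::uint64_t const begin = (i == 0) ? 0 : offsets[i - 1];
    return offsets[i] - begin;
}
```

With sizes {10, 0, 5} the stored offsets would be {10, 10, 15}, so a zero-sized message shows up as two equal consecutive offsets.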
 * @return The number of messages in the chunk.
 */
inline size_t n_messages() const {
    return *reinterpret_cast<size_t*>(metadata_->data() + sizeof(ChunkID));
All these reinterpret_cast type-punning approaches break strict-aliasing rules, unfortunately.
What we should do is one of:
- (C++20) use std::bit_cast (but it's messy because we're carrying around std::byte, and bit_cast takes values, not pointer + size).
- Use memcpy (the compiler will optimise this).
So, this would be (for example):
inline size_t n_messages() const {
    size_t result;
    memcpy(&result, metadata_->data() + sizeof(ChunkID), sizeof(result));
    return result;
}
I see. TIL. Sure, I will change to memcpy.
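As a rough illustration of the memcpy approach agreed on above, the sketch below wraps the copy in a small template so every header field can be read the same way. The names read_at and write_at are hypothetical, and ChunkID is assumed to be a 64-bit integer to match the chunk_id field; none of this is taken from the PR's final code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

using ChunkID = std::uint64_t;  // assumption: matches the 8-byte chunk_id field

// Hypothetical helper: copy a trivially copyable T out of a byte buffer.
// std::memcpy avoids the strict-aliasing problem of reinterpret_cast and
// also works when `offset` is not suitably aligned for T.
template <typename T>
T read_at(std::vector<std::uint8_t> const& buf, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    assert(offset + sizeof(T) <= buf.size());
    T value;
    std::memcpy(&value, buf.data() + offset, sizeof(T));
    return value;
}

// Hypothetical counterpart: copy a trivially copyable T into a byte buffer.
template <typename T>
void write_at(std::vector<std::uint8_t>& buf, std::size_t offset, T const& value) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    assert(offset + sizeof(T) <= buf.size());
    std::memcpy(buf.data() + offset, &value, sizeof(T));
}
```

Under these assumptions, n_messages() would reduce to something like read_at<std::size_t>(buffer, sizeof(ChunkID)).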
…ti-packed-data-chunk Signed-off-by: niranda perera <[email protected]>
/ok to test
Overall looks good!
@nirandaperera, let's prioritize getting full chunk support. I think it will have a significant impact!
 * is a control message, otherwise zero (data message).
 */
inline size_t expected_num_chunks(size_t i) const {
    return expected_num_chunks_[i];
Suggested change:
- return expected_num_chunks_[i];
+ return expected_num_chunks_.at(i);
 * @return True if the message is a control message, false otherwise.
 */
inline bool is_control_message(size_t i) const {
    return expected_num_chunks_[i] > 0;
Suggested change:
- return expected_num_chunks_[i] > 0;
+ return expected_num_chunks(i) > 0;
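The two suggestions above fit together: with .at(i) in the accessor and is_control_message delegating to it, an out-of-range index throws std::out_of_range instead of being undefined behaviour. A minimal sketch follows, using a hypothetical stand-in class ChunkSketch and assuming expected_num_chunks_ is a std::vector<size_t>, as the PR description suggests.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reviewed class; only the two accessors
// discussed in the review are modelled here.
class ChunkSketch {
  public:
    explicit ChunkSketch(std::vector<std::size_t> expected)
        : expected_num_chunks_(std::move(expected)) {}

    // Bounds-checked access: throws std::out_of_range for an invalid index.
    std::size_t expected_num_chunks(std::size_t i) const {
        return expected_num_chunks_.at(i);
    }

    // Delegates to the accessor, so it inherits the bounds check.
    bool is_control_message(std::size_t i) const {
        return expected_num_chunks(i) > 0;
    }

  private:
    std::vector<std::size_t> expected_num_chunks_;
};
```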
 */
[[nodiscard]] std::unique_ptr<cudf::table> unpack(rmm::cuda_stream_view stream) const;
static Chunk from_serialized_buf(
Suggested change:
- static Chunk from_serialized_buf(
+ static Chunk from_serialized_buffer(
cpp/src/shuffler/chunk.cpp
Outdated
    return {chunk_id, 1, {part_id}, {expected_num_chunks}, {0}, {0}};
}

Chunk Chunk::from_serialized_buf(std::vector<uint8_t> const& msg, bool validate) {
Suggested change:
- Chunk Chunk::from_serialized_buf(std::vector<uint8_t> const& msg, bool validate) {
+ Chunk Chunk::from_serialized_buffer(std::vector<uint8_t> const& msg, bool validate) {
…ti-packed-data-chunk
Chunk with multiple messages. This PR only moves the existing Chunk class to the new implementation, so for now a chunk holds only 1 message.

This class has two buffers:
- metadata_: The metadata buffer that contains information about the messages in the chunk and the concatenated metadata of the messages.
- data_: The data buffer that contains the concatenated data of the messages in the chunk.

All the chunk information is encoded in the metadata_ buffer, which uses the following format:
- chunk_id: uint64_t, ID of the chunk
- n_elements: size_t, Number of messages in the chunk
- [partition_ids]: vector<PartID>, Partition IDs of the messages, size = n_elements
- [expected_num_chunks]: vector<size_t>, Expected number of chunks of the messages, size = n_elements
- [meta_offsets]: vector<uint32_t>, Offsets (prefix sums of the metadata sizes, excluding the leading 0) of the messages, size = n_elements
- [data_offsets]: vector<uint64_t>, Offsets (prefix sums of the data sizes, excluding the leading 0) of the messages, size = n_elements
- [concat_metadata]: vector<uint8_t>, Concatenated metadata of the messages, size = meta_offsets[n_elements - 1]

For a chunk with N messages and M bytes of concatenated metadata, the size of the metadata_ buffer is
sizeof(ChunkID) + sizeof(size_t) + N * (sizeof(PartID) + sizeof(size_t) + sizeof(uint32_t) + sizeof(uint64_t)) + M = 16 + N * 24 + M
bytes.

For a chunk with a single control message, the size of the metadata_ buffer is
sizeof(ChunkID) + sizeof(PartID) + 2 * sizeof(size_t) + sizeof(uint32_t) + sizeof(uint64_t) = 40
bytes.

For a chunk with a single message with M bytes of metadata, the size of the metadata_ buffer is
sizeof(ChunkID) + sizeof(PartID) + 2 * sizeof(size_t) + sizeof(uint32_t) + sizeof(uint64_t) + M = 40 + M
bytes.
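The size arithmetic above can be checked with a small sketch. It assumes ChunkID is uint64_t (as stated for chunk_id), PartID is a 4-byte integer (implied by the 24-bytes-per-message term), and a 64-bit size_t; the function name metadata_buffer_size is hypothetical and not taken from the PR.

```cpp
#include <cstddef>
#include <cstdint>

using ChunkID = std::uint64_t;  // from the format: chunk_id is uint64_t
using PartID = std::uint32_t;   // assumption: 4 bytes, implied by the 24 bytes per message

// The 16 + 24 * N + M figure only holds when size_t is 8 bytes wide.
static_assert(sizeof(std::size_t) == 8, "sketch assumes a 64-bit size_t");

// Hypothetical helper: size in bytes of the metadata_ buffer for a chunk with
// n_messages messages and concat_metadata_bytes bytes of concatenated metadata.
constexpr std::size_t metadata_buffer_size(std::size_t n_messages,
                                           std::size_t concat_metadata_bytes) {
    // chunk_id + n_elements
    constexpr std::size_t header = sizeof(ChunkID) + sizeof(std::size_t);
    // partition id + expected_num_chunks + meta offset + data offset
    constexpr std::size_t per_message = sizeof(PartID) + sizeof(std::size_t)
                                        + sizeof(std::uint32_t) + sizeof(std::uint64_t);
    return header + n_messages * per_message + concat_metadata_bytes;
}

// Single control message: no concatenated metadata.
static_assert(metadata_buffer_size(1, 0) == 40, "single control message is 40 bytes");
```

For example, a single data message with 100 bytes of metadata would need 140 bytes under these assumptions.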