Chunk with multiple messages #251

Open
wants to merge 16 commits into base: branch-25.06

Conversation

@nirandaperera (Contributor) commented May 6, 2025

This PR introduces a Chunk that can hold multiple messages. For now it only moves the existing Chunk class to the new implementation, so each chunk still holds exactly one message.

This class has two buffers:

  • metadata_: The metadata buffer that contains information about the messages in the chunk and the concatenated metadata of the messages.
  • data_: The data buffer that contains the concatenated data of the messages in the chunk.

All chunk information is encoded in the metadata_ buffer, which uses the following format:

  • chunk_id: uint64_t, ID of the chunk
  • n_elements: size_t, Number of messages in the chunk
  • [partition_ids]: vector<PartID>, Partition IDs of the messages, size = n_elements
  • [expected_num_chunks]: vector<size_t>, Expected number of chunks of the messages, size = n_elements
  • [meta_offsets]: vector<uint32_t>, Prefix-sum offsets (excluding the leading 0) of the metadata sizes of the messages, size = n_elements
  • [data_offsets]: vector<uint64_t>, Prefix-sum offsets (excluding the leading 0) of the data sizes of the messages, size = n_elements
  • [concat_metadata]: vector<uint8_t>, Concatenated metadata of the messages, size = meta_offsets[n_elements - 1]
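
The layout above could be packed roughly as follows. This is only a sketch, not the code in this PR: `pack_metadata` and `append_bytes` are hypothetical names, `ChunkID = uint64_t` comes from the list above, and `PartID = uint32_t` is an assumption inferred from the 24 bytes per message quoted in the description. Values are written in the host's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed aliases (see the note above); the real definitions live in the library.
using ChunkID = std::uint64_t;
using PartID = std::uint32_t;

// Append the raw bytes of `count` values of type T to `out`.
// Reading an object through a pointer to unsigned char is allowed by the aliasing rules.
template <typename T>
void append_bytes(std::vector<std::uint8_t>& out, T const* src, std::size_t count) {
    auto const* bytes = reinterpret_cast<std::uint8_t const*>(src);
    out.insert(out.end(), bytes, bytes + count * sizeof(T));
}

// Hypothetical helper: writes the fields in the order listed in the description.
std::vector<std::uint8_t> pack_metadata(
    ChunkID chunk_id,
    std::vector<PartID> const& partition_ids,
    std::vector<std::size_t> const& expected_num_chunks,
    std::vector<std::uint32_t> const& meta_offsets,
    std::vector<std::uint64_t> const& data_offsets,
    std::vector<std::uint8_t> const& concat_metadata
) {
    std::size_t const n = partition_ids.size();
    // Every per-message vector must have exactly n_elements entries.
    assert(expected_num_chunks.size() == n);
    assert(meta_offsets.size() == n);
    assert(data_offsets.size() == n);
    // The last metadata offset is the total size of the concatenated metadata.
    std::size_t const expected_meta = (n == 0) ? 0 : meta_offsets.back();
    assert(concat_metadata.size() == expected_meta);

    std::vector<std::uint8_t> out;
    append_bytes(out, &chunk_id, 1);                   // chunk_id
    append_bytes(out, &n, 1);                          // n_elements
    append_bytes(out, partition_ids.data(), n);        // [partition_ids]
    append_bytes(out, expected_num_chunks.data(), n);  // [expected_num_chunks]
    append_bytes(out, meta_offsets.data(), n);         // [meta_offsets]
    append_bytes(out, data_offsets.data(), n);         // [data_offsets]
    append_bytes(out, concat_metadata.data(), concat_metadata.size());
    return out;
}
```

Because the fixed-size fields come first, a reader can locate every per-message array from `n_elements` alone.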

For a chunk with N messages and M bytes of concatenated metadata, the size of the metadata_ buffer is sizeof(ChunkID) + sizeof(size_t) + N * (sizeof(PartID) + sizeof(size_t) + sizeof(uint32_t) + sizeof(uint64_t)) + M = 16 + 24 * N + M bytes.

For a chunk with a single control message, the size of the metadata_ buffer is sizeof(ChunkID) + sizeof(PartID) + 2 * sizeof(size_t) + sizeof(uint32_t) + sizeof(uint64_t) = 40 bytes.

For a chunk with a single message with M bytes of metadata, the size of the metadata_ buffer is sizeof(ChunkID) + sizeof(PartID) + 2 * sizeof(size_t) + sizeof(uint32_t) + sizeof(uint64_t) + M = 40 + M bytes.
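
The size arithmetic can be checked at compile time with a small sketch. `metadata_buffer_size` is a hypothetical helper, and the aliases are assumptions: `ChunkID = uint64_t` as listed in the format, and `PartID = uint32_t`, which is what the 24 bytes per message imply on a platform with an 8-byte `size_t`.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed aliases; the real definitions live in the library.
using ChunkID = std::uint64_t;
using PartID = std::uint32_t;

// Size in bytes of the metadata_ buffer for n_messages messages carrying
// concat_metadata_bytes bytes of concatenated metadata in total.
constexpr std::size_t metadata_buffer_size(
    std::size_t n_messages, std::size_t concat_metadata_bytes
) {
    return sizeof(ChunkID) + sizeof(std::size_t)
           + n_messages
                 * (sizeof(PartID) + sizeof(std::size_t) + sizeof(std::uint32_t)
                    + sizeof(std::uint64_t))
           + concat_metadata_bytes;
}

// The byte counts quoted in the description only hold for an 8-byte size_t.
static_assert(sizeof(std::size_t) == 8, "sketch assumes a 64-bit size_t");
static_assert(metadata_buffer_size(1, 0) == 40, "single control message");
static_assert(metadata_buffer_size(1, 100) == 40 + 100, "single message, M = 100");
static_assert(metadata_buffer_size(3, 100) == 16 + 24 * 3 + 100, "N = 3, M = 100");
```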

Signed-off-by: niranda perera <[email protected]>
@pentschev (Member)

If you're merging branch-25.08 into your PR you need to retarget it to branch-25.08 as well. Was that intended @nirandaperera ?

@nirandaperera force-pushed the multi-packed-data-chunk branch from fd59be4 to aba10c0 on May 6, 2025 21:01
@nirandaperera (Contributor, Author)

> If you're merging branch-25.08 into your PR you need to retarget it to branch-25.08 as well. Was that intended @nirandaperera ?

Thanks @pentschev. It was a mistake. I have force-pushed the fix now.

Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
@nirandaperera added the breaking (Introduces a breaking change) and improvement (Improves an existing functionality) labels May 6, 2025
@nirandaperera (Contributor, Author)

@wence- @madsbk this PR has the "new" Chunk API, with scaffolding for housing multiple messages. I haven't renamed it to Chunk yet, because I felt the API is cleaner to review this way. I will replace Chunk once this comes out of draft.

Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
/**
* @brief Chunk with multiple messages.
*
* This class will have two buffers:
Contributor

Suggested change
- * This class will have two buffers:
+ * This class has two buffers:

* - data_: The data buffer that contains the concatenated data of the messages in the
* chunk.
*
* The metadata_ buffer will have the following format:
Contributor

Suggested change
- * The metadata_ buffer will have the following format:
+ * The metadata_ buffer has the following format:

Comment on lines 45 to 47
* - [psum_meta]: std::vector<uint32_t>, Prefix sums (excluding 0) of the metadata
* sizes of the messages, size = n_elements
* - [psum_data]: std::vector<uint64_t>, Prefix sums (excluding 0) of the data sizes of
Contributor

Can we call these something like metadata_offsets and data_offsets, respectively?
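
For context on the naming question: since the stored values are prefix sums without the leading 0, entry i is the end offset of message i, and the start offset is the previous entry (or 0 for the first message). A minimal sketch, where `message_range` is a hypothetical helper and not part of this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the [begin, end) byte range of message i, given offsets stored as
// prefix sums of the message sizes with the leading 0 omitted.
template <typename Offset>
std::pair<Offset, Offset> message_range(
    std::vector<Offset> const& offsets, std::size_t i
) {
    assert(i < offsets.size());
    // The first message starts at 0; every other one starts where the previous ended.
    Offset const begin = (i == 0) ? Offset{0} : offsets[i - 1];
    return {begin, offsets[i]};
}
```

With message sizes 10, 0 and 15 the stored offsets are {10, 10, 25}, so message 2 occupies bytes [10, 25) and the empty message 1 maps to the empty range [10, 10).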

* @return The number of messages in the chunk.
*/
inline size_t n_messages() const {
return *reinterpret_cast<size_t*>(metadata_->data() + sizeof(ChunkID));
Contributor

All these reinterpret_cast type-punning approaches break strict-aliasing rules unfortunately.

What we should do is:

  1. (C++20) Use std::bit_cast (but it's messy because we're carrying around std::byte, and bit_cast takes values, not pointer + size).
  2. Use memcpy (the compiler will optimise this).

So, this would be (for example):

inline size_t n_messages() const {
    size_t result;
    memcpy(&result, metadata_->data() + sizeof(ChunkID), sizeof(result));
    return result;
}
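
The same pattern can be wrapped once and reused for every field. The following is a sketch only: `read_at` is a hypothetical helper, and it assumes the buffer is a `std::vector<std::uint8_t>` as in the `from_serialized_buf` signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Reads a T stored at byte `offset` in `buffer` without violating strict aliasing.
// The source address does not need to be aligned for T.
template <typename T>
T read_at(std::vector<std::uint8_t> const& buffer, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    // Guard against reading past the end of the buffer.
    assert(offset <= buffer.size() && sizeof(T) <= buffer.size() - offset);
    T value;
    std::memcpy(&value, buffer.data() + offset, sizeof(T));
    return value;
}
```

With such a helper, the accessor above would reduce to something like `return read_at<size_t>(*metadata_, sizeof(ChunkID));`, assuming `metadata_` points to a byte vector.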

Contributor Author

I see. TIL. Sure, I will change to memcpy.

Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
@nirandaperera (Contributor, Author)

/ok to test

Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
@nirandaperera requested a review from wence- May 8, 2025 22:46
@madsbk (Member) left a comment

Overall looks good!
@nirandaperera, let's prioritize getting full chunk support. I think it will have a significant impact!

* is a control message, otherwise zero (data message).
*/
inline size_t expected_num_chunks(size_t i) const {
return expected_num_chunks_[i];
Member

Suggested change
- return expected_num_chunks_[i];
+ return expected_num_chunks_.at(i);

* @return True if the message is a control message, false otherwise.
*/
inline bool is_control_message(size_t i) const {
return expected_num_chunks_[i] > 0;
Member

Suggested change
- return expected_num_chunks_[i] > 0;
+ return expected_num_chunks(i) > 0;

*/
[[nodiscard]] std::unique_ptr<cudf::table> unpack(rmm::cuda_stream_view stream) const;
static Chunk from_serialized_buf(
Member

Suggested change
- static Chunk from_serialized_buf(
+ static Chunk from_serialized_buffer(

return {chunk_id, 1, {part_id}, {expected_num_chunks}, {0}, {0}};
}

Chunk Chunk::from_serialized_buf(std::vector<uint8_t> const& msg, bool validate) {
Member

Suggested change
- Chunk Chunk::from_serialized_buf(std::vector<uint8_t> const& msg, bool validate) {
+ Chunk Chunk::from_serialized_buffer(std::vector<uint8_t> const& msg, bool validate) {

@nirandaperera marked this pull request as ready for review May 9, 2025 15:16
@nirandaperera requested a review from a team as a code owner May 9, 2025 15:16
Labels: breaking (Introduces a breaking change), improvement (Improves an existing functionality)
4 participants