Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CQL Vector support #1165

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

CQL Vector support #1165

wants to merge 7 commits into from

Conversation

smoczy123
Copy link
Contributor

@smoczy123 smoczy123 commented Jan 6, 2025

This PR adds serialization and deserialization of CQL Vector (as implemented in Cassandra) therefore achieving compatibility with Cassandra's Vector type. It's important to note that Cassandra implements Vector serialization and deserialization in a way that
contradicts the CQL protocol, using [unsigned vint] instead of [int] as the element size encoding for variable type length vectors.

Fixes #1014

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@github-actions github-actions bot added the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Jan 6, 2025
Copy link

github-actions bot commented Jan 6, 2025

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: 322cb09

@smoczy123
Copy link
Contributor Author

I'm not sure this is the correct way to split this PR into commits (I'm pretty sure it isn't, as the commits won't compile), however I can't think of a proper way.

@smoczy123 smoczy123 marked this pull request as ready for review January 10, 2025 03:19
@smoczy123 smoczy123 force-pushed the vector-type branch 2 times, most recently from 6aee097 to 440d63a Compare January 13, 2025 12:38
@wprzytula wprzytula added this to the 0.16.0 milestone Jan 13, 2025
Copy link
Contributor

@muzarski muzarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the first commit (introduction of TypeParser)

Some general comments:

  1. The logic of TypeParser is quite complex. I suggest adding some docstrings next to the type definitions and methods. For example, I have no idea what TypeParser::from_hex does. Docstrings will also help a lot in the future in case some other developer touches this piece of code.
  2. It's worth adding some comments next to the non-intuitive parts of the code. Example:
        if name.is_empty() {
            if !self.is_eos() {
                return Err(CqlTypeParseError::AbstractTypeParseError());
            }
            return Ok(ColumnType::Blob);
        }

It's not obvious why we return Blob if name is empty. A link to the corresponding part of original source code would be helpful.

  1. Please, add some unit tests. I saw that there is some small test of TypeParser in a later commit. I think we should add more tests and try to handle as many parsing cases as we can. In addition, I think that in this case, unit tests should be added in the same commit (they help during review - it's easier to reason about the complex code when there are some use case examples one can look at)
  2. This implementation is based on some existing (probably Java) implementation, correct? If so, please, provide the link to the source in the commit. Ideally, the link should be placed in the comments in code as well.

Comment on lines 6 to 13
type UDTParameters<'result> = (
Cow<'result, str>,
Cow<'result, str>,
Vec<(Cow<'result, str>, ColumnType<'result>)>,
);
Copy link
Contributor

@muzarski muzarski Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer to have it as a struct instead of a type alias. Cow<'result, str> type appears twice and it's hard to reason about it without explicit field names.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer to have it as a struct instead of a newtype.

Actually, a struct is called a newtype. type is a type alias, which is not a new type, yet just a new name for an existing type.

Comment on lines 12 to 19
pub(crate) struct TypeParser<'result> {
pos: usize,
str: Cow<'result, str>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Probably need to rename it to CustomTypeParser (or AbstractTypeParser if we decide to stick to abstract naming convention). Same goes for the name of the module - type_parser.rs is not specific enough IMO.

Comment on lines 283 to 407
if !self.is_eos() && self.str.as_bytes()[self.pos] == b':' {
self.pos += 1;
let _ = usize::from_str_radix(&name, 16)
.map_err(|_| CqlTypeParseError::AbstractTypeParseError());
name = self.read_next_identifier();
}
Copy link
Contributor

@muzarski muzarski Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this part do? Is it tested somewhere?

@smoczy123
Copy link
Contributor Author

Whole TypeParser logic was ripped straight out of ScyllaDB's vector implementation, however, as it still in development and probably won't be merged for a while, it will be hard to link directly. IIRC there is a lot of tests there for this functionality, so thay also can be borrowed.

@muzarski
Copy link
Contributor

Whole TypeParser logic was ripped straight out of ScyllaDB's vector implementation, however, as it still in development and probably won't be merged for a while, it will be hard to link directly. IIRC there is a lot of tests there for this functionality, so thay also can be borrowed.

Ok, makes sense. And let's borrow the tests in such case :)

Copy link
Collaborator

@wprzytula wprzytula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 What a great piece of code! Thank you for the contribution!

There are quite many comments, though.
I think that the new parser module needs much more unit tests.
Also, tests for particular errors upon serialization and deserialization of Vector are missing.

Comment on lines 6 to 13
type UDTParameters<'result> = (
Cow<'result, str>,
Cow<'result, str>,
Vec<(Cow<'result, str>, ColumnType<'result>)>,
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer to have it as a struct instead of a newtype.

Actually, a struct is called a newtype. type is a type alias, which is not a new type, yet just a new name for an existing type.

Comment on lines 77 to 81
pub struct CellWriter<'buf> {
buf: &'buf mut Vec<u8>,
cell_len: Option<usize>,
size_as_uvarint: bool,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lorak-mmk As you know the serialization framework quite well, could you please aid in review of this commit?

Comment on lines 305 to 317
if let Some(buffer) = self.variable_length_buffer {
let value_len = buffer.len();
let mut len = Vec::new();
types::unsigned_vint_encode(value_len as u64, &mut len);
self.buf.extend_from_slice(&len);
self.buf.extend_from_slice(&buffer);
} else {
let value_len: i32 = (self.buf.len() - self.starting_pos - 4)
.try_into()
.map_err(|_| CellOverflowError)?;
self.buf[self.starting_pos..self.starting_pos + 4]
.copy_from_slice(&value_len.to_be_bytes());
}
Copy link
Collaborator

@wprzytula wprzytula Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 The logic around Cell* got much convoluted with these alterations. Let's think how we can make it more digestible. @Lorak-mmk @muzarski

@smoczy123 smoczy123 force-pushed the vector-type branch 5 times, most recently from 23d6dad to 8e128e1 Compare January 22, 2025 22:07
@Lorak-mmk Lorak-mmk modified the milestones: 0.16.0, 1.0.0 Feb 5, 2025
@muzarski muzarski modified the milestones: 1.0.0, 1.1.0 Feb 6, 2025
@github-actions github-actions bot removed the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Mar 17, 2025
@smoczy123 smoczy123 force-pushed the vector-type branch 4 times, most recently from b572171 to 62f008a Compare March 22, 2025 18:59
@smoczy123
Copy link
Contributor Author

As of right now all changes requested above should be implemented, the only thing left is the lifetime issue with the new_type_parser in deser_type_generic. @wprzytula @Lorak-mmk

@Lorak-mmk
Copy link
Collaborator

As of right now all changes requested above should be implemented, the only thing left is the lifetime issue with the new_type_parser in deser_type_generic. @wprzytula @Lorak-mmk

I see a lot of open review comments. Please go over them and:

  • If you are sure that you fixed the issue that the comment pointed out, mark it as resolved
  • If you did not, or you are not sure, then respond to the comment to keep the discussion going
  • If the comment does not point out any issue with your code (like CQL Vector support #1165 (comment) ) then you can ignore it.

Copy link
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left just 2 comments. I'll continue the review after the second one is addressed.

Copy link
Collaborator

@Lorak-mmk Lorak-mmk Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit: frame: added a parser for Java-class type names

You introduced both new_type_parser.rs and type_parser.rs.

  • Both contain UDTParameters struct, and TypeParser struct with pub(crate) parse method.
  • There is no module-level comment on those files
  • No comments on the aforementioned structs and method.
  • type_parser seems unused (Does make ci pass on this commit?).
  • Commit message says nothing about those modules and structs.

How I as a reviewer am supposed to understand the purpose of those modules and structures? Please improve it by addressing some of the points above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of those do the same thing, parse the Java class type name, however the new_type_parser uses the parse.rs module that was already present in the project, however due to lifetime issues I was not able to make it work correctly. So right now there are two versions of the parser, once we either find a way to fix the lifetime issues or decide to go back to type_parser.rs (that would be quite sad, we would duplicate a lot of parsing code) the other version will be deleted. cc @wprzytula

@smoczy123 smoczy123 force-pushed the vector-type branch 2 times, most recently from 85ef4e7 to 8165b53 Compare April 2, 2025 13:19
@smoczy123
Copy link
Contributor Author

I have closed the resolved conversations, all of the issues in them should be fixed. I have decided to not touch the bigger issues with the type_parser.rs right now, as if we decide to throw it away it would be a waste of time right now. cc @Lorak-mmk

@wprzytula wprzytula requested a review from Lorak-mmk April 2, 2025 13:27
This is needed to deserialize vector metadata
as it is implemented as a Custom type with
VectorType as its class
Due to the fact that Cassandra implements
variable type length vectors in a way
that contradicts the CQL protocol, special
care must be given when deserializing them
as sizes of their elements are encoded as
unsigned vint instead of an int
Similarly to the previous commit, special care
must be given when serializing variable type length
vectors, as sizes of their elements must be written
as an unsigned varint
@swasik
Copy link

swasik commented Apr 3, 2025

I have decided to not touch the bigger issues with the type_parser.rs right now, as if we decide to throw it away it would be a waste of time right now.

If we decide not to fix it right now, could you create the issue with the whole connected covered so that we don't forget about it in the future?

@Lorak-mmk Lorak-mmk modified the milestones: 1.1.0, 1.2.0 Apr 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CQL Vector type
6 participants