CQL Vector support #1165

Merged
merged 8 commits into from
May 8, 2025

Conversation

smoczy123
Contributor

@smoczy123 smoczy123 commented Jan 6, 2025

This PR adds serialization and deserialization of the CQL Vector type (as implemented in Cassandra), thereby achieving compatibility with Cassandra's Vector type. It's important to note that Cassandra implements Vector serialization and deserialization in a way that contradicts the CQL protocol: it uses [unsigned vint] instead of [int] to encode element sizes in vectors whose element type has variable length.
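
To make the difference concrete, here is a minimal sketch of the two length encodings, based on my reading of the `[int]` and `[unsigned vint]` definitions in the CQL native protocol spec. The function names `write_int_len` and `write_unsigned_vint` are hypothetical and are not the driver's API; this only illustrates the byte layout.

```rust
/// `[int]`: always 4 bytes, big-endian.
fn write_int_len(len: i32, out: &mut Vec<u8>) {
    out.extend_from_slice(&len.to_be_bytes());
}

/// `[unsigned vint]`: most significant byte first. The number of leading
/// 1 bits in the first byte says how many extra bytes follow.
/// Hypothetical helper, written for illustration only.
fn write_unsigned_vint(value: u64, out: &mut Vec<u8>) {
    // Significant bits in `value` (at least 1, so that 0 takes one byte).
    let bits = 64 - (value | 1).leading_zeros() as usize;
    // `n` total bytes carry 7 * n value bits for n <= 8; 9 bytes carry all 64.
    let size = if bits > 56 { 9 } else { (bits + 6) / 7 };
    let be = value.to_be_bytes();
    if size == 9 {
        // First byte is all ones, followed by the full 8-byte value.
        out.push(0xFF);
        out.extend_from_slice(&be);
        return;
    }
    let start = out.len();
    out.extend_from_slice(&be[8 - size..]);
    // Mark the first byte with `size - 1` leading 1 bits.
    out[start] |= !(0xFFu8 >> (size - 1));
}
```

Under this scheme a short element length such as 5 takes one byte instead of four, which is why a reader expecting `[int]` cannot parse what Cassandra writes.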

Fixes #1014

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@github-actions github-actions bot added the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Jan 6, 2025

github-actions bot commented Jan 6, 2025

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: 3180d38

@smoczy123
Contributor Author

I'm not sure this is the correct way to split this PR into commits (I'm pretty sure it isn't, as the commits won't compile), however I can't think of a proper way.

@smoczy123 smoczy123 marked this pull request as ready for review January 10, 2025 03:19
@smoczy123 smoczy123 force-pushed the vector-type branch 2 times, most recently from 6aee097 to 440d63a Compare January 13, 2025 12:38
@wprzytula wprzytula added this to the 0.16.0 milestone Jan 13, 2025
Contributor

@muzarski muzarski left a comment

I only reviewed the first commit (introduction of TypeParser)

Some general comments:

  1. The logic of TypeParser is quite complex. I suggest adding some docstrings next to the type definitions and methods. For example, I have no idea what TypeParser::from_hex does. Docstrings will also help a lot in the future in case some other developer touches this piece of code.
  2. It's worth adding some comments next to the non-intuitive parts of the code. Example:
        if name.is_empty() {
            if !self.is_eos() {
                return Err(CqlTypeParseError::AbstractTypeParseError());
            }
            return Ok(ColumnType::Blob);
        }

It's not obvious why we return Blob if name is empty. A link to the corresponding part of the original source code would be helpful.

  3. Please, add some unit tests. I saw that there is some small test of TypeParser in a later commit. I think we should add more tests and try to handle as many parsing cases as we can. In addition, I think that in this case, unit tests should be added in the same commit (they help during review - it's easier to reason about the complex code when there are some use case examples one can look at)
  4. This implementation is based on some existing (probably Java) implementation, correct? If so, please, provide the link to the source in the commit. Ideally, the link should be placed in the comments in code as well.

@smoczy123
Contributor Author

The whole TypeParser logic was ripped straight out of ScyllaDB's vector implementation. However, as it is still in development and probably won't be merged for a while, it will be hard to link to directly. IIRC there are a lot of tests there for this functionality, so they can also be borrowed.

@muzarski
Contributor

The whole TypeParser logic was ripped straight out of ScyllaDB's vector implementation. However, as it is still in development and probably won't be merged for a while, it will be hard to link to directly. IIRC there are a lot of tests there for this functionality, so they can also be borrowed.

Ok, makes sense. And let's borrow the tests in that case :)

Collaborator

@wprzytula wprzytula left a comment

💯 What a great piece of code! Thank you for the contribution!

There are quite a few comments, though.
I think that the new parser module needs many more unit tests.
Also, tests for particular errors upon serialization and deserialization of Vector are missing.

@smoczy123 smoczy123 force-pushed the vector-type branch 5 times, most recently from 23d6dad to 8e128e1 Compare January 22, 2025 22:07
@Lorak-mmk Lorak-mmk modified the milestones: 0.16.0, 1.0.0 Feb 5, 2025
@muzarski muzarski modified the milestones: 1.0.0, 1.1.0 Feb 6, 2025
@smoczy123
Contributor Author

❓ One question: could you explain this change? https://github.com/scylladb/scylla-rust-driver/compare/1eb80183b32bb74c440253534fd3ab1f57bad387..181c9f62624f8cb2974b24602b568d357e19735a

I had copied the test case and changed the expected error type, but forgot to change the test string, so the tests didn't pass. This change fixes that.

@smoczy123
Contributor Author

@piodul (I'm not sure why, but I can't reply to your comment) The issue with invalid parameter count is that any elegant way of gathering that data that I could think of (and the one you propose here) would require allocating each time we parse a vector. The solution without allocations (getting the required arguments manually and iterating through the rest if needed) is quite ugly.

@Lorak-mmk
Collaborator

@piodul (I'm not sure why, but I can't reply to your comment)

This is a weird behavior in Github UI. If, when doing a review, you respond to a comment thread belonging to some other review, then this new comment will show up in 2 places:

@piodul
Collaborator

piodul commented May 7, 2025

@piodul (I'm not sure why, but I can't reply to your comment)

This is a weird behavior in Github UI. If, when doing a review, you respond to a comment thread belonging to some other review, then this new comment will show up in 2 places:

Replied here: #1165 (comment)

@wprzytula
Collaborator

@smoczy123 Are you planning to address @piodul's nitpicks or shall we merge this as-is?

@smoczy123 smoczy123 dismissed stale reviews from piodul, Lorak-mmk, and wprzytula via a628c7a May 8, 2025 12:41
@smoczy123
Contributor Author

They should be addressed now @wprzytula

@smoczy123 smoczy123 requested review from wprzytula and piodul May 8, 2025 12:49
Collaborator

@Lorak-mmk Lorak-mmk left a comment

InvalidParameterCount(usize, usize), - can we make that a struct variant (with actual and expected fields) instead of tuple variant?

                    .collect::<Vec<_>>()
                    .try_into()
                    .map_err(
                        |v: Vec<Result<ColumnType<'result>, CustomTypeParseError>>| {
                            CustomTypeParseError::InvalidParameterCount(v.len(), 1)
                        },
                    )?;

I think the intention was to only allocate in the case of an error. Now you made this function always allocate. One way to allocate only in case of error is to do something like this:

    fn get_complex_abstract_type(
        &mut self,
        mut name: &'result str,
    ) -> Result<ColumnType<'result>, CustomTypeParseError> {
        name = name
            .strip_prefix("org.apache.cassandra.db.marshal.")
            .unwrap_or(name);

        // Calculates the real number of parameters.
        // Can be called only after verifying that get_type_parameters() returned Ok,
        // because it panics in case of error.
        // It is declared in this admittedly weird way to make it FnOnce - calling it
        // more than once would obviously be a bug.
        let calc_params_count = {
            let self_clone = CustomTypeParser {
                // Clone here is cheap since parser is just one &str.
                parser: self.parser.clone(),
            };
            move || {
                // If we made `self_clone` mut and used it here,
                // the whole closure would be FnMut.
                let mut parser = self_clone;
                parser.get_type_parameters().unwrap().count()
            }
        };

        match name {
            "ListType" => {
                let [element_type_result] = self
                    .get_type_parameters()?
                    .collect_array::<1>()
                    .ok_or_else(move || {
                        CustomTypeParseError::InvalidParameterCount(calc_params_count(), 1)
                    })?;
                let element_type = element_type_result?;
                Ok(ColumnType::Collection {
                    frozen: false,
                    typ: CollectionType::List(Box::new(element_type)),
                })
            }
            "SetType" => {
                let [element_type_result] = self
                    .get_type_parameters()?
                    .collect_array::<1>()
                    .ok_or_else(move || {
                        CustomTypeParseError::InvalidParameterCount(calc_params_count(), 1)
                    })?;
                let element_type = element_type_result?;
                Ok(ColumnType::Collection {
                    frozen: false,
                    typ: CollectionType::Set(Box::new(element_type)),
                })
            }
            "MapType" => {
                let [key_type_result, value_type_result] = self
                    .get_type_parameters()?
                    .collect_array::<2>()
                    .ok_or_else(move || {
                        CustomTypeParseError::InvalidParameterCount(calc_params_count(), 2)
                    })?;
                let key_type = key_type_result?;
                let value_type = value_type_result?;
                Ok(ColumnType::Collection {
                    frozen: false,
                    typ: CollectionType::Map(Box::new(key_type), Box::new(value_type)),
                })
            }
            "TupleType" => {
                let params = self
                    .get_type_parameters()?
                    .collect::<Result<Vec<_>, CustomTypeParseError>>()?;
                if params.is_empty() {
                    return Err(CustomTypeParseError::InvalidParameterCount(0, 1));
                }
                Ok(ColumnType::Tuple(params))
            }
            "VectorType" => {
                let (typ, len) = self.get_vector_parameters()?;
                Ok(ColumnType::Vector {
                    typ: Box::new(typ),
                    dimensions: len,
                })
            }
            "UserType" => {
                let params = self.get_udt_parameters()?;
                Ok(ColumnType::UserDefinedType {
                    frozen: false,
                    definition: Arc::new(UserDefinedType {
                        name: params.type_name.into(),
                        keyspace: params.keyspace.into(),
                        field_types: params.field_types,
                    }),
                })
            }
            name => Err(CustomTypeParseError::UnknownComplexCustomTypeName(
                name.into(),
            )),
        }
    }

Before pushing this, let's get @wprzytula's or @piodul's opinion on whether this solution makes sense.
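
For reference, the allocate-only-on-error idea can be shown without the parser at all. The sketch below uses only the standard library. `collect_exactly`, `ParseError` and `parse_pair` are hypothetical names made up for illustration and are not part of the driver; the error uses a struct variant with `actual` and `expected` fields, as asked for at the top of this comment.

```rust
/// Hypothetical error type, using a struct variant with named fields.
#[derive(Debug, PartialEq)]
enum ParseError {
    InvalidParameterCount { actual: usize, expected: usize },
}

/// Collects exactly `N` items into a stack array, with no heap allocation.
/// Returns `None` if the iterator yields fewer or more than `N` items.
fn collect_exactly<I: Iterator, const N: usize>(mut iter: I) -> Option<[I::Item; N]> {
    let slots: [Option<I::Item>; N] = std::array::from_fn(|_| iter.next());
    if slots.iter().any(Option::is_none) || iter.next().is_some() {
        return None;
    }
    Some(slots.map(|slot| slot.expect("all slots verified to be Some")))
}

/// Parses a comma-separated parameter list that must hold exactly two items.
fn parse_pair(params: &str) -> Result<[&str; 2], ParseError> {
    // Building the iterator is cheap, so the error path simply rebuilds it
    // to count the parameters; the success path never counts or allocates.
    let split = move || params.split(',').map(str::trim);
    collect_exactly::<_, 2>(split()).ok_or_else(|| ParseError::InvalidParameterCount {
        actual: split().count(),
        expected: 2,
    })
}
```

With this shape, `parse_pair("int, text")` returns both names in an array, while `parse_pair("int")` reports `actual: 1, expected: 2`. The extra pass over the input happens only when the count is wrong.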

@Lorak-mmk
Collaborator

Lorak-mmk commented May 8, 2025

The .clone() can probably be omitted from my code since ParserState is Copy.

smoczy123 added 7 commits May 8, 2025 16:19
This is needed to deserialize vector metadata
as it is implemented as a Custom type with
VectorType as its class

Due to the fact that Cassandra implements
variable type length vectors in a way
that contradicts the CQL protocol, special
care must be given when deserializing them
as sizes of their elements are encoded as
unsigned vint instead of an int

This is needed for serialization of vectors
as they either don't write the size of elements
or write it weirdly.

Similarly to the deserialization commit, special care
must be given when serializing variable type length
vectors, as sizes of their elements must be written
as an unsigned varint
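
As an illustration of the element framing these commit messages describe, here is a rough sketch of splitting the raw bytes of a vector whose elements have variable length. It assumes each element is prefixed with its size as an unsigned vint (my understanding of Cassandra's behaviour, not something taken from this PR's code). `read_unsigned_vint` and `split_var_len_elements` are hypothetical names, not the driver's API.

```rust
/// Reads one `[unsigned vint]` from the front of `buf` and advances it.
/// The count of leading 1 bits in the first byte is the number of extra bytes.
fn read_unsigned_vint<'a>(buf: &mut &'a [u8]) -> Option<u64> {
    let input: &'a [u8] = *buf;
    let (&first, rest) = input.split_first()?;
    let extra = first.leading_ones() as usize;
    if rest.len() < extra {
        return None;
    }
    let (tail, remaining) = rest.split_at(extra);
    // With 8 extra bytes the first byte carries no value bits at all.
    let mask = 0xFFu8.checked_shr(extra as u32).unwrap_or(0);
    let mut value = u64::from(first & mask);
    for &byte in tail {
        value = (value << 8) | u64::from(byte);
    }
    *buf = remaining;
    Some(value)
}

/// Splits a serialized vector of `dimensions` variable-length elements
/// into per-element byte slices. Returns `None` on malformed input.
fn split_var_len_elements<'a>(mut buf: &'a [u8], dimensions: usize) -> Option<Vec<&'a [u8]>> {
    let mut elements = Vec::with_capacity(dimensions);
    for _ in 0..dimensions {
        let len = read_unsigned_vint(&mut buf)?;
        if (buf.len() as u64) < len {
            return None;
        }
        // Safe cast: `len` was just checked against the slice length.
        let (element, rest) = buf.split_at(len as usize);
        elements.push(element);
        buf = rest;
    }
    // Trailing bytes mean the declared dimension count was wrong.
    buf.is_empty().then_some(elements)
}
```

For a two-element text vector holding "ab" and "c", the input would be `[0x02, b'a', b'b', 0x01, b'c']`. Vectors of fixed-size element types carry no such prefixes and are not covered by this sketch.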
@wprzytula wprzytula requested a review from Lorak-mmk May 8, 2025 14:31
@wprzytula wprzytula merged commit 18f5e39 into scylladb:main May 8, 2025
13 checks passed
@wprzytula wprzytula mentioned this pull request May 26, 2025
Development

Successfully merging this pull request may close these issues.

CQL Vector type
7 participants