Implement vector<> data type #2

Closed
Jadw1 opened this issue Oct 31, 2024 · 3 comments · Fixed by #5
Labels
enhancement New feature or request

Comments


Jadw1 commented Oct 31, 2024

Add support for the vector type.
A vector is a fixed-length collection with a specified element type, e.g. VECTOR<INT, 5>.

The implementation should:

  • extend CQL syntax
  • adjust built-in type hierarchy and implement all necessary abstractions
  • add support for serialization/deserialization
  • other required adjustments...
  • add unit tests similar to other data types

As a result, a user should be able to use the vector type in the same way as any other data type.

Note:
None of Scylla's drivers supports the vector type. Until the driver team (or we) add this functionality, we will probably have to use Cassandra's driver.

Apache Cassandra issue: https://issues.apache.org/jira/browse/CASSANDRA-18504
A patch adding another data type to Scylla (some of the code might be outdated): scylladb@509626f

@Jadw1 Jadw1 changed the title Implement vector data type Implement vector<> data type Oct 31, 2024

piodul commented Oct 31, 2024

Link to the specification of the protocol (explains how vector values should be serialized): https://github.com/apache/cassandra/blob/15ed18e9d49f48e88f40b90c156248b8b697c7e2/doc/native_protocol_v5.spec#L1210-L1215
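To make the serialization step concrete, here is a minimal sketch for VECTOR<INT, N>. It assumes that fixed-size elements are written back to back as 4-byte big-endian integers, with no element count and no per-element length prefix; please verify this against the linked spec. The function names are hypothetical and are not part of Scylla.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a Scylla API): serialize a VECTOR<INT, N> value.
// Assumption: elements are concatenated as 4-byte big-endian integers with
// no element count and no per-element length prefix.
std::vector<std::uint8_t> serialize_int_vector(const std::vector<std::int32_t>& values,
                                               std::size_t dimension) {
    if (values.size() != dimension) {
        throw std::invalid_argument("wrong number of elements for vector type");
    }
    std::vector<std::uint8_t> out;
    out.reserve(dimension * 4);
    for (std::int32_t v : values) {
        const auto u = static_cast<std::uint32_t>(v);
        out.push_back(static_cast<std::uint8_t>(u >> 24));
        out.push_back(static_cast<std::uint8_t>(u >> 16));
        out.push_back(static_cast<std::uint8_t>(u >> 8));
        out.push_back(static_cast<std::uint8_t>(u));
    }
    // Postcondition: the size depends only on the dimension.
    assert(out.size() == dimension * 4);
    return out;
}

// Hypothetical inverse of serialize_int_vector, under the same assumption.
std::vector<std::int32_t> deserialize_int_vector(const std::vector<std::uint8_t>& bytes,
                                                 std::size_t dimension) {
    if (bytes.size() != dimension * 4) {
        throw std::invalid_argument("unexpected byte length for vector type");
    }
    std::vector<std::int32_t> values;
    values.reserve(dimension);
    for (std::size_t i = 0; i < dimension; ++i) {
        const std::uint32_t u = (static_cast<std::uint32_t>(bytes[4 * i]) << 24)
                              | (static_cast<std::uint32_t>(bytes[4 * i + 1]) << 16)
                              | (static_cast<std::uint32_t>(bytes[4 * i + 2]) << 8)
                              | static_cast<std::uint32_t>(bytes[4 * i + 3]);
        values.push_back(static_cast<std::int32_t>(u));
    }
    return values;
}
```

Because the dimension is part of the type, the reader can split the buffer without any stored count; variable-length element types would need per-element sizes, which this sketch does not cover.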

@QuerthDP QuerthDP added the enhancement New feature or request label Nov 14, 2024

piodul commented Nov 19, 2024

I propose structuring the implementation as follows. The safest approach, in my opinion, is to go through the phases and implement them in order, revisiting earlier ones if adjustments turn out to be needed. The pieces are coupled tightly enough that I don't think splitting the work into smaller PRs makes sense.

Introduce support for representing the vector datatype internally

This will be a fairly large part, and it requires a thorough reading of the code in the ./types/ directory (some relevant files live in other directories as well).

The goal is to be able to represent vector types internally in Scylla. The most important components are data_value and abstract_type, both defined in ./types/types.hh.

  • data_value is a type which can represent any value that CQL allows. It can hold a value of an arbitrary CQL type (e.g. int, blob, set, etc.).
  • abstract_type is a dynamic representation of a CQL type.

First, I recommend getting familiar with what this support looks like for a "native" type (i.e. a type that is not a collection), and then looking at how lists and sets are supported. Look at the definition and implementation of the following:

  • int32_type_impl (defined in ./concrete_types.hh)
  • list_type_impl (defined in ./types/list_type_impl.hh)
  • data_type (defined in ./types/types.hh)

At this point, you can implement a vector_type_impl and extend data_type so that you can create a data_type which is a vector, and you can get a vector back out of it (via visit).
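A toy model of the idea, for orientation only. The classes below are simplified stand-ins that borrow names from this comment; they are not Scylla's real definitions, and the members get_elements_type, get_dimension and cql3_type_name as written here are assumptions for the sake of the sketch. It shows a type object that carries an element type and a dimension, plus a visit helper that recovers the concrete type.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Toy stand-ins, NOT Scylla's real classes.
struct abstract_type {
    enum class kind { int32, vector };
    explicit abstract_type(kind k) : _kind(k) {}
    virtual ~abstract_type() = default;
    kind get_kind() const { return _kind; }
    virtual std::string cql3_type_name() const = 0;
private:
    kind _kind;
};

using data_type = std::shared_ptr<const abstract_type>;

struct int32_type_impl final : abstract_type {
    int32_type_impl() : abstract_type(kind::int32) {}
    std::string cql3_type_name() const override { return "int"; }
};

// Hypothetical vector type: an element type plus a fixed dimension.
struct vector_type_impl final : abstract_type {
    vector_type_impl(data_type elements, std::size_t dimension)
        : abstract_type(kind::vector)
        , _elements(std::move(elements))
        , _dimension(dimension) {}
    const data_type& get_elements_type() const { return _elements; }
    std::size_t get_dimension() const { return _dimension; }
    std::string cql3_type_name() const override {
        return "vector<" + _elements->cql3_type_name() + ", "
             + std::to_string(_dimension) + ">";
    }
private:
    data_type _elements;
    std::size_t _dimension;
};

// Dispatch on the kind tag and pass the concrete type to the visitor.
// The visitor must return the same type from every overload.
template <typename Visitor>
auto visit(const abstract_type& t, Visitor&& v) {
    switch (t.get_kind()) {
    case abstract_type::kind::int32:
        return v(static_cast<const int32_type_impl&>(t));
    case abstract_type::kind::vector:
        return v(static_cast<const vector_type_impl&>(t));
    }
    throw std::logic_error("unknown type kind");
}
```

In this model, every place that switches over the kind tag has to learn about the new vector case, which is roughly why the change touches so much of the types module.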

You may have to implement more after all, but I'm not sure what will be needed; the pieces above are certainly required. I recommend proceeding with the later steps, adding more to the types module as needed, and then reworking it when preparing the PR for review.

Extend CQL grammar to be able to express the vector type

Now that you have an internal representation of the vector type, you can implement the necessary syntax so that you can create a table with a column of the vector type. Start by adding the syntax and work your way down the abstractions, implementing what is needed. After this point, you should be able to create a table with a vector column and, most likely, write to and read from the table (by using bind markers, i.e. ? signs in the query).

Extend CQL grammar to be able to express vector literals

This will require delving into the cql3 layer.

The first thing that should be done there is changing the name list to list_or_vector in the collection_constructor::style_type enum.

Then, go over all occurrences of list_or_vector and fix those places up:

  • fmt::formatter<cql3::expr::expression::printer>::format - no need to change, assuming that the syntax is the same as for lists
  • do_evaluate(const collection_constructor& collection, ...) - evaluate_list could be changed to evaluate_list_or_vector, and adjusted accordingly
  • try_prepare_expression - list_prepare_expression should be changed in a similar way to evaluate_list from the previous point
  • test_assignment - ditto
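A rough sketch of why the rename makes sense. A literal such as [1, 2, 3] looks the same whether it targets a list or a vector, so only the receiving type can decide how to treat it. Everything below is a toy model rather than Scylla's cql3 code: receiver_type is a hypothetical stand-in, elements are plain ints, and the dimension check is my assumption about what "adjusted accordingly" would involve.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the type a literal is being assigned to.
struct receiver_type {
    // Empty for a list; the fixed dimension for a vector.
    std::optional<std::size_t> vector_dimension;
};

// Toy version of evaluate_list_or_vector: the same literal is accepted for
// both kinds of receiver, but a vector receiver also enforces its dimension.
std::vector<int> evaluate_list_or_vector(const std::vector<int>& elements,
                                         const receiver_type& receiver) {
    if (receiver.vector_dimension
            && elements.size() != *receiver.vector_dimension) {
        throw std::invalid_argument("vector literal has the wrong number of elements");
    }
    return elements;
}
```

With this shape, a list receiver accepts any number of elements, while a VECTOR<INT, 3> receiver rejects [1, 2].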

Tests

Some tests that use the Python driver would be appreciated. For now, you can use the upstream driver in place of the Scylla fork if the fork does not support vector types. These tests could be developed in parallel with the other steps and, for now, run only against Cassandra; running them against a valid implementation will confirm that the tests make sense.

There is also an option to write boost unit tests. There are some tests of this kind for data_value and abstract_type abstractions - check out ./test/boost/types_test.cc and ./test/boost/user_types_test.cc. It is a good idea to write at least some of those before reaching the last stage.


Jadw1 commented Nov 19, 2024

> There is also an option to write boost unit tests. There are some tests of this kind for data_value and abstract_type abstractions - check out ./test/boost/types_test.cc and ./test/boost/user_types_test.cc. It is a good idea to write at least some of those before reaching the last stage.

Boost tests are a good way to check the vector type implementation, especially in the first stage, when the CQL layer doesn't support the vector type yet. The types module is independent of the rest of the database, so you can validate the implementation without spinning up the whole system (for instance, the test cases in test/boost/types_test.cc don't use cql_env).

@QuerthDP QuerthDP linked a pull request Nov 24, 2024 that will close this issue
@Jadw1 Jadw1 closed this as completed in #5 Jan 24, 2025
4 participants