Parsing trouble when a column is called "vector" #483

Closed
hemidactylus opened this issue Sep 27, 2023 · 6 comments · Fixed by #495
hemidactylus commented Sep 27, 2023

On a schema like this

CREATE TABLE cassio_tutorials.my_vector_db_table (
    row_id text PRIMARY KEY,
    attributes_blob text,
    body_blob text,
    metadata_s map<text, text>,
    vector vector<float, 1536>
);

dsbulk unload gets confused by the column named vector, and the unload operation fails with an error like:

Operation UNLOAD_20230927-152011-882206 failed: Invalid query: 'SELECT row_id, attributes_blob, body_blob, metadata_s, vector FROM cassio_tutorials.my_vector_db_table WHERE token(row_id) > :start AND token(row_id) <= :end' could not be parsed at line 1:55: mismatched input 'vector' expecting {'(', ':', '[', '-', '{', K_AS, K_KEY, K_PER, K_PARTITION, K_DISTINCT, K_COUNT, K_VALUES, K_TIMESTAMP, K_TTL, K_CAST, K_TYPE, K_FILTERING, K_CONTAINS, K_GROUP, K_CLUSTERING, K_ASCII, K_BIGINT, K_BLOB, K_BOOLEAN, K_COUNTER, K_DECIMAL, K_DOUBLE, K_DURATION, K_FLOAT, K_INET, K_INT, K_SMALLINT, K_TINYINT, K_TEXT, K_UUID, K_VARCHAR, K_VARINT, K_TIMEUUID, K_TOKEN, K_WRITETIME, K_DATE, K_TIME, K_NULL, K_EXISTS, K_MAP, K_LIST, K_NAN, K_INFINITY, K_TUPLE, K_FROZEN, K_JSON, K_LIKE, STRING_LITERAL, QUOTED_NAME, INTEGER, '?', FLOAT, BOOLEAN, DURATION, IDENT, HEXNUMBER, UUID}.
   Caused by: InputMismatchException (no message).

v1.11
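
For reference, a minimal sketch of an invocation that should reproduce this against the table above (credentials, bundle path, and output settings are placeholders, not the real ones):

# hypothetical reproduction; dsbulk generates the failing SELECT itself from the table metadata
java -jar dsbulk-1.11.0.jar unload \
  -k cassio_tutorials -t my_vector_db_table \
  -u "token" -p "<password>" -b "<secure-connect-bundle.zip>"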

msmygit (Collaborator) commented Sep 27, 2023

We probably also need to test other similar cases, e.g. when the table name is the same as a CQL data type, like below:

token@cqlsh:correctness> create table int(int int primary key);
token@cqlsh:correctness> desc tables;

int  vector

token@cqlsh:correctness> desc table int;

CREATE TABLE correctness.int (
    int int PRIMARY KEY
);
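
A hedged sketch of the worst case for the parser, i.e. a table named vector with a column named vector (the names and dimension below are illustrative, not from an actual schema):

-- hypothetical test table; both the table and the column are named vector
CREATE TABLE correctness.vector (
    id text PRIMARY KEY,
    vector vector<float, 3>
);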

absurdfarce (Collaborator) commented

Agreed @msmygit. My guess is that we got something wrong in the CQL parser code; specifically, we're trying to pull out some kind of non-keyword token and failing miserably.

Any fix to this ticket should involve a test to validate that we've addressed the issue generally.

hemidactylus (Author) commented

I have checked with INT and TEXT and... they work just fine. So it seems to be something specific to vector.

java -jar dsbulk-1.11.0.jar unload -k ... -t a1 -u "token" -p ... -b ...

Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
Operation directory: ...

int
12
11
10
total | failed | rows/s |  p50ms |  p99ms | p999ms
    3 |      0 |      2 | 119.80 | 121.11 | 121.11
Operation UNLOAD_20230928-155343-102056 completed successfully in 1 second.
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=...

java -jar dsbulk-1.11.0.jar unload -k ... -t a2 -u "token" -p ... -b ...

Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
Operation directory: ...

text
t3
t1
t2
total | failed | rows/s |  p50ms |  p99ms | p999ms
    3 |      0 |      2 | 119.63 | 120.59 | 120.59
Operation UNLOAD_20230928-155431-632878 completed successfully in 1 second.
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=...

hemidactylus (Author) commented

In the meantime @phact has brilliantly found a workaround:

-query 'SELECT row_id, attributes_blob, body_blob, metadata_s, \"vector\" FROM poc_data.product_table'
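
Note that QUOTED_NAME appears in the list of expected tokens in the error above, which is presumably why this works: double-quoting makes CQL read vector as a plain (case-sensitive) identifier instead of tripping the parser's type-keyword handling. A full unload invocation using the workaround might look like this sketch (connection flags are placeholders):

java -jar dsbulk-1.11.0.jar unload \
  -query 'SELECT row_id, attributes_blob, body_blob, metadata_s, \"vector\" FROM poc_data.product_table' \
  -u "token" -p "<password>" -b "<secure-connect-bundle.zip>"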

pravinbhat commented

I hit the same issue above, with the vector column named vector. Some of our libraries name the vector column exactly that, which makes it important to address this issue.

pravinbhat commented

Adding details for the insert (load) workaround as well, since it's a bit tricky:
-query 'insert into <ks>.<table> (row_id, attributes_blob, body_blob, metadata_s, \"vector\") values (:row_id, :attributes_blob, :body_blob, :metadata_s, :\"vector\")'
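
Spelled out as a full load invocation, that might look like the sketch below (the CSV source and connection flags are placeholders). Note the column has to be quoted both in the column list and in the named bind variable:

java -jar dsbulk-1.11.0.jar load \
  -query 'insert into <ks>.<table> (row_id, attributes_blob, body_blob, metadata_s, \"vector\") values (:row_id, :attributes_blob, :body_blob, :metadata_s, :\"vector\")' \
  -url "<data.csv>" -u "token" -p "<password>" -b "<secure-connect-bundle.zip>"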
