
DSBulk DELETE can not accept any ranges on the clustering column when used within -query #490

Open
biswadipseth1980 opened this issue Mar 12, 2024 · 0 comments

Comments


biswadipseth1980 commented Mar 12, 2024

DSBulk DELETE cannot accept a range restriction on a clustering column, whereas SELECT can.
cqlsh also handles the same range condition.

Consider the following table, from which the customer needs to delete data:

    CREATE TABLE live_mom_transaction (
        str_num text,
        day_no int,
        transaction_uid text,
        trans_typ text,
        sequence_no text,
        created_date timestamp,
        raw_data text,
        updated_date timestamp,
        PRIMARY KEY ((str_num, day_no), transaction_uid, trans_typ, sequence_no)
    ) WITH CLUSTERING ORDER BY (transaction_uid ASC, trans_typ ASC, sequence_no ASC);

A SELECT with conditions on these columns works in cqlsh, but the equivalent DELETE through DSBulk fails because a range cannot be specified on the clustering key sequence_no. For instance, the following query succeeds in cqlsh:

cqlsh> Select *  FROM biswa_ks.live_mom_transaction WHERE str_num='006' and day_no=7 and transaction_uid='14d6f78c-c468-4b71-84ce-0a8009bf3dc1' and trans_typ= 'SALE' and  sequence_no > '0';

 str_num | day_no | transaction_uid                      | trans_typ | sequence_no | created_date                    | raw_data                       | updated_date
---------+--------+--------------------------------------+-----------+-------------+---------------------------------+--------------------------------+---------------------------------
     006 |      7 | 14d6f78c-c468-4b71-84ce-0a8009bf3dc1 |      SALE |     seq-188 | 2024-06-05 00:55:41.789000+0000 | {"item":"sample","price":9.99} | 2024-06-05 01:55:41.789000+0000

However, running the corresponding DELETE through DSBulk with the command below results in an error, because DSBulk does not accept a range condition on the clustering column sequence_no, even though cqlsh can handle it:

cat /home/automaton/data/output-000001.csv|./dsbulk load -query "DELETE FROM biswa_ks.live_mom_transaction WHERE str_num=:str_num and day_no=:day_no and transaction_uid=:transaction_uid and trans_typ=:trans_typ and sequence_no > '0' "
Operation directory: /home/automaton/dsbulk-1.9.0/bin/logs/LOAD_20240312-155253-850242

Operation LOAD_20240312-155253-850242 failed: Missing required primary key column sequence_no from schema.mapping or schema.query.

This also means that DSBulk-generated deletes (which use the full primary key rather than a clustering range) may create a large number of tombstones. For example, if there are on average 10 sequence_no values per (str_num, day_no, transaction_uid, trans_typ) combination, the deletes will create about 10x as many tombstones.
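
The tombstone arithmetic can be sketched as follows. This is only an illustration: it assumes that each full-primary-key delete writes one row tombstone and that each range delete on sequence_no writes one range tombstone per clustering prefix. The helper name `tombstone_counts` and the generated `seq-0` to `seq-9` values are made up for the example.

```python
from collections import Counter

def tombstone_counts(rows):
    """Return (row_level, range_level) tombstone counts for a list of primary keys.

    Each row is (str_num, day_no, transaction_uid, trans_typ, sequence_no).
    Assumption: rows are distinct primary keys.
    """
    # Group rows by everything except the last clustering column.
    prefixes = Counter(row[:4] for row in rows)
    row_level = len(rows)        # one tombstone per full primary key
    range_level = len(prefixes)  # one range tombstone per clustering prefix
    return row_level, range_level

# Hypothetical data: 10 sequence_no values under a single prefix.
rows = [
    ("006", 7, "14d6f78c-c468-4b71-84ce-0a8009bf3dc1", "SALE", f"seq-{i}")
    for i in range(10)
]
print(tombstone_counts(rows))  # (10, 1)
```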

The same SELECT works in DSBulk:

[dse_test_kmip11:0] automaton@ip-10-166-76-196:~/dsbulk-1.9.0/bin$ ./dsbulk unload -url /home/automaton/data -query "Select *  FROM biswa_ks.live_mom_transaction WHERE str_num='006' and day_no=7 and transaction_uid='068cfdf3-ae5b-48f1-a8de-27b70e2c3daf' and trans_typ= 'SALE' and  sequence_no > '0' "
Operation directory: /home/automaton/dsbulk-1.9.0/bin/logs/UNLOAD_20240312-202339-691820
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      3 | 20.51 | 20.58 |  20.58
Operation UNLOAD_20240312-202339-691820 completed successfully in less than one second.
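
Since the unload works, one possible workaround is to collapse the unloaded CSV into one range DELETE per clustering prefix and run the resulting statements through cqlsh instead of DSBulk. The sketch below is not part of DSBulk and makes several assumptions: the CSV has a header row with the column names from the table above, the `sequence_no > '0'` condition is the one wanted, and the helper names `cql_text` and `range_deletes` and the second sample row are hypothetical.

```python
import csv
import io

def cql_text(value: str) -> str:
    """Quote a CQL text literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def range_deletes(csv_file, table="biswa_ks.live_mom_transaction"):
    """Yield one range DELETE per distinct (str_num, day_no, transaction_uid, trans_typ)."""
    seen = set()
    for row in csv.DictReader(csv_file):
        key = (row["str_num"], int(row["day_no"]), row["transaction_uid"], row["trans_typ"])
        if key in seen:
            continue  # this prefix is already covered by an earlier statement
        seen.add(key)
        yield (
            f"DELETE FROM {table} WHERE str_num={cql_text(key[0])} "
            f"AND day_no={key[1]} AND transaction_uid={cql_text(key[2])} "
            f"AND trans_typ={cql_text(key[3])} AND sequence_no > '0';"
        )

# Made-up sample standing in for the unloaded CSV file.
sample = io.StringIO(
    "str_num,day_no,transaction_uid,trans_typ,sequence_no\n"
    "006,7,14d6f78c-c468-4b71-84ce-0a8009bf3dc1,SALE,seq-188\n"
    "006,7,14d6f78c-c468-4b71-84ce-0a8009bf3dc1,SALE,seq-189\n"
)
for statement in range_deletes(sample):
    print(statement)
```

The printed statements could be saved to a file and executed with cqlsh. Both sample rows share one prefix, so a single DELETE is emitted.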

The related Slack conversation is at:
https://datastax.slack.com/archives/C6B5L9GQN/p1710264412970939

biswadipseth1980 changed the title from "DSBulk can not accept any ranges on the clustering column where as cqlsh can work with the same" to "DSBulk DELETE can not accept any ranges on the clustering column" on Mar 12, 2024
msmygit changed the title from "DSBulk DELETE can not accept any ranges on the clustering column" to "DSBulk DELETE can not accept any ranges on the clustering column when used within -query" on Mar 12, 2024