Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate fails to get response from test cluster #382

Closed
2 tasks
fruch opened this issue Jul 9, 2023 · 4 comments
Closed
2 tasks

Validate fails to get response from test cluster #382

fruch opened this issue Jul 9, 2023 · 4 comments

Comments

@fruch
Copy link
Collaborator

fruch commented Jul 9, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.
AssertionError: [{'errors': [{'timestamp': '2023-07-07T13:33:58.888559481Z', 'message': 'Validation failed: unable to load check data from the test store: gocql: expected 2 values send got 0', 'query': 'SELECT * FROM ks1.table1 WHERE pk0=? AND pk1=? ', 'stmt-type': 'SelectStatement'}], 'write_ops': 252722571, 'write_errors': 0, 'read_ops': 149894044, 'read_errors': 1}]

It would help if this error would have to he actual values that was used. also it would help if gemini can have a full log to prove those key were written

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 5.4.0~dev-20230703.1ab2bb69b8a6 with build-id 1054adfa55f238441c3044e6213fd02d31d43279

Cluster size: 3 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • gemini-1tb-10h-master-oracle-db-node-fae3291b-1 (34.228.160.205 | 10.12.10.78) (shards: 14)
  • gemini-1tb-10h-master-db-node-fae3291b-5 (3.80.47.237 | 10.12.10.114) (shards: 14)
  • gemini-1tb-10h-master-db-node-fae3291b-4 (54.80.189.156 | 10.12.9.192) (shards: 14)
  • gemini-1tb-10h-master-db-node-fae3291b-3 (54.226.136.192 | 10.12.10.80) (shards: 14)
  • gemini-1tb-10h-master-db-node-fae3291b-2 (204.236.208.174 | 10.12.8.114) (shards: 14)
  • gemini-1tb-10h-master-db-node-fae3291b-1 (3.91.201.90 | 10.12.9.198) (shards: 14)

OS / Image: `` (aws: undefined_region)

Test: gemini-1tb-10h
Test id: fae3291b-604b-43f9-b22b-c98e1bb06ae2
Test name: scylla-master/gemini-/gemini-1tb-10h
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor fae3291b-604b-43f9-b22b-c98e1bb06ae2
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fae3291b-604b-43f9-b22b-c98e1bb06ae2

Logs:

Jenkins job URL
Argus

@fruch
Copy link
Collaborator Author

fruch commented Jul 11, 2023

it's not clear, if this the only validation error ? should it stop on the first one ?

{"L":"INFO","T":"2023-07-07T13:33:58.831Z","N":"work cycle.validation_job","M":"Validation failed. Error: unable to load check data from the test store: gocql: expected 2 values send got 0"}
Error detected: &joberror.JobError{Timestamp:time.Date(2023, time.July, 7, 13, 33, 58, 888559481, time.Local), Message:"Validation failed: unable to load check data from the test store: gocql: expected 2 values send got 0", Query:"SELECT * FROM ks1.table1 WHERE pk0=? AND pk1=? ", StmtType:"SelectStatement"}{"L":"INFO","T":"2023-07-07T13:33:58.895Z","N":"work cycle.validation_job","M":"ending validation loop"}

also after the validation failure, seem like still ongoing mutations, and some are failing:

{"L":"ERROR","T":"2023-07-07T13:34:21.368Z","M":"failed to apply mutation","attempts":15,"error":"[cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']: gocql: expected 20 values send got 18","errorVerbose":"gocql: expected 20 values send got 18\n[cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).doMutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:87\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:53\ngithub.com/scylladb/gemini/pkg/store.mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:173\ngithub.com/scylladb/gemini/pkg/store.delegatingStore.Mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:164\ngithub.com/scylladb/gemini/pkg/jobs.mutation\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:385\ngithub.com/scylladb/gemini/pkg/jobs.mutationJob\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:182\ngithub.com/scylladb/gemini/pkg/jobs.List.Run.func2\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:136\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/ls/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
{"L":"INFO","T":"2023-07-07T13:34:21.373Z","N":"delegating_store","M":"oracle failed mutation, transition to next state impossible so continuing with next mutation","error":"unable to apply mutations to the oracle store: [cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']: gocql: expected 20 values send got 18","errorVerbose":"gocql: expected 20 values send got 18\n[cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).doMutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:87\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:53\ngithub.com/scylladb/gemini/pkg/store.mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:173\ngithub.com/scylladb/gemini/pkg/store.delegatingStore.Mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:164\ngithub.com/scylladb/gemini/pkg/jobs.mutation\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:385\ngithub.com/scylladb/gemini/pkg/jobs.mutationJob\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:182\ngithub.com/scylladb/gemini/pkg/jobs.List.Run.func2\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:136\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/ls/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nunable to apply mutations to the oracle store\ngithub.com/scylladb/gemini/pkg/store.mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:174\ngithub.com/scylladb/gemini/pkg/store.delegatingStore.Mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:164\ngithub.com/scylladb/gemini/pkg/jobs.mutation\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:385\ngithub.com/scylladb/gemini/pkg/jobs.mutationJob\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:182\ngithub.com/scylladb/gemini/pkg/jobs.List.Run.func2\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:136\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/ls/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
Error: gemini encountered errors, exiting with non zero status
{"L":"INFO","T":"2023-07-07T13:34:21.373Z","N":"work cycle.mutation_job","M":"ending mutation loop"}

@dkropachev doesn't gemini finish first the mutations loops ? and only then doing validation loops ?
or I'm missing something basic on how gemini works ?

@dkropachev
Copy link
Collaborator

it's not clear, if this the only validation error ? should it stop on the first one ?

It does, in a sense, there are two ways gemini stops, soft stop and hard stop.
Hard stop cancels top context, it should lead to emmidiate stop of any running task.
Soft stop just tells all the loops to stop on next iteration.

When error happens and --fail-fast is there, gemini is commanded to soft stop, that is why you see that error on the bottom.
But I am curiouse on why did take 23 seconds to finish mutation operation, even if default timeout is 30 Seconds this error should pop right away.

{"L":"INFO","T":"2023-07-07T13:33:58.831Z","N":"work cycle.validation_job","M":"Validation failed. Error: unable to load check data from the test store: gocql: expected 2 values send got 0"}
Error detected: &joberror.JobError{Timestamp:time.Date(2023, time.July, 7, 13, 33, 58, 888559481, time.Local), Message:"Validation failed: unable to load check data from the test store: gocql: expected 2 values send got 0", Query:"SELECT * FROM ks1.table1 WHERE pk0=? AND pk1=? ", StmtType:"SelectStatement"}{"L":"INFO","T":"2023-07-07T13:33:58.895Z","N":"work cycle.validation_job","M":"ending validation loop"}

also after the validation failure, seem like still ongoing mutations, and some are failing:

{"L":"ERROR","T":"2023-07-07T13:34:21.368Z","M":"failed to apply mutation","attempts":15,"error":"[cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']: gocql: expected 20 values send got 18","errorVerbose":"gocql: expected 20 values send got 18\n[cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).doMutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:87\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:53\ngithub.com/scylladb/gemini/pkg/store.mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:173\ngithub.com/scylladb/gemini/pkg/store.delegatingStore.Mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:164\ngithub.com/scylladb/gemini/pkg/jobs.mutation\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:385\ngithub.com/scylladb/gemini/pkg/jobs.mutationJob\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:182\ngithub.com/scylladb/gemini/pkg/jobs.List.Run.func2\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:136\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/ls/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
{"L":"INFO","T":"2023-07-07T13:34:21.373Z","N":"delegating_store","M":"oracle failed mutation, transition to next state impossible so continuing with next mutation","error":"unable to apply mutations to the oracle store: [cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']: gocql: expected 20 values send got 18","errorVerbose":"gocql: expected 20 values send got 18\n[cluster = oracle, query = 'INSERT INTO ks1.table1 (pk0,pk1,ck0,ck1,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,(?,?,?,?,?),?,?,?) ']\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).doMutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:87\ngithub.com/scylladb/gemini/pkg/store.(*cqlStore).mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/cqlstore.go:53\ngithub.com/scylladb/gemini/pkg/store.mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:173\ngithub.com/scylladb/gemini/pkg/store.delegatingStore.Mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:164\ngithub.com/scylladb/gemini/pkg/jobs.mutation\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:385\ngithub.com/scylladb/gemini/pkg/jobs.mutationJob\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:182\ngithub.com/scylladb/gemini/pkg/jobs.List.Run.func2\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:136\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/ls/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nunable to apply mutations to the oracle store\ngithub.com/scylladb/gemini/pkg/store.mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:174\ngithub.com/scylladb/gemini/pkg/store.delegatingStore.Mutate\n\t/home/ls/repos/gemini_upstream/pkg/store/store.go:164\ngithub.com/scylladb/gemini/pkg/jobs.mutation\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:385\ngithub.com/scylladb/gemini/pkg/jobs.mutationJob\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:182\ngithub.com/scylladb/gemini/pkg/jobs.List.Run.func2\n\t/home/ls/repos/gemini_upstream/pkg/jobs/jobs.go:136\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/ls/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
Error: gemini encountered errors, exiting with non zero status
{"L":"INFO","T":"2023-07-07T13:34:21.373Z","N":"work cycle.mutation_job","M":"ending mutation loop"}

@dkropachev doesn't gemini finish first the mutations loops ? and only then doing validation loops ? or I'm missing something basic on how gemini works ?
Nope, it runs them in parallel, there is a way to avoid collisions, called inflights, but it does not take into account range queries, since there is no way to get affected inflights from the range, that is why it have seperate retry logic on range validation queries.

@dkropachev
Copy link
Collaborator

@fruch , all in all behavior here is expected, run had -f which commanded gemini to fail on first error.
Error that gemini observed came from the cluster, so, gemini have nothing to do with it.

I isolated issue with failing mutation here - #384.
Since failing mutations are not primary reason of test to fail, i am closing it up.
Feel free to reopen if you think differently.

@fruch
Copy link
Collaborator Author

fruch commented Jul 11, 2023

@dkropachev

SELECT * FROM ks1.table1 WHERE pk0=? AND pk1=? isn't enough information to figure out what's going on

we would need to have the parameter used logged
and also the response from the oracle that it's comparing with would be helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants