argus.client.base.ArgusClientError: 'API Error encountered using endpoint on submit_results #492

juliayakovlev · 2024-10-27T14:35:42Z

The scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes test runs 3 load stages.

The write and read stages passed.

But the mixed stage failed with:

10:08:27  2024-10-27 08:06:16.817: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=e8de7474-09f3-4770-bbb4-2b78159ce95b, source=PerformanceRegressionPredefinedStepsTest.test_mixed_gradual_increase_load (performance_regression_gradual_grow_throughput.PerformanceRegressionPredefinedStepsTest)() message=Traceback (most recent call last):
10:08:27  File "/home/ubuntu/scylla-cluster-tests/performance_regression_gradual_grow_throughput.py", line 72, in test_mixed_gradual_increase_load
10:08:27  self._base_test_workflow(workload=workload,
10:08:27  File "/home/ubuntu/scylla-cluster-tests/performance_regression_gradual_grow_throughput.py", line 127, in _base_test_workflow
10:08:27  self.run_gradual_increase_load(workload=workload,
10:08:27  File "/home/ubuntu/scylla-cluster-tests/performance_regression_gradual_grow_throughput.py", line 209, in run_gradual_increase_load
10:08:27  results = run_step(stress_cmds=workload.cs_cmd_tmpl, current_throttle=current_throttle,
10:08:27  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 272, in wrapped
10:08:27  send_result_to_argus(
10:08:27  File "/home/ubuntu/scylla-cluster-tests/sdcm/argus_results.py", line 165, in send_result_to_argus
10:08:27  argus_client.submit_results(result_table)
10:08:27  File "/usr/local/lib/python3.10/site-packages/argus/client/base.py", line 224, in submit_results
10:08:27  self.check_response(response)
10:08:27  File "/usr/local/lib/python3.10/site-packages/argus/client/base.py", line 68, in check_response
10:08:27  raise ArgusClientError(
10:08:27  argus.client.base.ArgusClientError: ('API Error encountered using endpoint: POST /api/v1/client/testrun/scylla-cluster-tests/7cb52991-e1ae-4e3e-8351-856c6b216b82/submit_results', '7cb52991-e1ae-4e3e-8351-856c6b216b82')

https://jenkins.scylladb.com/job/scylla-enterprise/job/perf-regression/job/scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes/20/

soyacz · 2024-10-28T08:02:35Z

The problem is that, for some reason, this run was not created in Argus at all:

< t:2024-10-27 04:14:54,161 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:14:54,729 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'Created', 'status': 'ok'}

I don't know the root cause yet.

k0machi · 2024-11-04T09:25:21Z

This could mean that the test failed creation in a separate stage but then got re-created by SCT itself, hence the "Run not found" exception. The next response indicates that the run was created, but something may have dropped it later: maybe ID reuse or a consistency issue. We need to check all stages that interacted with Argus.

k0machi · 2024-11-04T09:27:53Z

It could also mean that another test (read/write) did this, but mixed didn't.

fruch · 2024-11-11T16:07:02Z

@k0machi what's up with this one?

It seems we don't have a "Create Argus Test Run" stage in perfRegressionParallelPipeline.groovy.

Regardless, we don't expect creation to fail, and we need to be able to look up the logs for this failure (on the Argus end) to understand it.

k0machi · 2024-11-18T14:09:26Z

Without the explicit stage, the run is created inside the "Run SCT Stages" stage, specifically during ClusterTester init, so the cause of the failure should be visible in the logs for that particular SCT run.
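
For illustration, the "Run not found" followed by "Created" pair in the logs is consistent with a get-or-create flow at init. Below is a minimal sketch of such a flow with a read-back check. It assumes a hypothetical in-memory `FakeArgus` client; `get_run`, `create_run`, `ensure_run` and `RunNotFound` are illustrative names, not the real Argus client API:

```python
import time


class RunNotFound(Exception):
    """Raised by the hypothetical client when a run id is unknown."""


class FakeArgus:
    """In-memory stand-in for the results service (NOT the real Argus client)."""

    def __init__(self, lose_writes=False):
        self._runs = {}
        # Assumption for the sketch: simulate an acknowledged but lost write.
        self._lose_writes = lose_writes

    def get_run(self, run_id):
        if run_id not in self._runs:
            raise RunNotFound(run_id)
        return self._runs[run_id]

    def create_run(self, run_id):
        if not self._lose_writes:
            self._runs[run_id] = {"id": run_id}
        return "Created"  # the service acknowledges either way


def ensure_run(client, run_id, attempts=3, delay=0.0):
    """Get the run, creating it if missing, then read it back to confirm."""
    try:
        return client.get_run(run_id)
    except RunNotFound:
        client.create_run(run_id)
    for _ in range(attempts):
        try:
            return client.get_run(run_id)
        except RunNotFound:
            time.sleep(delay)
    raise RuntimeError(
        f"run {run_id} was acknowledged as created but cannot be read back"
    )
```

With a check like this, a lost run would fail at init with a clear message rather than much later in `submit_results`.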

k0machi · 2024-11-20T19:04:02Z

Looking into this, I see a weird issue.

Here's the write test, initializing successfully:

< t:2024-10-27 04:13:51,469 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '70fa1e8d-a52e-41ec-aeb0-9da8ef711ec6'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:13:52,035 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'Created', 'status': 'ok'}
< t:2024-10-27 04:13:52,387 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'updated', 'status': 'ok'}
< t:2024-10-27 04:13:52,743 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'added', 'status': 'ok'}

And here's the mixed test's argus.log:

< t:2024-10-27 04:14:54,161 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:14:54,729 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'Created', 'status': 'ok'}
< t:2024-10-27 04:14:55,080 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:14:55,430 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': [], 'exception': 'DoesNotExist'}, 'status': 'error'}
< t:2024-10-27 04:14:55,431 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:15:18,778 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': ['Run not found', '7cb52991-e1ae-4e3e-8351-856c6b216b82'], 'exception': 'SCTServiceException'}, 'status': 'error'}
< t:2024-10-27 04:15:25,812 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': {'arguments': [], 'exception': 'DoesNotExist'}, 'status': 'error'}

So this is weird: the API responded that it had successfully saved the run, yet the run does not exist for subsequent calls. Manually querying the run doesn't help; it indeed doesn't exist.

I suspect something happened with Scylla at that moment. Unfortunately, the logs for this are no longer available (and I suspect the application logs won't show much, since the API did respond with success, which means something happened on Scylla's end).
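
To spot this pattern in other runs, a small script could scan an argus.log for a "Run not found" error that appears after a "Created" response. This is only a sketch: the regular expression is an assumption based on the log lines quoted in this thread, and `parse_line` and `lost_after_create` are illustrative helpers, not part of SCT or Argus:

```python
import ast
import re

# Assumed line shape: "< t:<date> <time> ... > API Response: {...}"
LINE_RE = re.compile(r"t:(?P<ts>\S+ \S+).*?API Response: (?P<body>\{.*\})\s*$")


def parse_line(line):
    """Return (timestamp, response_dict) for an 'API Response' line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return match.group("ts"), ast.literal_eval(match.group("body"))


def lost_after_create(lines):
    """Return timestamps of 'Run not found' errors seen after a 'Created' response."""
    created = False
    lost = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, body = parsed
        response = body.get("response")
        is_error = isinstance(response, dict)
        if response == "Created":
            created = True
        elif created and is_error and "Run not found" in response.get("arguments", []):
            lost.append(ts)
    return lost
```

Run against the mixed test's log above, this would flag the entries from 04:14:55,080 onward; the write test's log would produce an empty list.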

fruch · 2024-11-20T19:13:24Z

That is very disturbing.

Did you check on each node separately?

It's not the first time we've run into such cases; last time it was during a Scylla upgrade or node replacement.

What more information can we log to help figure this out next time?

(We should collect logs as soon as issues are reported, or archive logs periodically to S3 or something like that.)

k0machi · 2024-11-20T19:31:34Z

I haven't checked individual nodes yet; I'll do that. We should collect logs periodically. Maybe have a GitHub Action that takes a snapshot of the last N hours of production each time an issue is reported?

soyacz · 2024-11-25T11:06:21Z

I suspect a Scylla MV issue: both id and test_id are indexed columns, so when the MV fails to update (which is not guaranteed by a CQL insert/update request), we may hit a case where the insert succeeds but querying for it does not. @k0machi can you confirm whether this ID exists without using an indexed column in the query?
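
To make the hypothesis concrete, here is a toy model in plain Python. It is not Scylla code and makes no claim about how Scylla actually behaves: `ToyTable`, its methods, and the `index_update_ok` flag are all hypothetical. It only shows how a row could be readable by primary key while a lookup through a separately maintained index misses it:

```python
class ToyTable:
    """Toy model of a base table plus a separately updated secondary index."""

    def __init__(self):
        self.base = {}        # primary key -> row
        self.by_test_id = {}  # indexed column value -> primary key

    def insert(self, pk, row, index_update_ok=True):
        """Write the row; the index update may silently fail (assumption)."""
        self.base[pk] = row
        if index_update_ok:
            self.by_test_id[row["test_id"]] = pk

    def get_by_pk(self, pk):
        """Look up the row directly in the base table."""
        return self.base.get(pk)

    def get_by_test_id(self, test_id):
        """Look up the row through the index, as an indexed-column query would."""
        pk = self.by_test_id.get(test_id)
        return None if pk is None else self.base.get(pk)


table = ToyTable()
table.insert("pk-1", {"test_id": "t-1"}, index_update_ok=False)
print(table.get_by_pk("pk-1"))       # row is present in the base table
print(table.get_by_test_id("t-1"))   # index lookup misses it
```

If this were the cause, a query that avoids the indexed columns would still find the run, which is what the question above is meant to check.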

k0machi · 2024-11-25T14:08:18Z

No, the id doesn't exist at all. Only two runs with #20 as their build number exist for this one.

fruch · 2024-12-02T08:42:43Z

@k0machi

What's next? What can we do to capture more data when it happens again?

k0machi · 2024-12-02T10:30:05Z

I will add a context dump to the exceptions raised in the submit requests (or we could just dump them wholly on every error to improve readability: session data, request body, etc.).

This commit adds additional information and trace IDs to the exceptions that occur inside the API calls, making it possible to collect more information about the error, including the request data. Fixes scylladb#492
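
As a rough sketch of the idea (not the code from the linked PR), an exception could carry a trace id and the request context so that a single failure can be matched across client and server logs. `APIError`, `to_dict` and the toy `submit_results` handler below are hypothetical names; only the endpoint path is taken from the traceback in this issue:

```python
import uuid


class APIError(Exception):
    """Hypothetical error type that carries a trace id and request context."""

    def __init__(self, message, *, endpoint, request_body, trace_id=None):
        super().__init__(message)
        self.trace_id = trace_id or uuid.uuid4().hex
        self.endpoint = endpoint
        self.request_body = request_body

    def to_dict(self):
        """Return a plain dict suitable for a log line or an error response."""
        return {
            "trace_id": self.trace_id,
            "message": str(self),
            "endpoint": self.endpoint,
            "request_body": self.request_body,
        }


def submit_results(store, run_id, payload):
    """Toy handler: fail with full context when the run is unknown."""
    if run_id not in store:
        raise APIError(
            "Run not found",
            endpoint=f"POST /api/v1/client/testrun/scylla-cluster-tests/{run_id}/submit_results",
            request_body=payload,
        )
    store[run_id].append(payload)
    return {"status": "ok"}
```

Logging `exc.to_dict()` on the server and returning the same `trace_id` to the client would let both sides refer to the same failure.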

fruch · 2024-12-09T14:44:05Z

We can't identify the root cause; new logging is being added to help identify such cases.

juliayakovlev assigned k0machi Oct 27, 2024

k0machi added the bug Something isn't working label Nov 18, 2024

fruch removed the bug Something isn't working label Dec 2, 2024

k0machi mentioned this issue Dec 9, 2024

improvement(error_handlers): Add context dump to the exceptions #535

Open

fruch closed this as completed Dec 9, 2024
