Blocking queries may block longer than necessary in particular race conditions. #18266
Comments
Thanks for another detailed report @stswidwinski 😄 Just to make sure I understand the problem and the reproduction correctly, in step 2 you mentioned:
Is the script expected to keep printing outputs without blocking until step 3? If so, that's not what I'm seeing. All calls are blocked waiting for the index to change:
I believe that this is expected since index 1 is somewhat special and an
entirely clean cluster will work differently due to this special case:
https://github.com/hashicorp/nomad/blob/main/nomad/job_endpoint.go#L1560
Sorry, I assumed that there was at least one transaction before this is run
-- mea culpa!
…On Mon, Aug 21, 2023, 6:31 PM Luiz Aoqui ***@***.***> wrote:
Thanks for another detailed report @stswidwinski
<https://github.com/stswidwinski> 😄
Just to make sure I understand the problem and the reproduction correctly,
in step 2 you mentioned:
This will just forever return information and use the latest index known.
This is expected to continue to return information ~in a tight loop and
pretty much never block for a very long time.
Is the script expected to keep printing outputs without blocking until
step 3? If so, that's not what I'm seeing. All calls are blocked waiting
for the index to change:
+ NOMAD=./nomad
+ TMP_DIR=/tmp/test
+ mkdir -p /tmp/test
+ INDEX=1
+ true
+ for i in '{1..10}'
+ pids[${i}]=73477
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73478
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73479
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73480
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73481
+ for i in '{1..10}'
+ pids[${i}]=73482
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73483
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73484
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73485
+ for i in '{1..10}'
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
+ pids[${i}]=73486
+ for pid in '${pids[*]}'
+ wait 73477
+ ./nomad operator api -verbose '/v1/job/test_job-1/allocations?index=1&all=true'
Nomad version
Tip.
Operating system and Environment details
Unix
Issue
The blocking query logic is written such that awaiting on a trigger may race with the trigger itself, resulting in a block which is much longer than necessary. To be exact:
https://github.com/hashicorp/nomad/blob/main/nomad/rpc.go#L832-L870
Let us consider the following lines:
Consider the following set of events:
The implementation of ws.WatchCtx is such that notifications from point 2 will be ignored and will not trigger a retry. This may cause long queries (sometimes lasting all the way to the timeout) where the query could have been satisfied instantly. One should likely begin observing the context before "the query" begins in order to handle retries correctly.
Reproduction steps
A simple repro is to explicitly race against state changes within "the query." Here is a simple way to achieve this with relatively high probability.
Step 1: Start the dev server
./nomad agent -dev
Step 2: Start running a tight loop of queries for a job's allocations
Here is a sample script:
This will just forever return information and use the latest index known. This is expected to continue to return information ~in a tight loop and pretty much never block for a very long time.
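For readers who prefer Go to shell, the following is a rough sketch of one iteration of such a loop. It is not the script from this report. It assumes the allocations endpoint seen in the trace output (`/v1/job/<job>/allocations?index=N&all=true`) and the `X-Nomad-Index` response header that Nomad's HTTP API uses for blocking queries. The `nextIndex` and `fakeAgent` names are hypothetical, and `fakeAgent` is only a local stand-in so that the example runs without a cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// nextIndex issues one blocking query for a job's allocations and
// returns the index the server reports in X-Nomad-Index.
func nextIndex(c *http.Client, base, jobID string, index uint64) (uint64, error) {
	url := fmt.Sprintf("%s/v1/job/%s/allocations?index=%d&all=true", base, jobID, index)
	resp, err := c.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return strconv.ParseUint(resp.Header.Get("X-Nomad-Index"), 10, 64)
}

// fakeAgent is a hypothetical stand-in for a Nomad agent: it answers
// immediately with an empty allocation list and a fixed index.
func fakeAgent(index uint64) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Nomad-Index", strconv.FormatUint(index, 10))
		fmt.Fprint(w, "[]")
	}))
}

func main() {
	srv := fakeAgent(42)
	defer srv.Close()

	// Feed the index from each response into the next request.
	index := uint64(1)
	for i := 0; i < 3; i++ {
		got, err := nextIndex(srv.Client(), srv.URL, "test_job-1", index)
		if err != nil {
			fmt.Println("query failed:", err)
			return
		}
		fmt.Printf("asked with index=%d, server reported %d\n", index, got)
		index = got
	}
}
```

Against a real dev agent, `base` would be the agent's HTTP address instead of the test server's URL, and each call would block until the index advances or the server-side wait time elapses.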
Step 3: Start, Stop and GC a job
Observed behaviors.
The tight loop of getting job information will "deadlock" and continue waiting on an index which is lower than the index after the GC of the job and allocations. This will continue until the timeout is hit. An identical query (blocking on the same index) issued after the GC completes will return immediately without any problems.
If this doesn't repro for you immediately, just run Step 3 a couple of times. I can exhibit the behavior roughly 9 out of 10 times.
Expected Result
We block only as long as we need to and no longer.
Actual Result
We block potentially indefinitely.