-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Infinite logs in case of non-existent stream logs #17004
Conversation
Signed-off-by: Noble Mittal <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #17004 +/- ##
==========================================
- Coverage 69.34% 67.14% -2.20%
==========================================
Files 1571 1571
Lines 204179 251763 +47584
==========================================
+ Hits 141586 169048 +27462
- Misses 62593 82715 +20122 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work (as usual), @beingnoble03 ! Thank you. ❤️ I only had the very minor note about adding a comment for the new unit test.
I confirmed that it tests what we want. When running it against main:
❯ go test -timeout 10s -run ^TestGetWorkflowsSingleStream$ vitess.io/vitess/go/vt/vtctl/workflow
... MANY of these
W1018 16:36:43.159766 58815 server.go:865] Found stream log for nonexistent stream: id:1 type:"State Change" state:"Running" created_at:{seconds:1136214245} updated_at:{seconds:1136214245} message:"test message for non-existent 1" count:1
W1018 16:36:43.159768 58815 server.go:865] Found stream log for nonexistent stream: id:1 type:"State Change" state:"Running" created_at:{seconds:1136214245} updated_at:{seconds:1136214245} message:"test message for non-existent 1" count:1
W1018 16:36:43.159771 58815 server.go:865] Found stream log for nonexistent stream: id:1 type:"State Change" state:"Running" created_at:{seconds:1136214245} updated_at:{seconds:1136214245} message:"test message for non-existent 1" count:1
W1018 16:36:43.159774 58815 server.go:865] Found stream log for nonexistent stream: id:1 type:"State Change" state:"Running" created_at:{seconds:1136214245} updated_at:{seconds:1136214245} message:"test message for non-existent 1" count:1
W1018 16:36:43.159776 58815 server.go:865] Found stream log for nonexistent stream: id:1 type:"State Change" state:"Running" created_at:{seconds:1136214245} updated_at:{seconds:1136214245} message:"test message for non-existent 1" count:1
panic: test timed out after 10s
running tests:
TestGetWorkflowsSingleStream (10s)
It demonstrates perfectly what had been seen — spinning in a tight loop spewing that warning forever — and that it works as expected/desired in the PR branch.
@@ -1811,3 +1811,60 @@ func createReadVReplicationWorkflowFunc(t *testing.T, workflowType binlogdatapb. | |||
}, nil | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a brief comment about what this is testing as it's not obvious from the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out that @mattlord! I've changed the test func name and added the comment.
Signed-off-by: Noble Mittal <[email protected]>
Signed-off-by: Noble Mittal <[email protected]>
…#17004) (#17014) Signed-off-by: Noble Mittal <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Description
In
GetWorkflows()
we use the following algo while collecting logs for a workflow.We first sort all the streams of each shard by
streamID
in ASC order. Now, we callfetchStreamLogs
for each workflow. InsidefetchStreamLogs
, we fetch the streamLogs from each tablet. Now, we run afor
loop across all the tablets we got fromresults
(from query for fetching logs and results sorted bystreamID
in ASC order). Now, as we have already sorted theshardStreams
bystreamID
and logs are also ordered bystreamID
, so we use a variablestreamIdx
which starts from 0. Now, we run afor
loop across stream log results from that particular tablet (we got from the outer loop). In this loop, we updatestreamIdx
in this code block:So, basically if we increase the
streamIdx
(i.e.streamIdx++
) , the nextstreamID
will be greater than the last one because we already sorted the streams of each shard. We can consider thatcontinue
in this loop means that we are finding the stream that matches the particularstreamLog.StreamID
, and break means jumping to next log. But, in the following condition:Instead of jumping to the next log, we are continuing the search for matching
stream.Id
even when we know it's not possible to find it, because even if we increase thestreamIdx
,stream.Id
will increase as the results are already sorted. This results in an infinite loop. Instead we should break the for loop and jump to the next log.Failing Case: If the streamIDs are: 1, 2, 4, 6 and logs have streamIDs in this order: 1, 1, 1, 3, 3, 3, 4. Now, inside the loop when we are at log with streamID 3, right now we will be stuck. And, if we increase the
streamIdx
it won't be right, as we would already be atstream.Id = 4
and this will skip the logs for stream 4 and stream 6. But, if we break, we'll be accessing the next log, which would be the correct approach.Related Issue(s)
Checklist
Deployment Notes