Suppress non-critical messages to prevent logs spamming #311

Closed
wants to merge 3 commits into from

@tanjinx tanjinx commented Apr 20, 2024

Description

Suppress these two high-volume log messages:

"keyspace event resolved: msgdata/msgdata is now consistent (serving: true)"
"disruption in shard msgdata/4780-47c0 resolved (serving: true)"
$ jq .msg jq.log|awk '{print $1}'|sort|uniq -c
     29 "Adding
    138 "[core]
  13000 "disruption
     53 "Error
     91 "HealthCheckUpdate(Serving
   1708 "keyspace
     26 "Removing
   7661 "tablet
     17 "Tablet

$ jq .level jq.log|sort|uniq -c
     57 "error"
  14867 "info"
   7799 "warn"
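The same tally can be reproduced without jq. The following is a minimal sketch, assuming each line of the log is a JSON object with `msg` and `level` fields (as the jq commands above imply); the function name `tally` is hypothetical and not part of this PR.

```python
import json
from collections import Counter


def tally(lines):
    """Count log lines by the first word of `msg` and by `level`.

    Assumes one JSON object per line with `msg` and `level` keys;
    blank lines and lines that are not JSON objects are skipped.
    """
    by_first_word = Counter()
    by_level = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        words = str(record.get("msg", "")).split()
        by_first_word[words[0] if words else ""] += 1
        by_level[str(record.get("level", ""))] += 1
    return by_first_word, by_level
```

Calling `tally(open("jq.log"))` and then `most_common()` on either counter should give a view comparable to the `sort | uniq -c` output above, except that the leading quote jq prints is not part of the key.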

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: 'Tanjin Xu' <[email protected]>
@tanjinx tanjinx requested a review from a team as a code owner April 20, 2024 00:44
@tanjinx tanjinx changed the title Update zap logger setting Update zap logger level to WARNING and above Apr 20, 2024
@tanjinx tanjinx changed the title Update zap logger level to WARNING and above Suppress non-critical messages to prevent logs spamming Apr 22, 2024
@timvaillancourt
Member

@tanjinx I feel this change is necessary only if this filtering cannot be done outside of Vitess and/or there is no way to lift the artificial limits we've hit in the logging pipeline.

What I suggest is we first:

  • Check if there is a way to do this filtering in a different layer in the logging pipeline, outside of Vitess
  • Ask for an exception to what seems to be an artificial limit/threshold
  • Ask for a dedicated search index
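As an illustration of the first option, filtering in a separate layer could look something like the sketch below: a small line filter that drops the two messages named in the PR description before they reach the indexer. It assumes JSON log lines with a `msg` field; the patterns are derived from the two example messages, and `NOISY_PATTERNS`, `keep`, and `filter_lines` are hypothetical names, not part of Vitess or of any particular logging pipeline.

```python
import json
import re

# Patterns derived from the two example messages in the PR description.
# Only the "(serving: true)" variants are dropped; anything else passes.
NOISY_PATTERNS = [
    re.compile(r"^keyspace event resolved: \S+ is now consistent \(serving: true\)$"),
    re.compile(r"^disruption in shard \S+ resolved \(serving: true\)$"),
]


def keep(line):
    """Return True unless the line is a JSON record whose msg is noisy."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return True  # pass through anything we cannot parse
    if not isinstance(record, dict):
        return True
    msg = str(record.get("msg", ""))
    return not any(p.search(msg) for p in NOISY_PATTERNS)


def filter_lines(lines):
    """Yield only the lines that should continue down the pipeline."""
    for line in lines:
        if keep(line):
            yield line
```

Wired in as `sys.stdout.writelines(filter_lines(sys.stdin))`, the patterns would live in the filter's own configuration rather than in a Vitess build, so re-enabling the lines for troubleshooting would not need a new build, CI run, and deploy.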

Reducing how many logs we index by changing the Vitess code itself carries a maintenance cost that can never be paid down, and the change is difficult to reverse (new build, CI, deploy) if we need these log lines back for troubleshooting.

That said, if changing Vitess is the only way forward, I suggest this be a v14-only change that retires with v15. cc @vmogilev / @roderickyao / @venkatraju

@tanjinx
Author

tanjinx commented Apr 22, 2024

We need to ask ourselves what real value those logs have: are they really all critical for us? My experience may be limited, but the only messages that helped me debug vtgate issues during the incident were the vtgate initialization, the tablet health state changes, the query log, and the query errors.

For example, I did not see any value in the message "tabletserver is shutdown" from a Restore-type tablet, yet that message is logged every few seconds on every vtgate in the web and jobqueue pools.

tanjin.xu@z-kube-beach-tanjin-xu-iad-hedgehog:~$ grep "tabletserver is shutdown" vtgate.web.log |wc -l
8668
tanjin.xu@z-kube-beach-tanjin-xu-iad-hedgehog:~$ grep "tabletserver is shutdown" vtgate.web.log |grep "type:RESTORE"|wc -l
5686
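To put the two counts above in proportion, here is a quick calculation. The numbers are copied from the grep output above; the helper name `share` is hypothetical.

```python
def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return round(100.0 * part / total, 1)


total_shutdown = 8668    # lines matching "tabletserver is shutdown"
restore_shutdown = 5686  # of those, lines also matching "type:RESTORE"
print(share(restore_shutdown, total_shutdown))  # prints 65.6
```

Roughly two thirds of the "tabletserver is shutdown" lines in this vtgate.web.log sample come from RESTORE-type tablets.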

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale label May 23, 2024
@tanjinx tanjinx closed this May 29, 2024