Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace zap with slog; Replace logging wrappers with slog #435

Merged

Conversation

hoffmaen
Copy link
Contributor

@hoffmaen hoffmaen commented Sep 3, 2024

Summary

gorouter uses a very outdated zap version from 2016 for logging. This Pull-Request introduces required code changes to use slog as the logging frontend, together with a zap handler via zapslog. Slog provides a stable frontend, while we keep using zap with its configuration as the backend.

Both the gorouter logger and the lager logger (used for initializing the UAA Token Fetcher) used zap logger wrappers. This Pull-Request replaces the logger wrapper with direct use of slog.Logger. A thin library is provided to keep it compatible with the code. The library provides the functions CreateLogger and CreateLoggerWithSource to create a logger that subsequently writes data into the data field, and optionally sets the source field in the log's root. The library also has functions to dynamically set the WriteSyncer, which is required for tests, and to dynamically set the logging level and timestamp format.

With the removal of the Logger interface, all occurrences of the FakeLogger have been replaced with an slog.Logger based TestLogger.

Additionally, some test refactoring and code reformatting has been done.

Backward Compatibility

Breaking Change? No

This Pull-Request requires cloudfoundry/routing-release#434 for updating zap dependencies.

The zap handler configuration has been kept, so the log message format does not change. Workarounds for differing behavior of the slog.Logger frontend have been found and implemented. There are quite a lot of tests on log messages in the code, an additional end-to-end test with a deployment of this code did not show any differences compared to the old implementation.

@@ -97,11 +96,17 @@ func (r *RouteRegistry) Register(uri route.Uri, endpoint *route.Endpoint) {

switch endpointAdded {
case route.ADDED:
r.logger.Info("endpoint-registered", zapData(uri, endpoint)...)
if r.logger.Enabled(context.Background(), slog.LevelInfo) {
Copy link
Contributor

@peanball peanball Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, we have discussed replacing this with a "flattened" group of attributes that can be evaluated lazily.

The "flattening" can be achieved in slog by adding a "Group" attribute that has "" as name. This is a signal to not have a group, but add all the group's elements at the same level.

While slog does support this in its native handlers (e.g. JSONHandler), the ZAP slog handler does not quite yet support it.

There is code in the ZAP slog handler's master branch that applies this logic, but it has not made it into a release yet.

We have taken note to improve this and remove these ifs once the ZAP slog handler supports group flattening.

@plowin plowin requested a review from a team September 9, 2024 13:05
@hoffmaen hoffmaen force-pushed the refactoring/zap-to-zapslog branch 2 times, most recently from c2bde8d to ce04e0e Compare September 10, 2024 10:02
@hoffmaen hoffmaen force-pushed the refactoring/zap-to-zapslog branch from 00cc4c6 to 11ef228 Compare September 18, 2024 11:28
@hoffmaen hoffmaen requested a review from ameowlia September 18, 2024 12:18
hoffmaen and others added 6 commits September 18, 2024 14:19
Some of the log messages pass on complex data to be logged.
When the log level is not enabled, the input data is already
computed but then discarded by the logger, after checking
the active log level.

This check looks redundant but avoids many allocations and
processing that are not needed.

A nicer solution would be lazy evaluation, e.g. via callback
closures, but log/slog does not support those.
@hoffmaen hoffmaen force-pushed the refactoring/zap-to-zapslog branch from 11ef228 to 66da390 Compare September 18, 2024 12:20
@ameowlia
Copy link
Member

ameowlia commented Sep 27, 2024

I was able to validate the fixes for "tls-listener-started" and "tcp-listener-started".

😞 But I found another discrepancy, with the "endpoint-not-unregistered", "endpoint-registered", and "endpoint-unregistered" logs. All of them had the same capitalization discrepancy.

BEFORE

{
  "log_level": 1,
  "timestamp": "2024-09-25T15:17:24.679442106Z",
  "message": "endpoint-not-unregistered",
  "source": "vcap.gorouter.registry",
  "data": {
    "uri": "api.cove-3156153.z60a1041a.shepherd.lease/routing",
    "route_service_url": "",
    "backend": "10.0.1.5:3001",
    "application_id": "",
    "instance_id": "9b280680-4fcd-4e11-7a48-38b251758191",
    "server_cert_domain_san": "routing-api.service.cf.internal",
    "protocol": "http1",
    "modification_tag": {
      "guid": "",    # <--- note the lack of capitalization here for guid
      "index": 0  # <--- note the lack of capitalization here for index
    },
    "isolation_segment": "-",
    "isTLS": true
  }
}

AFTER

{
  "log_level": 1,
  "timestamp": "2024-09-27T14:36:32.792195379Z",
  "message": "endpoint-not-unregistered",
  "source": "vcap.gorouter.registry",
  "data": {
    "uri": "api.near-3161542.z8dd11304.shepherd.lease/routing",
    "route_service_url": "",
    "backend": "10.0.1.6:3001",
    "application_id": "",
    "instance_id": "4493b7d4-e400-4dd0-7da4-4c607290083d",
    "server_cert_domain_san": "routing-api.service.cf.internal",
    "protocol": "http1",
    "modification_tag": {
      "Guid": "",  # <--- note the capitalization here for Guid
      "Index": 0  # <--- note the capitalization here Index
    },
    "isolation_segment": "-",
    "isTLS": true
  }
}

@ameowlia
Copy link
Member

ameowlia commented Sep 27, 2024

This is how I am doing my comparison by the way...

  1. I run two environments through the same actions:
  • push an app
  • stop nats
  • start nats
  • restart the router vm
  • restart a diego cell vm
  • delete the app
  1. Then I get all of the uniq log messages jq .message LOG_FILE | sort | uniq
  2. Then I use this script to compare the log files for each one of those messages...
#!/bin/bash
messages=(
"Draining with 0 outstanding active connections"
"Draining with 24 outstanding idle connections"
"Successfully-connected-to-nats"
"Successfully-subscribed-to-routing-api-event-stream"
"endpoint-not-unregistered"
"endpoint-registered"
"endpoint-unregistered"
"error-stopping-route-services-server"
"exited"
"failed-closing-health-tls-listener"
"failed-getting-next-event"
"gorouter-draining"
"gorouter.exited"
"gorouter.started"
"gorouter.stopped"
"gorouter.stopping"
"nats-connection-disconnected"
"nats-connection-reconnected"
"nats-connection-still-disconnected"
"prune-suspended"
"retrieved-isolation-segments"
"route-registered"
"route-unregistered"
"setting-up-nats-connection"
"setting-up-routing-api"
"stopping"
"subscriber-started"
"subscriber-starting"
"subscribing-to-routing-api-event-stream"
"syncer-started"
"tcp-listener-started"
"tls-listener-started"
"validation-error"
)


old_file="LOGFILE"
new_file="LOGFILE"
old_log_file="/tmp/old-log"
new_log_file="/tmp/new-log"

for i in "${messages[@]}"
do
    echo
    echo
    echo $i

    ag "${i}" $old_file --silent --nonumbers | head -n 1 | jq . > $old_log_file
    ag "${i}" $new_file --silent --nonumbers | head -n 1 | jq . > $new_log_file
    echo "------------------"

    diff $old_log_file $new_log_file
done

If the logs match it should only be the timestamp/URI/IP that are different like this...

validation-error
------------------
<   "timestamp": "2024-09-25T11:04:49.850291246Z",
---
>   "timestamp": "2024-09-27T14:45:00.824626396Z",

or

route-unregistered
------------------
<   "timestamp": "2024-09-25T15:17:47.874868993Z",
---
>   "timestamp": "2024-09-27T14:36:35.878030926Z",

<     "uri": "uaa.cove.meow.com"
---
>     "uri": "uaa.near.meow.com"

@ameowlia ameowlia merged commit 9f678fd into cloudfoundry:main Sep 30, 2024
1 check passed
@maxmoehl maxmoehl deleted the refactoring/zap-to-zapslog branch October 10, 2024 07:16
peanball added a commit that referenced this pull request Jan 31, 2025
The slog handler for Zap is configured to automatically emit stack traces at "Error" level.

In pure Zap, stack traces are added via option AddStacktrace, which was not used before the move to slog.

The Zap slog handler has a mapping from Zap levels to slog levels, and does not emit values above slog.LevelError. By setting the limit for automatically added stack traces via slog to ErrorLevel + 1, this effectively disables automatic stack traces.

The `panic-check` handler that catches and logs panics in handlers remains unaffected, as it explicitly logs a field called "stacktrace", as it did with pure Zap.

Related: #435
peanball added a commit that referenced this pull request Feb 3, 2025
The slog handler for Zap is configured to automatically emit stack traces at "Error" level.

In pure Zap, stack traces are added via option AddStacktrace, which was not used before the move to slog.

The Zap slog handler has a mapping from Zap levels to slog levels, and does not emit values above slog.LevelError. By setting the limit for automatically added stack traces via slog to ErrorLevel + 1, this effectively disables automatic stack traces.

The `panic-check` handler that catches and logs panics in handlers remains unaffected, as it explicitly logs a field called "stacktrace", as it did with pure Zap.

Related: #435
peanball added a commit that referenced this pull request Feb 3, 2025
The slog handler for Zap is configured to automatically emit stack traces at "Error" level.

In pure Zap, stack traces are added via option AddStacktrace, which was not used before the move to slog.

The Zap slog handler has a mapping from Zap levels to slog levels, and does not emit values above slog.LevelError. By setting the limit for automatically added stack traces via slog to ErrorLevel + 1, this effectively disables automatic stack traces.

The `panic-check` handler that catches and logs panics in handlers remains unaffected, as it explicitly logs a field called "stacktrace", as it did with pure Zap.

Related: #435
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants