This document defines logging and the expectations from logs that are generated at different parts of the system. The purpose of logging is to allow transparency into the state of systems. It is used to keep tabs on processes running and to allow investigative analysis. Some examples of when logs are investigated include: audits, crash reports, partner dialogs, certain aggregation metrics, and diagnosing problems.
Logging should allow investigation and involves a specific part of code (unit) or on a whole process (integration). Logs are broken into severity levels. Each level has a set of standards. The code authors and reviewers should try to ensure that logs at each of these layers are well represented and can be interpreted independently of each other.
The levels are grouped into four larger categories that we are more concerned with:
- Process Layer Logs: starting, ending, and failure for a process. Could include status such as memory usage and cpu usage.
- Application Errors: application code making note of any known and unknown error conditions (all unexpected)
- Application Audit Information: application exposing information about significant events in a running system (e.g. Warnings and Notices)
- Application Status Logs: application details about high level operational progress. (e.g. processed x records successfully, starting data dump etc)
Another important concern for logs is to be able to filter out logs for the instance, application or feature that is under investigation. For e.g. when debugging a specific feature details logs for that feature are useful, but logs from other parts of the system should be ideally hidden. These separation is often referred to as locality
or facility
. More details around locality are discussed in the section on debugging.
Individual lines should be readable and provide any necessary context to understand what is going on at that point.
-
Remember: consecutive lines for one process are not necessarily consecutive in the logs once aggregated. Avoid logging within
loops
or functions that are called repeatedly that might generate more noise than context. -
Be concise.
-
Log lines should be filterable - if components go together, you should be able to filter for those components.
-
Log messages should link related processes (e.g. using a prefix, using a common id across a system).
-
All informational logs should go to stderr. Only program output should be logged to stdout. This is mostly useful for command-line programs so they can be piped together - for services it doesn't matter as much - but it's a good habit to get into.
Logs should be machine parsable and human readable. The Kayvee logging libraries make it easy to enforce log formats.
- log with
key:value
pairs or a known format such asjson
oryaml
. The key, value delimiter should be consistent with the language semantics. e.g.route:api/sso type:invalid-request message:"access code not found" user:"john doe"
. - whenever possible add unique identifiers to the start of the log line. e.g.
[b65318c4-8330-4d50-993d-6d76d41c29bb] REQUEST: path=/resource/id remoteaddr=127.0.0.1:47059 Connection=close
- the application deployment environment should append timestamp, hostnames, process name and ideally log level to the beginning of the line.
Jun 9 22:50:55 machine-name-12389123 supervisord: sync_service id=000000 action=posting-coach....
Further reading: logfmt by Heroku
Don't log credentials!
- log usernames not username/password pairs
- don't log sensitive information (e.g. API Keys, tokens, personally identifiable data). In some cases a helper library that scrubs the data before logging could be helpful.
NOTE Ask yourself: would it be acceptable to send these logs over email
or to an auditor
? If not, don't log it.
e.g. containers, process control systems, buildpacks, etc
Logs usually generated by process managers (supervisord, upstart), instances (kern.log). Some of these logs might be generated as metrics to keep tabs on CPU, Memory and Disk Usage by processes or on machines. Some logs to expect:
- process restart, crashes, out of memory errors.
- machine restarts, disk failures, kernel panic.
- installation failures (eg. dependencies installation failed, launch configuration failure)
Expected Levels:
Level | Name | Description | About |
---|---|---|---|
0 | Emergency | Emergency: system is unusable | The highest priority, usually reserved for catastrophic failures and reboot notices. |
1 | Alert | Alert: action must be taken immediately | A serious failure in a key system - These should alert someone (ala PagerDuty and require immediate attention. They often bubbles up from error, warning, or critical logs. |
Running application code should make note of any known and unknown error conditions. There should be no unexpected errors that are ignored - see Application Audit Information
for situations where it might be reasonable to swallow errors.
Examples of situations where Application Errors can usually be found:
- Errors during requests or connections to external services. (database connections, third-party or internal APIs)
- Validation failures when processing user entered data.
- Disk write failures, network connection errors and other system failures.
- Unexpected data or conditionals (eg. places where you want to comment
this should never have happened
)
Level | Name | Description | About |
---|---|---|---|
2 | Critical | A failure in a key system. | This is when an error occurred and the process cannot recover and continue. - It is often logged right before exiting with code 1. (See error handling documentation for best error handling practices) e.g. "could not connect to database" |
3 | Error | Something has failed | This is when an error occurred but we recovered and the application can continue running. (See error handling documentation for best error handling practices) e.g. "request to fetch user failed. please try again" |
-
Error:
Node: use 'console.error' or use a logging library
Python: use 'logging.error'
Go: 'log.Errorf' or use a logging library
-
Critical Log using the same format used for Error and send to an appropriate error alert system (eg. Sentry)
- error context (the application name, function name, operational context)
- message/reason
- type of error (eg. user, external, internal). External errors are usually errors from network dependencies (third-party API, database error)
- stack trace and/or error code that can be used to find exact line where error was thrown.
Audit information is concerned with exposing in logs information about significant events in a running system. This information can be used to design automated systems that keep tabs on the health of the application. These logs are also used to create reports on failure rates and planning for future optimizations to the system. Some examples of audit logs are listed below:
- Application startup and shutdown
- Start and end of jobs, including exit status
- Access logs (users logging in to the application)
- Requests and return status
Level | Name | Description | About |
---|---|---|---|
4 | Warning | Something is amiss and might fail if not corrected. | These are warning messages (not an error) but an indication that an error will occur if action is not taken. e.g. "file system 85% full" |
5 | Notice | Things of moderate interest to the user or administrator. | This is information such as progress information, significant events, auditing information. e.g. "worker Y started with payload X" |
-
Log starting and ending of major components or anything needed to show the major work flow of our system.
Node: use 'console.warn' or a logging library
Python: use 'logging.info' or 'logging.notice'
Go: 'log.Print(f|ln)' or use a logging library
-
component name (e.g. which process/worker)
-
input or output
-
actor (user, process name, system id)
e.g.
"system\_id=123456789012: reverse-words-worker started with payload 'abcdefgh'"
"system\_id=123456789012: reverse-words-worker finished 'hgfedca'"
These logs are typically used to identify the operational state of a running system, or getting additional context when investigating issues. Using Audit logs, it should be easy to build a model of the application flow.
Level | Name | Description | About |
---|---|---|---|
6 | Info | The lowest priority that you would normally log, and purely informational in nature. | This is the most common type of logging. It explains how the application is running, what component is running at what point, etc. It is mostly for investigative purposes after the fact. e.g. "validating url" |
7 | Debug | This is the lowest priority, and normally not logged except for messages from the kernel. | This is the lowest priority, and normally not logged except for messages from the kernel. |
- component name (e.g. which process/worker)
- input or output
- actor (user, process name, system id)
-
Log starting and ending of major components or anything needed to show the major work flow of our system.
Node: use 'console.error' or a logging library
Python: use 'logging.info'
Go: 'log.Print(f|ln)' or a logging library
-
Component name
-
Message/reason
-
Context (often input or env info)
name any object you dump (like payload, uri, csv headers, DOM object)
e.g. "reverse-words-worker:info validating input payload '{"word": "123456789012"}'"
- Individual lines should be readable and provide any necessary context to understand what is going on at that point.
- Avoid info logs within loops
-
Add debug lines as you see fit.
Node: use 'debug'
Python: use 'logging.debug'
Go: no standard at the moment
There is no standard format. However, locality
is extremely important for debugging logs. Ensure that there is a token that allows developers to focus on the debug messages only for the context they are interested in. This can take the form of classname
or modulename
or the component
that is generating these logs. Some tools such as debug
for node allow the use of environment variables
to control which logs are logged to STDOUT.
Debugging is for whatever you need as it is only for the development environment.
NOTE Your code is for developers too, so make it clear why you are logging something or needed to log something.