How Google Runs Production Systems
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy
Software runs for far longer than it takes to implement. Google's SRE team builds everything else needed to support Google's apps and services after development. This book collects principles, practices, and management approaches to help you bring some of their ideas into your own organization.
- Google was scaling up at the time the sysadmin role was being redefined (into DevOps)
- Google's way forward was the Site Reliability Engineer (SRE)
- Stories about building Google's infrastructure, but also about how it studied and decided which tech to use
- In our "just show me the code" culture, "Google ... dared to think about the problems from first principles"
- "Stories like these are far more valuable than the code or designs they resulted in. Implementations are ephemeral, but the documented reasoning is priceless. Rarely do we have access to this kind of insight."
- Not just scaling computer architecture, but also business process
- "Software engineering tends to focus on designing and building software systems"
- Site Reliability Engineering (SRE) focuses on the whole lifecycle of software including deployment and operation
- SREs are engineers who write software, focus on system reliability, and operate distributed services
- "Ben Treynor Sloss, Google’s VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product"
- "Managing change itself is so tightly coupled with failures of all kinds"
- Broader applications to other communities and organizations
- Margaret Hamilton, who worked on NASA's Apollo program while on loan from MIT, had the traits of the first "SRE"
- "A thorough understanding of how to operate the systems was not enough to prevent human errors"
- SRE way: "thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it"
- "Software systems are inherently dynamic and unstable."
- You could only make it perfectly stable if you could prevent change (in the codebase, libraries, userbase, ...)
- SREs balance agility vs. stability
- Software gets bloated over time from adding new features, which also introduces the opportunity for more bugs
- "A smaller project is easier to understand, easier to test, and frequently has fewer defects."
- "Some of the most satisfying coding I’ve ever done was deleting thousands of lines of code at a time when it was no longer useful."
- Software must be simple to be reliable
- Simplifying the steps of a task is not being lazy: it clarifies what needs to be accomplished and the easiest path to get there
- Saying "no" to features keeps the environment free of distractions, so the focus stays on real engineering