Article #8 from 2023
During my first year working as a Sovereign Google Cloud Engineer on a major defense company's public cloud, I had the opportunity to be trained by Google on Site Reliability Engineering and to apply that knowledge in my work. SRE enables businesses to balance innovation with stability and to scale efficiently while taking calculated risks.
Reading Time: 20 minutes
Site Reliability Engineering (SRE) has three missions: avoid losing information, avoid getting it stolen, and serve it well. Systems can't always be up, but users expect them to be. Aiming for 100% uptime/SLA is nonsense: it freezes you in time and makes people dependent on that level of reliability. Rollouts cause the largest share of outages, yet they must be done frequently. When big companies go down, it makes the news because so many people rely on them.
SREs are tasked with maintaining the company's reliability and cover both engineering and operations. They have to maximize the rate at which new features and products can be delivered to users without breaking anything. Monitoring is at the base of SRE.
Breaking something for the first time is celebrated and people are not blamed for it, to an extent, as it is mostly the system's fault: your infrastructure shouldn't fall over because of a single action. A postmortem follows each incident to establish what happened and how much money was lost, and people give the full truth more easily when they're not blamed. Breakages happen and people shouldn't be afraid of dealing with them. Resistance is futile regarding the need to use newer and better services. Things shouldn't be too reliable, as people become dependent on that reliability.
SREs are responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning. The goal is to spend less than 50% of the time on Ops and more than 50% on Eng. You should promote mobility between teams: it helps people see the bigger picture and understand things more clearly, and it avoids having a SPOF (Single Point Of Failure) where one person is the only expert. SRE is the fire department that puts out fires but also wants to prevent fires from starting, by being involved in building design so the risk of a fire is as low as possible.
Failure Happens - Software, network, machine, power and everything fails. Nothing surprises SRE.
Be Resilient & Redundant - Think about other ways to deliver a service and prepare degraded versions with graceful degradation. Fast is better than slow, obviously. If you suddenly have too much traffic, handle user-facing traffic and drop the rest.
Automate - The goal of SRE is sublinear scaling: operational effort should grow logarithmically even while the service grows exponentially. When you automate, people can't make manual mistakes.
Blood vs Steel - Find and fix the root cause. Avoid throwing more people at an issue. The trigger and the root cause often aren't the same.
Diverse Opinions Matter - A positive and friendly workplace where people can disagree without taking it personally.
Production is the computer, network, storage and datacenter infrastructure built to serve services to your users and customers. However, computers, networks, storage and datacenters can break and, as hope is not a strategy, you must have backup plans. Production scales with automation, n+2 redundancy and resilience by continuing operations in degraded mode, for example dropping 100 qps when you receive 600 with a capacity of 500. We all prefer a product that works without all its features over a product that doesn't work at all. Perform Disaster Recovery Training (DiRT) by breaking things and then fixing them; you learn a lot.
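As a rough illustration of that degraded mode, here is a minimal load-shedding sketch. The capacity of 500 qps comes from the example above; the 10% reservation for user-facing traffic and the sliding-window mechanism are my own assumptions, not the article's tooling.

```python
# Minimal load-shedding sketch: admit at most CAPACITY_QPS queries per second,
# keep some headroom for user-facing traffic, and drop the overflow instead of
# letting the whole service fall over.
import time
from collections import deque

CAPACITY_QPS = 500  # what the service can actually sustain (from the example above)

class LoadShedder:
    def __init__(self, capacity_qps=CAPACITY_QPS):
        self.capacity = capacity_qps
        self.window = deque()  # timestamps of queries admitted in the last second

    def _admitted_last_second(self, now):
        while self.window and now - self.window[0] > 1.0:
            self.window.popleft()
        return len(self.window)

    def admit(self, is_user_facing):
        now = time.monotonic()
        used = self._admitted_last_second(now)
        # Reserve the last 10% of capacity for user-facing traffic (assumption).
        if used < self.capacity * 0.9 or (is_user_facing and used < self.capacity):
            self.window.append(now)
            return True
        return False  # degraded mode: shed this request

shedder = LoadShedder()
# At 600 incoming QPS, roughly 100 QPS of the lowest-priority traffic gets dropped.
print(shedder.admit(is_user_facing=True))
```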
Common causes of failure:
Computers - HDD/SSD, Power, CPU, Memory, Boards, OS.
Managed by automating diagnostics and repairs, redundancy by running on multiple machines (n+2: one spare for planned maintenance, one for unexpected failures), and resilience by having more than one way to do the work.
Storage - (Many exabytes) HDD & SSD (HDDs are more cost-effective for cold storage), Power, Machine, Operators. The bigger the disk, the slower the seek time, but bigger disks are cheaper. Fast-access data goes on the outer part of the platter and backups on the inner part due to spin/seek speed. People don't want backups, they want restores.
Managed by automatically handling broken disks and data migration, redundancy with storage in multiple places, resilience with backups.
Networks - Undersea fiber networks span the globe with electricity-powered repeaters; Network switches, Power, Bad OpenFlow SDN config pushes, Humans. Silent corruption matters: a high-speed trader's buy order can be corrupted on its way out if the checksum is corrupted too.
Managed by automating mitigation of network problems, redundancy by having multiple components and paths, resilience by having multiple backbones and graceful degradation.
Datacenters - A physical rack filled with machines is part of an enclosure with other racks under a single Top-of-Rack (ToR) switch. A cluster is a collection of rows, and a campus is several datacenters (DC). Everything has a name (fabric, metro, Point-of-Presence (PoP), building, DC). Failure causes: Power, Cooling, Nature, Humans. Safety first: Human life >>> Equipment > Service.
Managed by automating with load-balancing and climate control, redundancy with multiple cooling and power components and generators, resilience with services running in multiple locations.
You expect growth: many requests per second, too many for one machine to handle, so you need many instances of your UI and logic servers. Don't mix features in one server; do one thing really well. Prefer scaling horizontally over vertically and add a safety margin, because ~15% of machines reboot daily.
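A back-of-envelope sizing sketch, assuming a peak load and a per-machine capacity that are not from the article; only the ~15% daily reboot figure and the n+2 rule of thumb come from the text.

```python
import math

# Illustrative replica sizing with a safety margin.
peak_qps = 12_000          # assumed peak load
qps_per_machine = 500      # assumed per-machine capacity
reboot_fraction = 0.15     # ~15% of machines may be rebooting on a given day

base = math.ceil(peak_qps / qps_per_machine)           # 24 machines to serve peak
with_margin = math.ceil(base / (1 - reboot_fraction))  # 29: survive daily reboots
n_plus_2 = with_margin + 2                             # 31: maintenance + unexpected failure

print(base, with_margin, n_plus_2)
```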
Hardware is referred to as a machine, not a server. A cluster consists of several machines. A running binary that offers a service is called a task, and several tasks form a job. To handle many tasks, you need a scheduler (tasks must be moved if a machine fails), an addressing scheme (tasks can move and need to be found) and communication between tasks (which protocol to use).
A Scheduling System / Cluster Manager like Borg runs, configures and manages jobs at scale. The Cluster Manager decides where to run a job and controls the amount of resources needed to run it. On each machine runs an agent process that accepts RPCs, packages (collections of files) and tasks, which are instances of a program being executed.
Locking means having only one process working while the others are on hot standby. In a traditional environment, you would have set a mutual exclusion lock (mutex). At cluster scale, use a cluster-wide mutex/lock and a name server to map jobs to hosts and ports and names to machines. When a job is started, its tasks get addresses that you must use to reach them.
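A toy sketch of that pattern, assuming hypothetical LockService and NameService classes that stand in for a real distributed lock service and name server (none of these names come from the article): one task takes the lock and serves, the other stays on hot standby, and addresses are looked up rather than hard-coded.

```python
# Hypothetical cluster-wide lock + name server pattern (stand-in classes only).

class LockService:
    """Stand-in for a distributed lock service."""
    def __init__(self):
        self._owner = None

    def try_acquire(self, lock_name, owner):
        if self._owner is None:
            self._owner = owner
            return True
        return self._owner == owner

class NameService:
    """Stand-in for a job/task -> address resolver."""
    def __init__(self):
        self._table = {}

    def register(self, job, task_index, address):
        self._table[(job, task_index)] = address   # updated when tasks are rescheduled

    def resolve(self, job, task_index):
        return self._table[(job, task_index)]

locks, names = LockService(), NameService()
names.register("frontend", 0, ("10.0.0.7", 8443))

for task_id in ("task-0", "task-1"):
    if locks.try_acquire("frontend/leader", owner=task_id):
        print(task_id, "is active, serving at", names.resolve("frontend", 0))
    else:
        print(task_id, "is on hot standby")
```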
RPC (Remote Procedure Call) is the mechanism used to communicate between tasks, for example gRPC with protobuf (Protocol Buffers), used because HTTP is too heavy. Protobufs are a language-independent way of declaring network and disk data structures, from which you can generate code in any programming language, and of sending that data over the network.
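To make that concrete, here is the shape of such a declaration, held in a Python string because the article includes no schema of its own; the Echo service and field names are invented for illustration. A file like this is what protoc consumes to generate client and server code in each supported language.

```python
# Illustrative protobuf/gRPC service declaration (invented names, not from the article).
ECHO_PROTO = """
syntax = "proto3";

message EchoRequest { string payload = 1; }
message EchoReply   { string payload = 1; }

service Echo {
  // A single RPC: the server returns the payload it received.
  rpc Ping(EchoRequest) returns (EchoReply);
}
"""

# Saved as echo.proto, this is the language-independent declaration that
# protoc turns into generated code for Python, Java, Go, C++, and so on.
print(ECHO_PROTO)
```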
A launch is any change that introduces visible changes to an app or to production, intentionally or unintentionally, internally or externally visible, like new code or configuration changes. Launches are the single largest source of preventable outages as well as of negative publicity. Bad launches are often due to incomplete diligence, like not considering the scope and impact of the launch.
Control launches with canaries and gradual rollouts, dark launches, and careful automated A/B testing. Canary launches follow a log scale: you test on 0.1% of users, then 1%, then 10%, then 100%, each step after a short period of observation. Dark launches are not visible to end users and run behind the scenes to test specific parts. A/B testing is used to test user behavior.
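A minimal sketch of percentage-based canarying, assuming a stable hash of the user ID as the gating mechanism (the 0.1% / 1% / 10% / 100% stages come from the text; everything else is an assumption, not the article's tooling).

```python
# Percentage-based canary gating: hash a stable user id into [0, 100) and
# expose the new code path only to users under the current rollout percentage.
import hashlib

ROLLOUT_STAGES = [0.1, 1, 10, 100]   # log-scale stages, each after observation

def bucket(user_id: str) -> float:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF * 100   # stable value in [0, 100]

def in_canary(user_id: str, rollout_percent: float) -> bool:
    return bucket(user_id) < rollout_percent

current_stage = ROLLOUT_STAGES[1]    # e.g. the 1% stage
print(in_canary("user-42", current_stage))
```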
Launches are controlled by gradual rollouts and canarying to limit the blast radius and avoid pushing to several domains at once. Not rolling out means not progressing, which means losing market share and making customers unhappy. Resistance is futile regarding the need to use newer and better services.
Rollbacks are normal and have little to no consequences when using canaries and gradual rollouts. They are not negative; you just try again. Roll back first, fix later. Don't hide it if you break something, shout it out: you are rewarded, not punished, for admitting mistakes. Make rollbacks as fast as launches. Keep monitoring while launching and don't launch without a rollback plan.
Monitoring is at the base of the SRE pyramid: Monitoring > Incident Response > Postmortem/RCA (Root Cause Analysis) > Testing & Release procedures > Capacity planning > Development > Product. Without monitoring, nothing else matters: how would you know whether something needs immediate attention, whether something is wrong, or how to do long-term planning, make measurable improvements and prevent regressions?
You want to be aware of the system's or service's state and observe any changes occurring over time. Monitoring can be either Closed Box (is it working, is it fast enough, from an external view) or Open Box (what is wrong, from an internal view). Prefer alerting based on Closed Box monitoring even though it is more vague. Give the monitoring system the lowest level of data possible, without prior aggregation. Metric collection is push-based: the monitoring server tells the monitored servers once what it wants, and they then push regular updates back to it. Use real client data and queries rather than fake curls.
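A small closed-box probe sketch: it checks the service from the outside the way a client would ("is it working, is it fast enough") against a user-like request path. The URL and latency threshold are my assumptions for illustration.

```python
# Closed-box probe: external view only, no knowledge of the service's internals.
import time
import urllib.request

PROBE_URL = "https://example.com/search?q=probe"   # hypothetical user-like query
LATENCY_SLO_SECONDS = 0.4                          # assumed "fast enough" threshold

def probe(url=PROBE_URL):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = 200 <= resp.status < 300
    except OSError:       # covers URLError/HTTPError and network failures
        ok = False
    latency = time.monotonic() - start
    return {"ok": ok,
            "fast_enough": ok and latency < LATENCY_SLO_SECONDS,
            "latency_s": round(latency, 3)}

print(probe())
```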
Monitoring at scale places load on other components and requires lots of data storage as well as manipulation like aggregation and comparison. To monitor a distributed setup at scale, have local monitoring per cluster and a global aggregator.
You need to monitor the monitoring to keep it reliable by using n+1 monitoring, avoiding using a service you monitor to make monitoring work, doing meta monitoring (but who watches the watchers) and doing cross monitoring.
Failure is not optional, it is normal; everything will fail at some point: hardware, software, people, communications. IP (Internet Protocol) is connectionless in order to take different paths in case of failure. The Internet is designed to route around failure, so use that. To handle possible failures, overcompensate.
An SLI, for Service Level Indicator, is a carefully defined quantitative measure of some aspect of the level of service that is provided. An SLI can measure performance, availability, quality, internal systems and people, and is used for reporting, impact assessment and alerting. Its simplest form is good answers divided by total answers.
An SLO, for Service Level Objective, is a target value or range of values for a service level, as measured by an SLI. SLOs help you make decisions and express what you want and expect. An SLO takes the form of an SLI plus a target value, like 99.9% availability. The initial SLO should be simple and will improve over time, but it should be actionable.
An SLA, for Service Level Agreement, is an explicit or implicit contract with your users that includes consequences for meeting or missing the SLOs it contains. An unmet SLA makes the service be perceived as unreliable. Even when people are paid back for an unmet SLA, they're not happy.
SLI: what, where, for how long. SLO: how many. SLA: fee if not met. SLI > SLO > SLA becomes Measurements > Objectives > Consequences.
An SLA is less strict than the SLO, with consequences for the provider. Don't promise a higher SLO than the underlying layer, which often is the network. A 100% SLO is nonsense for almost anything but life-critical things like pacemakers and airbags. A 99% SLO over a quarter allows about 21.9 hours of downtime; 99.9% allows about 2.19 hours. Every extra 9 costs roughly 10x more effort to achieve.
The EB (Error Budget) is 1 - SLO and must be spent on outages and launches. The better the code, the fewer the errors and the more opportunities to deploy. SLOs and the Error Budget balance reliability with feature velocity. Page people (activating on-call and creating a ticket) only when the alert is actionable and impacting.
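A worked example of the arithmetic above. The 99.9% target, EB = 1 - SLO and the ~2.19 h/quarter figure come from the text; the request counts are made up for illustration.

```python
# SLI / SLO / error-budget arithmetic.
QUARTER_HOURS = 91.25 * 24          # ~2190 hours in a quarter

slo = 0.999
error_budget = 1 - slo              # 0.1% of requests (or time) may fail

# SLI in its simplest form: good answers / total answers (illustrative counts).
good, total = 9_981_234, 9_990_000
sli = good / total

downtime_allowed_h = QUARTER_HOURS * error_budget
budget_left = 1 - (total - good) / (total * error_budget)

print(f"SLI = {sli:.5f}, allowed downtime per quarter = {downtime_allowed_h:.2f} h")
print(f"error budget remaining: {budget_left:.1%}")
```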
Product Management defines the SLO, the SLO influences design, and design influences performance. Engineering is the use of scientific principles to design and build. Innovation and Reliability meet, meaning you handle and balance proactiveness and change with support and reactiveness, which is possible thanks to the Error Budget. Not being able to express something with numbers is unsatisfactory, as it can't be weighed or evaluated.
If you want to run reliable, fast and cheap systems, accept that you can't serve every query but only 99.X% of them. Everything else can fail as well, and every extra 9 drives the cost up exponentially. The art of scaling is done with NALSD (Non-Abstract Large System Design). You must do quick math to realize the magnitude: millions of users means hundreds of millions of QPS (queries per second), which means billions of disk inputs/outputs per second (IOPS), which means thousands of machines, which means hundreds of square meters. Scaling brings additional problems, and large problems bring complexity.
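One way to do that quick math, with every input number an assumption of mine; the article only gives the users → QPS → IOPS → machines → floor space chain.

```python
# NALSD-style back-of-envelope arithmetic (all inputs are illustrative assumptions).
import math

users = 5_000_000
queries_per_user_per_s = 20            # assumed client behaviour
qps = users * queries_per_user_per_s   # 100,000,000 QPS

iops_per_query = 10                    # assumed fan-out to storage
iops = qps * iops_per_query            # 1,000,000,000 IOPS

qps_per_machine = 10_000               # assumed serving capacity
machines = math.ceil(qps / qps_per_machine)        # 10,000 machines

machines_per_rack, racks_per_m2 = 40, 0.5          # assumed density
floor_m2 = machines / machines_per_rack / racks_per_m2

print(f"{qps:,} QPS, {iops:,} IOPS, {machines:,} machines, ~{floor_m2:,.0f} m²")
```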
Everything is load balanced. Send requests to the FE (Front End) that is the closest to the user's location, provided it has capacity. Draining something means not sending traffic to it.
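A tiny sketch of that selection rule: pick the nearest frontend that is not drained and still has headroom. The frontend list, latencies and capacities are made up.

```python
# "Closest front end with capacity, skipping drained ones" (illustrative data).
FRONTENDS = [
    {"name": "fe-paris",   "rtt_ms": 12, "load": 480, "capacity": 500, "drained": False},
    {"name": "fe-belgium", "rtt_ms": 18, "load": 200, "capacity": 500, "drained": False},
    {"name": "fe-finland", "rtt_ms": 35, "load":  50, "capacity": 500, "drained": True},
]

def pick_frontend(frontends):
    candidates = [fe for fe in frontends
                  if not fe["drained"] and fe["load"] < fe["capacity"]]
    return min(candidates, key=lambda fe: fe["rtt_ms"]) if candidates else None

print(pick_frontend(FRONTENDS)["name"])   # fe-paris: closest and still has headroom
```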
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, devoid of enduring value and that scales linearly as the service grows. Toil feels like the myth of Sisyphus.
Compare the time spent on automation with the time lost doing the work manually over time, to see whether the change is worth it. Consultation (reviews), non-ops overhead (meetings) and actionable outage responses are toil. Common sources of toil are noisy tickets (non-actionable), noisy pages and pings (direct messages for manual requests). To make good pings, be direct by putting the reason for the interruption in the first message instead of just saying hello and waiting. Alert mainly on consequences (SLOs) rather than causes (technical failures).
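A simple break-even check for the automation-versus-manual comparison; all the numbers are assumptions, not from the article.

```python
# "Is automating this worth it?": one-off automation cost vs recurring manual toil.
automation_cost_h = 40          # assumed engineering time to build and review the automation
manual_cost_h_per_week = 1.5    # assumed toil spent doing it by hand
horizon_weeks = 52              # how far ahead you are willing to look

manual_total_h = manual_cost_h_per_week * horizon_weeks
break_even_weeks = automation_cost_h / manual_cost_h_per_week

print(f"manual cost over the horizon: {manual_total_h:.0f} h")
print(f"automation pays for itself after ~{break_even_weeks:.0f} weeks")
```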
Toil is often due to unreliable rollouts and overly protective file ownership. Fix toil in order to reduce self-inflicted pain, multiply your impact and get more engineering time. Measure toil to know how big it is and to see the results of your fixes; you can also tag outages and do shift handovers to keep information flowing.
You can't remove every bit of toil, but having a quarterly OKR to remove at least one source of toil helps avoid accumulating technical debt. Don't alert on every failed task, set thresholds instead. Opening your source code for others to see and request changes is important for innovation and cooperation.
You are on call to uphold your services' SLOs with reactive and proactive pages, but also to gain experience and drive changes. It works as ACK (Acknowledge) > Triage (assess impact) > Mitigate & Verify > Debug > Short-term fix > Cleanup > Root cause > Postmortem written > Resolve (long-term fix), or TMR for Triage > Mitigate > Resolve.
During triage, you assess the impact by focusing on the impact on users. A mitigation is a low-risk, fast, reversible and simple action to reduce the impact of an outage, even if, say, a restart destroys some state. You must shorten the amount of time users are impacted, so balance triage time against action. Check capacity before draining to mitigate, and verify afterwards that it actually worked.
The next goal is to make the problem never happen again by noticing patterns of problems and eliminating them systematically. On-call tooling includes an emergency IRC (Internet Relay Chat) in case every other service, like the incident management and videoconference tools, fails. You must keep detailed and organized records from beginning to end, with timestamps and few chat windows. Postmortems are an asset for the company and your career; they must be blameless and actionable.
Regarding complex outages that potentially require escalation, like when you have no idea what is happening and/or the impact is very high, proceed to Escalate after Triage. The issue still has to be fixed, just not necessarily by yourself. You may need to escalate to get advice from others, when the impact is large, when the cause originates in another system, if there are security/privacy issues, or if you feel over-stressed. Sometimes all you need is someone or something to listen to you to figure things out, and that's also escalation. Escalation isn't talking to someone's boss, it's about getting more people involved horizontally and/or vertically: up, down and sideways. An on-call SRE's requests overrule a VP's.
No matter the outage's scale, don't panic: you have entire teams backing you up, so follow the same steps as usual and communicate. Keep communication in one place, limit the number of emails, share documents' /preview links and add a summary at the top, use communication channels wisely and hand off your role like a TCP three-way handshake (SYN > SYN-ACK > ACK). When escalating to another team, check their intranet page to follow their contact policy.
Even though you may not feel it, being on call is stressful, as you are in high-availability mode all the time. Stress is a reaction to external stressors, and a certain amount of it is normal and beneficial; it can even lead to improved performance because you're more motivated and focused. However, past a certain point, once you have too much stress, you get exhausted, angry and much more.
As in airline pilots' stress management, three reactions occur when facing stress: fight, flight, freeze. Pause for anywhere from a few seconds to a few hours when you feel you need it, and don't hesitate to hand your load over to others if you don't feel capable. SRE teams are close-knit and look out for each other informally, so keep an eye out for over-stress in the team. Stress factors include fear of the unknown, substantial complexity, external parties, time pressure, ability mismatch, and deviations from policies and procedures.
Reduce on-call stress by asking for help, delegating pending tasks, detecting over-stress, staying aware of your colleagues, maintaining situational awareness, anticipating critical situations, using checklists, building watchdog techniques, noting inconsistencies, debriefing long activities and using routines. Avoid having a hero who does everything, or you'll come to rely on them and you'll never grow as a team nor build new exciting things; it's bad for the system.
Outages will happen and behavior is not always clear; a good postmortem lets you learn from mistakes. In the end, you're doing something good by avoiding making the same mistake again. You write a postmortem when a service violates its SLO (service-level objective) and has any impact (financial, technical, etc.). An outage is a fee you pay to learn.
Failures often follow a pattern. You learn from your peer teams' failures and can make respectful comments and suggestions. If you don't write enough postmortems, it may mean that you don't launch enough new features. Postmortems are part of the job; they must include facts and be blameless, with the goal of addressing the root cause. The trigger isn't necessarily the root cause. If the system is broken, it's not the operator's fault when they trigger it.
Bad postmortem habits: inconsequential findings, unclear or missing user impact, an overly verbose timeline, promoting heroism (heroes can't replace preparation and hold the team back because everyone comfortably relies on them). Asking for a postmortem is not a bad thing but a way to ensure the root cause is found. Be a scientist: gather data and write hypotheses.