Design Thinking and Service Management

Design thinking principles always point us to focus on the humans. Service management has many stakeholders: your users and clients, marketing, sales, product management, and engineering teams. It is, however, central to the job of a Site Reliability Engineer (SRE), so I will focus on SREs for now. Because service management matters to such a wide variety of stakeholders, we want to provide common, usable, shareable language that facilitates alignment.

The term SLA, or Service Level Agreement, is often used without any meaningful context. Marketing or sales material might brag about a “four-nines SLA”, but that statement leaves the engineering team with more questions than answers. What does availability mean? How do we measure it to know whether we are meeting four nines? How do we monitor to know when services are not available? Are there allowances for scheduled downtime? What penalty is paid if we fall short of four nines? Let’s start by creating some clarity around the language we use to describe the business decisions that drive the design decisions made by the engineering team.

Service Level Objectives

Most of the time, when people outside of IT Operations talk about SLAs, what they really mean is an SLO, or Service Level Objective. SLOs are the business requirements that describe the desired availability, responsiveness, and recoverability goals for an application or service. They are typically expressed as “nines” of availability, stratified response time metrics, and disaster recovery point and time objectives. As a business, there is always a temptation to set very high service level objectives, with no scheduled downtime or maintenance and near-instant disaster recovery.

Business leaders should recognize that the cost of operating a service is tied directly to its service level objectives. If the business has a 24/7/365 availability requirement at 99.99%, it also has a 24/7/365 staffing requirement for the support team, along with significant investments in highly available infrastructure and low-impact deployment automation.

The other important business impact of setting proper Service Level Objectives relates directly to managing the risk of deploying new code. The delta between 100% and the SLO is the budget (often called the error budget) that an active DevOps team will use to manage the release cadence. Setting unnecessarily high SLOs will constrain the velocity at which your team can deploy new code.
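To make the budget idea concrete, here is a minimal sketch that converts an availability SLO into the minutes of unavailability it allows. The function name `error_budget_minutes` and the 30-day window are assumptions chosen for illustration; your own window and definition of availability may differ.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed in the window for an availability SLO.

    Hypothetical helper: assumes availability is measured as uptime over a
    rolling window of `window_days` days.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    window_minutes = window_days * 24 * 60
    # The budget is the delta between 100% and the SLO, applied to the window.
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        budget = error_budget_minutes(slo)
        print(f"{slo}% over 30 days leaves about {budget:.1f} minutes of budget")
```

Under these assumptions, each additional nine shrinks the budget by a factor of ten, which is why an unnecessarily high SLO leaves so little room for risky deployments.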

Service Level Indicators

Capturing concise business requirements as Service Level Objectives is a great first step. Now we need to identify the KPIs, the Service Level Indicators (SLIs), that demonstrate SLO compliance. More importantly, we need to track the buffer remaining within an SLO to support decisions about managing the risk associated with new code deployment.
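As a rough illustration, the sketch below computes a request-based availability indicator and the share of the SLO buffer that is still unspent. The `WindowCounts` type, the function names, and the choice of "good requests divided by total requests" as the indicator are all assumptions made for this example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Hypothetical request counts collected over one measurement window."""
    total_requests: int
    good_requests: int


def availability_sli(counts: WindowCounts) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if counts.total_requests == 0:
        return 1.0
    return counts.good_requests / counts.total_requests


def budget_remaining(counts: WindowCounts, slo: float) -> float:
    """Fraction of the buffer left: 1.0 is untouched, 0.0 is spent, negative means the SLO is violated."""
    allowed_bad = (1 - slo) * counts.total_requests
    actual_bad = counts.total_requests - counts.good_requests
    if allowed_bad == 0:
        # Either no traffic or a 100% SLO: any failure exhausts the budget.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    window = WindowCounts(total_requests=1_000_000, good_requests=999_950)
    print(f"SLI: {availability_sli(window):.5f}")
    print(f"Budget remaining: {budget_remaining(window, slo=0.9999):.0%}")
```

A team might use a number like this as a release gate: deploy freely while plenty of buffer remains, and slow down or focus on reliability work as it approaches zero.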

In this space, again, we see why the SRE is the human that successful designs will focus on. Implementing effective monitoring and notification services that are tied directly to service level objectives is such a significant part of IT Operations that it has its own sub-domain: Observability. Observability refers to the ways that a system can be monitored. This monitoring is done not just to demonstrate that SLOs are being met, but also to inform risk management and to proactively protect services before SLOs are violated.

From a Design Thinking perspective, it is not coincidental that this practice is called Observability, as it pegs the Run stage of a Design/Build/Run SDLC loop to the Observe stage of the Design Thinking Observe/Reflect/Make loop. Using both monitoring and analytics, we observe how the system is used in production, and then as a team we reflect on those observations to design the next loop.

Service Level Agreements

I want to start this part of the conversation by stating that most services or systems do not have, nor do they need, service level agreements. A Service Level Agreement is a legal document in which two or more parties establish financial or other penalties for a service that does not meet its Service Level Objectives. If there is no penalty, then it’s not an SLA.

Some of the largest and most widely deployed software in the world, things like Gmail, Facebook, or Amazon, do not have SLAs. Not every system needs an SLA; however, every system should define SLOs and the associated SLIs in order to effectively manage the risks associated with a rapid release cadence. These are the business requirements and technology specifications that inform key architectural and deployment decisions.

If you want to read more, the Cloud Adoption Playbook has some really good coverage of this aspect of IT Service Management. If you would like some help defining the service management practices of your engineering team, reach out and schedule some time to discuss your needs.
