Metrics Sense: Designing a Metric I

An example interview question on designing a metric for an oncall process.

Question

How do you measure the effectiveness of an on-call process?

Background

Being able to effectively define and measure success metrics for your program is critical to the TPM role. This question is intended to help develop these skills.

Similar to a previous question, we’ll use a role-playing approach to simulate a real, live interview.

Solution approach

A common mistake here is to start by brainstorming a long list of possible metrics. We want a complete understanding of the problem before we dive into the weeds. We’ll use the following approach for this question:

  • Clarify the question given its ambiguous nature.
  • Outline the process workflow and state our goals and assumptions.
  • Propose metrics that will measure how we are doing to meet our stated goals.

Sample answer

Interviewee: I want to start by clarifying what we mean by an “on-call process.” I understand this to mean that we have a rotation of engineers (or other staff) who are responsible for handling alerts or incidents for a particular system for a period of time. Typically, the rotation has a primary and a backup who are on-call at the same time. Does that align with your intent?

Interviewer: Yes, that sounds good.

Interviewee: Got it. Let me walk through the workflow of an on-call engineer to make sure I understand the picture:

  • The on-call engineer receives an alert via some communication channel (phone call, text message, email, etc.) that indicates a problem.

  • They then acknowledge the alert and begin investigating it.

  • If the alert is actionable, the engineer then takes action to mitigate it. This action will take some variable amount of time.

  • They may also document their findings in a playbook to ensure a standard response in the future.

Interviewer: Sounds good to me.

Interviewee: Okay, so based on this workflow, I see three main buckets of metrics:

  • Alert health: We want to minimize false positive/noisy alerts and maximize the volume of actionable alerts. We can use a metric like Signal-to-noise ratio per alert type to measure this.
  • Time to respond/resolve an alert: We want the on-call engineer to be able to respond quickly to the alert and ideally resolve it as fast as possible. We can use Time to respond and Time to resolve per alert type to measure this. We can use the median (p-50) or the 90th percentile (p-90) of these metrics based on how aggressive we want to be in goal setting.
  • On-call quality of life: We don’t want the on-call engineer’s life to be overwhelmed with alerts. We can use Utilization rate or a Sentiment score for each on-call week to measure this.

Interviewer: Sounds like a good start. How would we calculate these metrics? How can we use them to gauge how well we are doing?

Interviewee: Sure. Let me first walk through how we would measure these. Before we begin, we need to make sure that alerts are annotated with sufficient metadata to calculate these metrics. Each on-call alert should always be in one of four states: Open, Acknowledged, Resolved, or Won’t Fix, with a timestamp recorded for each state transition. We can then use this metadata to measure the following:

  • Signal-to-noise ratio: For each alert type, we calculate the following:

    Total number of resolved alerts / Total number of Won’t Fix alerts

  • Time to respond: Using the associated timestamps for each alert type, we measure the p-50 or p-90 of the following:

    Acknowledged time - Open time

  • Time to resolve: Using the associated timestamps for each alert type, we measure the p-50 or p-90 of the following:

    Resolved time - Open time
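
The three calculations above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Alert` record, its field names, and the use of `statistics.quantiles` for the p-90 are all assumptions layered on the four states described above.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median, quantiles
from typing import List, Optional, Tuple

# Hypothetical alert record; the fields mirror the four states above.
@dataclass
class Alert:
    alert_type: str
    opened: datetime
    acknowledged: Optional[datetime] = None
    resolved: Optional[datetime] = None
    wont_fix: bool = False

def signal_to_noise(alerts: List[Alert]) -> float:
    """Resolved alerts divided by Won't Fix alerts (higher is better)."""
    resolved = sum(1 for a in alerts if a.resolved is not None)
    noise = sum(1 for a in alerts if a.wont_fix)
    return resolved / noise if noise else float("inf")

def time_to_respond(alerts: List[Alert]) -> List[float]:
    """Seconds from Open to Acknowledged for each acknowledged alert."""
    return [(a.acknowledged - a.opened).total_seconds()
            for a in alerts if a.acknowledged is not None]

def time_to_resolve(alerts: List[Alert]) -> List[float]:
    """Seconds from Open to Resolved for each resolved alert."""
    return [(a.resolved - a.opened).total_seconds()
            for a in alerts if a.resolved is not None]

def p50_p90(durations: List[float]) -> Tuple[float, float]:
    """Median and 90th percentile of a list of durations."""
    return median(durations), quantiles(durations, n=10)[-1]
```

In practice these records would come from the alerting system’s API rather than being constructed by hand, and each metric would be grouped per alert type as described above.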

Measuring on-call quality of life is a bit more manual. We will need the on-call engineer to provide the following data at the end of each rotation: an estimate of how much time was spent handling alerts and a sentiment score (on a 1–5 scale) for their on-call experience. We can use these inputs to calculate the Utilization rate and Sentiment score, respectively. As with the metrics above, we can take the p-50 or p-90 of these.
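
As a quick sketch, aggregating those end-of-rotation inputs might look like this. The survey rows and the one-week rotation length are illustrative assumptions.

```python
from statistics import median

# Hypothetical end-of-rotation survey rows: (hours spent on alerts, sentiment 1-5).
surveys = [(10.0, 4), (25.0, 2), (6.0, 5), (40.0, 1), (12.0, 4)]

ROTATION_HOURS = 168.0  # assumed one-week rotation

# Utilization rate: fraction of the rotation spent handling alerts.
utilization = [hours / ROTATION_HOURS for hours, _ in surveys]
sentiment = [score for _, score in surveys]

p50_utilization = median(utilization)  # p-90 via statistics.quantiles if desired
p50_sentiment = median(sentiment)
```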

To determine how well we are doing, we can define an SLA (Service Level Agreement) for each metric. For example, we can say “95% of alerts will be acknowledged in 10 minutes or less” or “85% of our alerts will have a Signal-to-noise ratio greater than 0.9”. Our goals should start out fairly modest and then gradually become more aggressive as we mature the process and develop better playbooks and more experienced engineers.
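
A sketch of checking one such SLA, “95% of alerts acknowledged in 10 minutes or less,” against a set of response times; the thresholds and sample data are assumptions for illustration.

```python
# Hypothetical SLA check; thresholds and sample data are assumptions.
ACK_SLA_SECONDS = 10 * 60   # "acknowledged in 10 minutes or less"
ACK_SLA_TARGET = 0.95       # "95% of alerts"

def sla_attainment(response_times, threshold):
    """Fraction of alerts whose time-to-respond meets the threshold."""
    if not response_times:
        return 1.0  # vacuously met when there are no alerts; a judgment call
    return sum(t <= threshold for t in response_times) / len(response_times)

# Sample time-to-respond values, in seconds.
times = [120.0, 300.0, 540.0, 900.0]
attained = sla_attainment(times, ACK_SLA_SECONDS)
meets_goal = attained >= ACK_SLA_TARGET
```

Tracking `attained` over time (per alert type) is what lets us ratchet the target upward as the process matures.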

Interviewer: Sounds good. Let’s imagine a hypothetical scenario: A new on-call engineer joins the rotation, and they will require some time to ramp up. How could we factor this into the metrics above?

Interviewee: Interesting. I am assuming this engineer would bias our metrics and make our process appear less effective than it actually is. To remove this bias, we can restrict the calculations to alerts (and survey inputs) from engineers who have completed at least two prior rotations. This way, data from new on-call engineers won’t be factored in until they’ve had time to properly ramp up.
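
One way to sketch that filter, assuming each alert record is tagged with the handling engineer and we keep a lookup of completed rotations (the names and data here are hypothetical):

```python
MIN_PRIOR_ROTATIONS = 2  # ramp-up cutoff suggested above

# Hypothetical lookup of completed rotations per engineer.
prior_rotations = {"alice": 5, "bob": 0, "carol": 2}

# Hypothetical per-alert records tagged with the handling engineer.
alerts = [
    {"engineer": "alice", "time_to_respond": 120.0},
    {"engineer": "bob", "time_to_respond": 1800.0},   # still ramping up
    {"engineer": "carol", "time_to_respond": 240.0},
]

# Keep only alerts handled by engineers past the ramp-up period.
seasoned = [a for a in alerts
            if prior_rotations.get(a["engineer"], 0) >= MIN_PRIOR_ROTATIONS]
```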

Interviewer: Sounds reasonable. Let’s move on to the next question.
