Metrics Sense: Designing a Metric I

An example interview question on designing a metric for an on-call process.

Question

How do you measure the effectiveness of an on-call process?

Background

Being able to effectively define and measure success metrics for your program is critical to the TPM role. This question is intended to help develop these skills.

As in a previous question, we'll use a role-playing approach to simulate a live interview.

Solution approach

A mistake here would be to start by brainstorming a list of possible metrics. We want a complete understanding of the problem before we dive into the details. We'll use the following approach for this question:

  • Clarify the question given its ambiguous nature.
  • Outline the process workflow and state our goals and assumptions.
  • Propose metrics that measure how well we are meeting our stated goals.

Sample answer

Interviewee: I want to start by clarifying what we mean by an "on-call process." I understand this to mean that we have a rotation of engineers (or other staff) who are responsible for handling alerts or incidents for a particular system for a period of time. Typically, an on-call rotation has a primary and a backup who are on call at the same time. Does that align with your intent?

Interviewer: Yes, that sounds good.

Interviewee: Got it. Let me walk through the workflow of an on-call engineer to make sure I understand the picture:

  • The on-call engineer receives an alert via some communication channel (phone call, text message, email, etc.) that indicates a problem.

  • They then acknowledge the alert and begin investigating it.

  • If the alert is actionable, the engineer takes action to mitigate the problem. This action takes a variable amount of time.

  • They may also document their findings in a playbook to ensure a standard response in the future.

Interviewer: Sounds good to me.
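
To make the workflow measurable, it can help to record a timestamp at each step. The following is a minimal sketch in Python, not a definitive implementation: the `AlertRecord` class, its field names, and the two helper methods are hypothetical names chosen for illustration and do not come from any particular paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    """One alert as it moves through the workflow (hypothetical schema)."""
    fired_at: datetime                          # alert sent to the on-call engineer
    acknowledged_at: Optional[datetime] = None  # engineer acknowledges the alert
    mitigated_at: Optional[datetime] = None     # mitigation finished, if any
    actionable: bool = True                     # False when no action was possible
    playbook_updated: bool = False              # findings documented in a playbook

    def time_to_acknowledge(self) -> Optional[timedelta]:
        """Elapsed time from alert to acknowledgement, or None if never acknowledged."""
        if self.acknowledged_at is None:
            return None
        return self.acknowledged_at - self.fired_at

    def time_to_mitigate(self) -> Optional[timedelta]:
        """Elapsed time from alert to mitigation, or None if not applicable."""
        if not self.actionable or self.mitigated_at is None:
            return None
        return self.mitigated_at - self.fired_at


# Example with made-up timestamps.
alert = AlertRecord(
    fired_at=datetime(2024, 1, 1, 3, 0),
    acknowledged_at=datetime(2024, 1, 1, 3, 4),
    mitigated_at=datetime(2024, 1, 1, 3, 40),
)
print(alert.time_to_acknowledge())  # 0:04:00
print(alert.time_to_mitigate())     # 0:40:00
```

Each field maps to one step of the workflow above, so any metric proposed later can be traced back to an event that the process actually records.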

Interviewee: Okay, so based on this workflow, I see three main buckets of metrics:

  • Alert health: We want to minimize false positive/noisy alerts and maximize the
...