In regular software engineering, we tend to monitor whether the software is working at all: no errors, acceptable response times, and so on. This is usually enough. But what can go wrong with machine learning code at runtime?

Regular software (say, a CRM system) rarely breaks without code changes or significant changes in input data. ML software, by contrast, can be sensitive to minor distribution shifts: seasonality, trends, or new cameras and microphones for visual/audio data.
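To make the idea of distribution shift concrete, here is a minimal sketch of one common detection approach: comparing a feature's current values against a reference sample from training time with a two-sample Kolmogorov-Smirnov test. The `detect_drift` helper and the variable names are illustrative, not part of any particular library; the sketch assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

def detect_drift(reference, current, alpha=0.05):
    """Illustrative helper: flag a feature as drifted when a two-sample
    Kolmogorov-Smirnov test rejects the hypothesis that both samples
    come from the same distribution."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return p_value < alpha

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time sample
same = rng.normal(loc=0.0, scale=1.0, size=1000)       # same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=1000)    # e.g., a seasonal shift

print(detect_drift(reference, same))     # likely False
print(detect_drift(reference, shifted))  # drift detected: True
```

A statistical test like this catches silent degradation that error logs and latency dashboards never would, which is exactly why ML systems need monitoring beyond the usual service metrics.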

Good monitoring comes with the following benefits:

  • We’re alerted when things break.
  • We can learn what’s broken and why.
  • We can inspect trends over long time frames.
  • We can compare system behavior across different versions and experimental groups (e.g., A/B testing).

ML-specific monitoring

In ML engineering, we should also monitor the quality of our models and pipelines, and carefully look for things like concept and data drift. At the same time, regular software problems are still there and can’t be ignored either.
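Monitoring model quality in production can be as simple as tracking a rolling metric over recent predictions and alerting when it drops below a threshold. The `ModelHealthMonitor` class below is a hypothetical minimal sketch of that idea, not an API from any monitoring tool:

```python
from collections import deque

class ModelHealthMonitor:
    """Hypothetical sketch: tracks rolling accuracy over a sliding window
    of recent (prediction, label) pairs and reports whether it stays
    above a minimum threshold."""

    def __init__(self, window_size=500, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window_size)  # True/False per prediction
        self.min_accuracy = min_accuracy

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    @property
    def accuracy(self):
        if not self.outcomes:
            return None  # no ground truth collected yet
        return sum(self.outcomes) / len(self.outcomes)

    def is_healthy(self):
        acc = self.accuracy
        return acc is None or acc >= self.min_accuracy

monitor = ModelHealthMonitor(window_size=100, min_accuracy=0.8)
for pred, label in [(1, 1), (0, 0), (1, 0), (1, 1)]:
    monitor.record(pred, label)
print(monitor.accuracy)      # 0.75
print(monitor.is_healthy())  # False — below the 0.8 threshold
```

In practice, ground-truth labels often arrive with a delay, so real systems pair a metric monitor like this with the label-free drift checks discussed above.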

We’ll cover three main aspects of machine learning monitoring in this lesson:

  1. Service Health
  2. Data Health
  3. Model Health
