Instance Metrics and Health Checks

Learn about instance metrics and health checks.

Instance metrics

The instance itself won’t be able to tell much about overall system health, but it should emit metrics that can be collected, analyzed, and visualized centrally. This may be as simple as periodically spitting a line of stats into a log file. The stronger our log-scraping tools are, the more attractive this option will be. Within a large organization, this is probably the best choice.
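As a minimal sketch of "spitting a line of stats into a log file," the snippet below formats a snapshot of counters as one JSON line that a log scraper could parse. The function names, the counter names, and the choice of JSON are assumptions made for illustration; any consistent, machine-readable line format would serve.

```python
import json
import logging
import time


def format_stats_line(stats, now=None):
    """Render one snapshot of counters as a single JSON log line.

    `stats` is a plain dict of metric names to values (hypothetical names).
    `now` lets a caller or test pin the timestamp; it defaults to the wall clock.
    """
    record = {"ts": int(now if now is not None else time.time())}
    record.update(stats)
    # sort_keys keeps the line stable, which makes scraping and diffing easier.
    return json.dumps(record, sort_keys=True)


def emit_stats(logger, stats, now=None):
    """Write the stats line to the given logger and return it."""
    line = format_stats_line(stats, now)
    logger.info(line)
    return line
```

An instance could call `emit_stats(logging.getLogger("stats"), {"requests": 10, "errors": 1})` from a timer or background thread, for example once a minute, and leave the collection and aggregation to the central tooling.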

An ever-growing number of systems have outsourced their metrics collection to companies like New Relic and Datadog. In these cases, providers supply plugins to run with different applications and runtime environments. They’ll have one for Python apps, one for Ruby apps, one for Oracle, one for Microsoft SQL Server, and so on. Small teams can get going much faster by using one of these services. That way we don’t have to devote time to the care and feeding of metrics infrastructure, which can be substantial. Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.

Health checks

Metrics can be hard to interpret. It takes some time to learn what “normal” looks like in the metrics. For quicker, easier summary information we can create a health check as part of the instance itself. A health check is just a page or API call that reveals the application’s internal view of its own health. It returns data for other systems to read (although that may just be nicely attributed HTML).

A health check should be more than just “yup, it’s running.” It should report at least the following:

  • The host IP address or addresses
  • The version number of the runtime or interpreter (Ruby, Python, JVM, .NET, Go, and so on)
  • The application version or commit ID
  • Whether the instance is accepting work
  • The status of connection pools, caches, and circuit breakers
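The items in this list could be gathered into a single report along the following lines. This is only a sketch: the field names, the `resources` dictionary (standing in for whatever the application knows about its pools, caches, and circuit breakers), and the choice of HTTP 200 versus 503 to signal "accepting work" are all assumptions, not a prescribed format.

```python
import json
import platform
import socket


def host_addresses():
    """Best-effort list of this host's IP addresses (may be empty)."""
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in infos})


def build_health_report(app_version, accepting_work, resources):
    """Assemble the instance's own view of its health as a plain dict.

    `resources` maps hypothetical names such as "db_pool" or
    "payments_breaker" to a status string supplied by the application.
    """
    return {
        "addresses": host_addresses(),
        "runtime": f"{platform.python_implementation()} {platform.python_version()}",
        "app_version": app_version,
        "accepting_work": accepting_work,
        "resources": resources,
    }


def health_response(report):
    """Return (HTTP status code, JSON body) for the health check endpoint.

    Assumption: 200 means "send me traffic" and 503 means "do not".
    """
    code = 200 if report["accepting_work"] else 503
    return code, json.dumps(report)
```

A web framework handler for the health check URL would then return the code and body from `health_response(build_health_report("1.4.2", True, {"db_pool": "ok"}))`, with the real version string and resource statuses filled in by the application.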

The health check is an important part of traffic management, which we’ll examine further in the Interconnect chapter. Clients of the instance shouldn’t look at the health check directly; they should be using a load balancer to reach the service. The load balancer can use the health check to tell if a machine has crashed, and it can also use it for the “go live” transition. When the health check on a new instance goes from failing to passing, it means the app is done with its startup.
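To make the two uses concrete, here is a small sketch of the bookkeeping a load balancer might do per instance. The class name, the event names, and the "eject after N consecutive failures" rule are assumptions for illustration; real load balancers have their own configuration for these thresholds.

```python
class InstanceTracker:
    """Tracks one instance's rotation state from successive health check results."""

    def __init__(self, failures_to_eject=3):
        # Hypothetical threshold: consecutive failures before removal.
        self.failures_to_eject = failures_to_eject
        self.in_rotation = False  # a new instance starts out of rotation
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Apply one health check result; return an event name or None."""
        if check_passed:
            self.consecutive_failures = 0
            if not self.in_rotation:
                # Failing-to-passing transition: startup is done, go live.
                self.in_rotation = True
                return "go_live"
            return None
        self.consecutive_failures += 1
        if self.in_rotation and self.consecutive_failures >= self.failures_to_eject:
            # Repeated failures: treat the instance as crashed or unhealthy.
            self.in_rotation = False
            return "ejected"
        return None
```

A new instance fails its first few checks while it starts up, so it receives no traffic. The first passing check produces `"go_live"`, and a later run of failures produces `"ejected"`.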
