What to Expose
Explore how to select important metrics and variables to expose for effective monitoring in distributed systems. Understand challenges in predicting critical metrics, learn heuristic approaches, and discover categories of useful data like traffic indicators, resource health, and error counts to maintain system stability under changing demands.
Which variables and metrics to expose
If we could predict which metrics would limit capacity, reveal stability problems, or expose other cracks in the system, then we could monitor only those. But any such prediction has two problems. First, we’re likely to guess wrong. Second, even if we guess right, the key metrics change over time, because both the code and the demand patterns change. The bottleneck that burns us next year probably doesn’t exist right now.
Of course, we could spend an unlimited amount of effort exposing metrics for absolutely everything. But our system still has to do something other than collect data, so we’ve found a few heuristics to help decide which variables or metrics to expose. Some of these will be available right away. For others, we might need to add code to collect the data in the first place. Here are some ...