The Risk of Fragmentation and Logs/Stats

Learn about business concerns, logs, and stats.

Technical and business concerns

The usual notion of perspectives splits into technical and business concerns. The technical perspective may even be split into development and operations. Most of the time, these constituencies look at different measurements collected by different means. Imagine the difficulty in planning when marketing uses tracking bugs on web pages, sales uses conversions reported in a business intelligence tool, operations analyze log files in Splunk, and development uses blind hope and intuition.

Could this crew ever agree on how the system is doing? It’d be much better to integrate the information so all parties can see the same data through similar interfaces.

Different constituencies require different perspectives. These perspectives won’t all be served by the same views into the systems, but they should be served by the same information system overall. Just as the question, “How’s the weather?” means very different things to a gardener, a pilot, and a meteorologist, the question, “How’s it going?” means something decidedly distinct when coming from the CEO or the system administrator. Likewise, a bunch of CPU utilization graphs won’t mean a lot to the marketing team. Each special interest group in each company may have its own favorite dashboard, but everyone should be able to see how releases affect user engagement or how latency affects conversion rate.

Logs and Stats

In Transparency, we saw the importance of good logging and metrics generation at the microscopic scale. At the system scale, we need to gather all that data and make sense of it. This is the job of log and metrics collectors.

Like a lot of these tools, log collectors can work in either push or pull mode. Push mode means the instance is pushing logs over the network, typically with the venerable syslog protocol. Push mode is quite helpful with containers, since they don’t have any long-lived identity and often have no local storage.
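As a minimal sketch of push mode, Python’s standard library ships a `SysLogHandler` that sends each log record over the network as a syslog datagram. Here a plain UDP socket on an ephemeral port stands in for the central collector (in production this would be the collector host’s syslog port, typically 514), and the service name `orders-service` is hypothetical:

```python
import logging
import logging.handlers
import socket

# Stand-in for a central log collector: a UDP socket on an ephemeral port.
collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
collector.bind(("127.0.0.1", 0))
collector.settimeout(2)
host, port = collector.getsockname()

# The service pushes its log lines over the network via the syslog protocol.
handler = logging.handlers.SysLogHandler(address=(host, port))
logger = logging.getLogger("orders-service")
logger.addHandler(handler)
logger.propagate = False
logger.warning("payment gateway timeout, retrying")

# The collector receives "<priority>message" per the syslog wire format.
datagram = collector.recv(1024).decode()
print(datagram)
handler.close()
collector.close()
```

Because the transport is UDP, the instance needs no local storage and no long-lived identity — exactly why this mode suits containers.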

With a pull-mode tool, the collector runs on a central machine and reaches out to all known hosts to remote-copy the logs. In this mode, services just write their logs to local files.
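To illustrate pull mode, the sketch below simulates it locally: each “host” is a directory holding log files the services wrote, and the collector sweeps them into a central location, prefixing filenames with the host name so they don’t collide. The host names are hypothetical; a real collector would reach the hosts over the network (scp, rsync, or an agent protocol) on a schedule:

```python
import pathlib
import shutil
import tempfile

root = pathlib.Path(tempfile.mkdtemp())

# Services on each "host" just write their logs to local files.
hosts = ["app-01", "app-02"]
for h in hosts:
    host_dir = root / h
    host_dir.mkdir()
    (host_dir / "service.log").write_text(f"{h}: started\n")

# The collector reaches out to every known host and remote-copies the logs.
central = root / "central"
central.mkdir()
for h in hosts:
    for log in (root / h).glob("*.log"):
        shutil.copy(log, central / f"{h}-{log.name}")

collected = sorted(p.name for p in central.iterdir())
print(collected)
```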

Indexing logs

Just getting all the logs on one host is a minor achievement. The real beauty comes from indexing the logs. Then you can search them for patterns, make trendline graphs, and raise alerts when bad things happen. Splunk dominates the log indexing space today. The troika of Elasticsearch, Logstash, and Kibana is another popular implementation.
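The core data structure behind these indexers is an inverted index: a map from token to the set of log lines containing it, which makes pattern searches fast. The toy below shows that idea stripped to a few lines; real indexers such as Splunk or Elasticsearch also parse timestamps and fields and support range queries and aggregations:

```python
from collections import defaultdict

logs = [
    "2024-05-01T10:00:01 INFO  orders started",
    "2024-05-01T10:00:05 ERROR payment timeout",
    "2024-05-01T10:00:09 ERROR payment refused",
]

# Inverted index: token -> set of line numbers containing it.
index = defaultdict(set)
for line_no, line in enumerate(logs):
    for token in line.lower().split():
        index[token].add(line_no)

def search(*terms):
    """Return the log lines containing every term (an AND query)."""
    hits = set.intersection(*(index.get(t.lower(), set()) for t in terms))
    return [logs[i] for i in sorted(hits)]

print(search("error", "payment"))
```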

The story for metrics is much the same, except that the information isn’t always available in files. Some information can only be retrieved by running a program on the target machine to sample, say, network interface utilization and error rates. That’s why metrics collectors often come with additional tools to take measurements on the instances.
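Here is a sketch of the kind of on-host sampling a metrics agent performs. On Linux, the kernel exposes per-interface counters in `/proc/net/dev`; to keep the example runnable anywhere, it parses a simplified, hypothetical snippet of that format rather than reading the live file:

```python
# Simplified snippet in the spirit of /proc/net/dev (columns reduced).
SAMPLE = """\
Inter-|   Receive                |  Transmit
 face |bytes    packets errs drop|bytes    packets errs drop
  eth0: 1500000   12000    3    0  900000    8000    1    0
"""

def parse_net_dev(text):
    """Extract byte and error counters per network interface."""
    metrics = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        iface, counters = line.split(":")
        fields = counters.split()
        metrics[iface.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_errs": int(fields[2]),
            "tx_bytes": int(fields[4]),
            "tx_errs": int(fields[6]),
        }
    return metrics

print(parse_net_dev(SAMPLE)["eth0"])
```

An agent would take two such samples a known interval apart and report the difference as a rate — which is why this measurement can only be made by running a program on the target machine.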

Metrics also have the interesting property that we can aggregate them over time. Most of the metrics databases keep fine-grained measurements for very recent samples, but then they aggregate them to larger and larger spans as the samples get older. For example, the error rate on a NIC may be available second by second for today, in one-minute granularity for the past seven days, and only as hourly aggregates before that. This has two benefits. First, it really saves on disk space. Second, it also makes queries across very large time spans possible.
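The rollup step can be sketched in a few lines: per-second samples are grouped into fixed-width time buckets and each bucket is replaced by its average. Metrics databases do this automatically as samples age; the bucket width and averaging policy here are illustrative assumptions:

```python
def roll_up(samples, bucket_seconds=60):
    """samples: list of (epoch_second, value) pairs.
    Returns {bucket_start: average} with one entry per bucket."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts - ts % bucket_seconds, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

# Two minutes of per-second NIC error-rate samples collapse to two points.
samples = [(t, 1.0 if t < 60 else 3.0) for t in range(120)]
print(roll_up(samples))
```

The space saving is the ratio of the bucket width to the sample interval — 120 samples become 2 aggregates here — and queries over long spans touch proportionally fewer points.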
