Chris's Wiki :: blog/sysadmin/HaveGeneralHealthMetric

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/HaveGeneralHealthMetric

I was asked today whether our monitoring of our Prometheus setup would detect problems with its time series database (TSDB). The direct answer is that it wouldn't (unless Prometheus itself went down); much as with Alertmanager, all we check is that Prometheus is up. The more complicated answer is that this seems hard to do, and one reason is that while Prometheus exposes a number of specific metrics about aspects of its TSDB, it doesn't have an overall health metric for the TSDB (or for Prometheus as a whole). I think it should have one; in fact I now think that everything with specific health metrics should also have an overall health metric.

(Probably this means I need to take a second look at some of our own metric generation.)

A system having an overall health metric insulates people consuming its metrics (like us) from two issues. First, we don't have to hunt down all of the specific health metrics that the system exposes and sort them out from all of its other metrics. The current version of Prometheus has 129 different metrics (not counting the general metrics from Go), of which I think perhaps eight indicate some sort of TSDB problem. Except I'm not sure that all of the metrics I picked out really indicate failures, and I'm not at all sure that I found all of the relevant metrics. If there were an overall health metric, people like me would be insulated from mistakes here in both directions; we wouldn't miss metrics that actually indicate problems, and we wouldn't generate spurious alerts from metrics with alarming names that don't.

(In fact, doing a brute force grep check for TSDB metrics with 'error', 'fail', or 'corrupt' in their name turned up several additions to my initial eight.)
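As a rough illustration of that brute force check, here is a minimal Python sketch that scans text in the Prometheus exposition format for metric names containing one of those words. The function name, the `prometheus_tsdb_` prefix filter, and the sample text are assumptions made for the example; the sample is not a real scrape.

```python
import re

# Substrings that suggest a metric is reporting some kind of problem.
SUSPICIOUS = ("error", "fail", "corrupt")


def suspicious_metric_names(exposition_text, prefix="prometheus_tsdb_"):
    """Return the sorted names of metrics under `prefix` with a suspicious word in them."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name runs up to the first '{' (labels) or whitespace.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match is None:
            continue
        name = match.group(0)
        if name.startswith(prefix) and any(word in name for word in SUSPICIOUS):
            names.add(name)
    return sorted(names)


# Illustrative sample only, not a real scrape of a Prometheus server.
SAMPLE = """\
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 12345
prometheus_tsdb_compactions_failed_total 0
prometheus_tsdb_wal_corruptions_total 0
prometheus_http_requests_total{code="200",handler="/metrics"} 42
"""

if __name__ == "__main__":
    for name in suspicious_metric_names(SAMPLE):
        print(name)
```

A name match like this is only a starting point. It can't tell you whether a matched metric really signals a failure, or whether an innocuously named metric does.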

Second, we don't have to worry about an update to the system adding more specific metrics for (new) errors that we should add checks on. A properly done overall health metric should include those new metrics (and also deal with any restructuring of the system and shuffling of specific metrics). Without this, restructuring error metrics and perhaps adding new ones are quietly breaking changes, because monitoring that used to be comprehensive (at least in theory) stops being comprehensive. At the same time, you don't want to force systems to never add or restructure specific, detailed error metrics, because those specific error metrics need to reflect the actual structure of the system and the disparate things that can go wrong.

This points to the general issue with specific metrics, which is that specific metrics reflect the internal structure of the system. To understand them, you really need to understand this internal structure, and when the internal structure changes the metrics need to change as well. It's not a great idea to ask people who just want to use your system to understand its internal structure in order to monitor its health. It's better for everyone to also give people a simple overall metric that you (implicitly) promise will always reflect everything important about the health of the system.

To be explicit, you could have overall health metrics for general subsystems, such as Prometheus's time series database. You don't have to just have an overall 'is Prometheus healthy' metric, and in some environments an honest overall health metric might alarm too often. I do think it's nice to have a global health metric, assuming you can do a sensible one.
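One way a system could produce such roll-ups is sketched below in Python, under the assumption that each specific health condition can be expressed as a small check function owned by the developers. The class, the check names, and the `myapp_*` metric names are all hypothetical.

```python
class HealthRegistry:
    """Collects named health checks per subsystem and rolls them up."""

    def __init__(self):
        # subsystem name -> {check name: zero-argument callable, True if healthy}
        self._checks = {}

    def register(self, subsystem, name, check):
        self._checks.setdefault(subsystem, {})[name] = check

    def subsystem_healthy(self, subsystem):
        # A subsystem is healthy only if every one of its checks passes.
        return all(check() for check in self._checks.get(subsystem, {}).values())

    def render(self):
        """Render the roll-ups as text exposition lines (1 = healthy, 0 = not)."""
        lines = []
        overall = True
        for subsystem in sorted(self._checks):
            ok = self.subsystem_healthy(subsystem)
            overall = overall and ok
            lines.append('myapp_subsystem_healthy{subsystem="%s"} %d' % (subsystem, ok))
        # The global metric is healthy only if every subsystem is.
        lines.append("myapp_healthy %d" % overall)
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical internal counters standing in for specific error metrics.
    state = {"wal_corruptions": 0, "failed_compactions": 1}
    registry = HealthRegistry()
    registry.register("tsdb", "wal", lambda: state["wal_corruptions"] == 0)
    registry.register("tsdb", "compaction", lambda: state["failed_compactions"] == 0)
    registry.register("scrape", "manager", lambda: True)
    print(registry.render())
```

The point of this shape is that when developers add or restructure a specific error metric, they register or change a check in the same place, and anyone alerting on the roll-up metrics doesn't have to change anything.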

Sidebar: The low-rent implementation of general health metrics

If your system is reasonably large, it probably has some sort of logging framework that categorizes log messages by both subsystem and broad level of alarmingness. Add a hook into your logging system that tracks the last time a message was emitted for a given subsystem at a given priority level, and expose these times (labeled with level and subsystem) as metrics. Then people like me can put together monitoring for things like 'the Prometheus TSDB has logged warnings or above within the last five minutes'.
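To make the idea concrete, here is a minimal sketch using Python's standard `logging` module, treating the logger name as the subsystem. The handler class, the `myapp_log_last_emitted_timestamp_seconds` metric name, and the `myapp.tsdb` logger name are hypothetical, and a real program would serve the rendered text from its metrics endpoint instead of printing it.

```python
import logging
import time


class LastEmittedHandler(logging.Handler):
    """Records the time of the most recent log record per (subsystem, level)."""

    def __init__(self):
        super().__init__()
        # (logger name, numeric level) -> Unix timestamp of the latest record
        self.last_emitted = {}

    def emit(self, record):
        # record.created is the time at which the LogRecord was created.
        self.last_emitted[(record.name, record.levelno)] = record.created

    def render(self):
        """Render the tracked times as text exposition lines."""
        lines = []
        for (subsystem, levelno), when in sorted(self.last_emitted.items()):
            level = logging.getLevelName(levelno).lower()
            lines.append(
                'myapp_log_last_emitted_timestamp_seconds'
                '{subsystem="%s",level="%s"} %.3f' % (subsystem, level, when)
            )
        return "\n".join(lines)

    def logged_within(self, subsystem, min_level, seconds, now=None):
        """True if `subsystem` logged at `min_level` or above in the last `seconds`."""
        if now is None:
            now = time.time()
        return any(
            name == subsystem and levelno >= min_level and now - when <= seconds
            for (name, levelno), when in self.last_emitted.items()
        )


if __name__ == "__main__":
    handler = LastEmittedHandler()
    logging.getLogger("myapp").addHandler(handler)
    # Records from child loggers propagate up to the handler on "myapp".
    logging.getLogger("myapp.tsdb").warning("compaction failed")
    print(handler.render())
    # 'Has the TSDB logged warnings or above within the last five minutes?'
    print(handler.logged_within("myapp.tsdb", logging.WARNING, 300))
```

In practice the 'within the last five minutes' comparison would normally live in the monitoring system's alert rule and not in the program itself; `logged_within` is only here to show the logic.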

This is unaesthetic and probably will win you no friends among the developers if you propose it as a change, but it's simple and it works.

(If the log message format is regular, you can also implement this with an outside tool, such as mtail, that parses the program's log messages to extract the level and subsystem (and the timestamp).)
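As a sketch of the outside-tool approach (in Python, not as an actual mtail program), the following parses log lines and keeps the latest timestamp seen for each subsystem and level. It assumes logfmt-style lines of `key=value` pairs with `ts`, `level`, and `component` keys and a UTC timestamp with fractional seconds; those field names and the sample lines are assumptions for illustration, so adjust them to whatever your program really logs.

```python
import re
from datetime import datetime, timezone

# key=value pairs, where a value is either double-quoted or a run of non-spaces.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line):
    """Return (subsystem, level, unix_time) for a log line, or None if it doesn't fit."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    # Assumed field names; lines without all three are ignored.
    if not {"ts", "level", "component"} <= fields.keys():
        return None
    try:
        when = datetime.strptime(fields["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    when = when.replace(tzinfo=timezone.utc)
    return fields["component"], fields["level"], when.timestamp()


def last_seen(lines):
    """Map (subsystem, level) to the latest timestamp seen in `lines`."""
    latest = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        subsystem, level, when = parsed
        key = (subsystem, level)
        latest[key] = max(when, latest.get(key, when))
    return latest


# Hypothetical log lines in the assumed format.
SAMPLE_LOG = [
    'ts=2022-05-23T10:15:00.123Z caller=head.go:1 level=warn component=tsdb msg="example warning"',
    'ts=2022-05-23T10:16:30.000Z caller=head.go:2 level=warn component=tsdb msg="another one"',
    'ts=2022-05-23T10:17:00.000Z caller=main.go:3 level=info msg="no component here"',
    "this line is not in the expected format",
]

if __name__ == "__main__":
    for (subsystem, level), when in sorted(last_seen(SAMPLE_LOG).items()):
        print(subsystem, level, when)
```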

Written on 23 May 2022.