Status:

Timeline: January 2020 - 2021+

Overview


Monitoring and alarms are critical components of every production environment. Professional node operators rely on a constant stream of data (metrics) collected from every part of a mission-critical application. Everything can be monitored, from system metrics such as CPU load and disk usage to application-layer metrics such as the number of active connections and block production statistics. With a robust monitoring setup in place, operators gain observability over the network and can be alerted when a condition of interest arises. These range from conditions that call for proactive action, such as running out of storage, to failures such as a halt in block production, where the node needs maintenance or must be redeployed. Metrics can also be displayed on dashboards that give an overview of minor incidents and overall network health, and that help in tuning network parameters and server sizing.

Right now, the application exports a health check with data about the node's status on the network, which can be viewed with the existing monitoring tool. The health check includes data such as the block height the node is synced to and the versions of ICON software it is running, but it lacks metrics such as disk and CPU usage. Instead of baking every imaginable metric into this health check to support the network, a more practical approach is to leverage an established open source ecosystem of monitoring tools such as Prometheus. These tools give us a full-featured monitoring solution that can grow to the full scale of the network.
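
To make the idea concrete, the sketch below shows one way health check data could be combined with host metrics and rendered in the Prometheus text exposition format, which a Prometheus server can scrape over HTTP. It is an illustration only, not the existing tool or a proposed implementation: the shape of the health check payload (`block_height`, `version`), the function name `render_metrics`, and the `icon_node_*` metric names are all assumptions made for this example.

```python
import os
import shutil


def render_metrics(health: dict, disk_path: str = "/") -> str:
    """Render a health-check payload plus host metrics as Prometheus text.

    `health` is a hypothetical payload; the real health check may use
    different field names.
    """
    version = health["version"]
    usage = shutil.disk_usage(disk_path)  # total, used, free in bytes
    lines = [
        "# HELP icon_node_block_height Block height the node is synced to.",
        "# TYPE icon_node_block_height gauge",
        f"icon_node_block_height {health['block_height']}",
        "# HELP icon_node_build_info Software version reported by the node.",
        "# TYPE icon_node_build_info gauge",
        f'icon_node_build_info{{version="{version}"}} 1',
        "# HELP icon_node_disk_used_ratio Fraction of the disk in use.",
        "# TYPE icon_node_disk_used_ratio gauge",
        f"icon_node_disk_used_ratio {usage.used / usage.total:.4f}",
    ]
    # os.getloadavg exists only on Unix-like systems, so guard the call.
    if hasattr(os, "getloadavg"):
        load1, _, _ = os.getloadavg()
        lines += [
            "# HELP icon_node_load1 One-minute system load average.",
            "# TYPE icon_node_load1 gauge",
            f"icon_node_load1 {load1}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example payload with made-up values.
    print(render_metrics({"block_height": 123, "version": "1.2.3"}))
```

In practice, host metrics such as disk and CPU usage would more likely come from an off-the-shelf exporter than from custom code; the point here is only that the existing health check data maps naturally onto simple gauge metrics.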

Components


Prometheus itself is just a database, but in practice it sits at the center of a whole ecosystem of tools, including: