tldr:

Overview


When offering a large set of services to the public, software companies generally run a status page that summarizes the health of their network. For instance, when GitHub experiences an outage, you can see it at githubstatus.com along with a description of the interruption. Status pages are deliberately high level and are used primarily to communicate service outages and planned maintenance events to users. They can also be configured to notify users of various events by email and other means.

The ICON Foundation asked Insight to build a status page that gives a high-level overview of its services. In response, Insight put together demos of several open source offerings and settled on Cachet, a well-supported open source status page tool. Cachet serves as a central hub into which Telegram bots, alarms, notifications, and, to a small degree, metrics can be fed and displayed in a format that is easy to view and manage. While Cachet has many easy-to-manage features, it does not take the place of a full monitoring solution such as Prometheus and Grafana, which is better suited to node operators and is part of the ICON Network Monitoring project. Cachet is instead meant to stay very high level and to communicate events to end users, who in ICON's case tend to be application developers and users.
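As a rough illustration of how an alarm or bot might feed an event into Cachet, the sketch below builds a request against Cachet's v1 REST API for creating an incident. The function name, the base URL, and the token value are placeholders chosen for this example and are not taken from the project's code; the request is only constructed here, not sent.

```python
import json
import urllib.request

# Incident status codes used by Cachet's v1 API.
INVESTIGATING, IDENTIFIED, WATCHING, FIXED = 1, 2, 3, 4


def build_incident_request(base_url, api_token, name, message,
                           status=INVESTIGATING, visible=True):
    """Build (but do not send) a POST request that creates a Cachet incident.

    base_url and api_token are placeholders for a real Cachet deployment
    and its API token.
    """
    payload = {
        "name": name,
        "message": message,
        "status": status,
        "visible": 1 if visible else 0,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Cachet-Token": api_token,
        },
        method="POST",
    )


# Usage (hypothetical host and token):
# req = build_incident_request("https://status.example.org", "API_TOKEN",
#                              "API degraded", "We are investigating.")
# urllib.request.urlopen(req, timeout=10)
```

Keeping the request construction separate from the network call makes it easy to reuse the same helper from a bot, an alert hook, or a scheduled job.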

Currently the status page can be found at icon.status-page.net, with a development environment at cachet.blockstatus.net. The code can be found on our GitHub, and it can easily be migrated to any domain because the whole deployment is done with Terraform and Ansible. We will keep a development status page open to contributions from anyone who wants to customize the appearance or graphics. Cachet is written in PHP and has an active community.

https://www.lucidchart.com/documents/embeddedchart/4fe74fb6-f93b-4b1a-b983-aa79deb847c7

We have built a Telegram bot to update incidents and are working with the foundation to customize their action response plan. We also built a metrics collection agent that measures response latency on the four main networks supported by ICON: main net and three test nets. In the diagram above, essentially everything not directly connected to Prometheus is in the late stages of development as of 4/12/2020.
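The latency agent described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the agent Insight built: the helper names, the endpoint URL, the metric ID, and the token are all hypothetical, and the only external interface assumed is Cachet's v1 endpoint for adding a metric point.

```python
import json
import time
import urllib.request


def measure_latency_ms(probe, clock=time.perf_counter):
    """Return how long probe() takes, in milliseconds."""
    start = clock()
    probe()
    return (clock() - start) * 1000.0


def http_probe(url, timeout=10):
    """Return a zero-argument callable that fetches url once."""
    def probe():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    return probe


def build_metric_point_request(base_url, api_token, metric_id, value_ms):
    """Build (but do not send) a POST that adds one point to a Cachet metric.

    base_url, api_token, and metric_id are placeholders for a real deployment.
    """
    payload = {"value": round(value_ms, 2)}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/metrics/{metric_id}/points",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Cachet-Token": api_token,
        },
        method="POST",
    )


# Usage (hypothetical endpoint, metric ID, and token):
# latency = measure_latency_ms(http_probe("https://node.example.org/health"))
# req = build_metric_point_request("https://status.example.org", "API_TOKEN", 1, latency)
# urllib.request.urlopen(req, timeout=10)
```

Running one such measurement per network on a schedule, each with its own metric ID, would produce one latency series per network on the status page.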