02/23/2015

Improving Our System Status Communication

Organizations all over the world use Telerivet for mission-critical communication. We take system reliability very seriously, and we know that our customers rely on Telerivet to be available all the time – 24 hours a day, 365 (or 366) days a year.

Over the past three years, Telerivet has built an excellent record and reputation for reliability, in large part because of our significant work on systems and processes for redundancy, monitoring, alerting, and automatic failover – and perhaps in small part because of good luck.

However, Telerivet’s good luck took a break on Saturday, January 31, when three separate hardware problems on our servers coincided with additional problems in our automated failover systems, causing more than an hour and a half of downtime.

Telerivet has had a basic status blog since 2012, which we have rarely needed to update. As we responded to the support requests related to the downtime on January 31, however, we realized we needed a better way to communicate with customers about service interruptions.

Today, we’re happy to launch our new status page at status.telerivet.com. It’s hosted outside of Telerivet’s own infrastructure, so the status page should remain online even if Telerivet’s servers experience issues in the future.

On the status page, our system administrators will post updates in real time about any significant problems detected with Telerivet’s service. We’ll also publish detailed incident reports (postmortems) explaining the root cause of outages as well as actions we take to improve our systems.

Importantly, the new status page makes it easy for you to subscribe to get updates on future incidents with Telerivet, which can be delivered via email, SMS, or webhook.

In addition, we have published several real-time metrics on our status page to give you additional insight into how Telerivet’s service is performing.

For Telerivet’s web app and API, the public system metrics on status.telerivet.com show our servers’ uptime as determined by an external monitoring service. They also show the average response time and error rate for all HTTP requests, as computed from our server log files.
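
For readers curious how metrics like these can be derived from log files, here is a minimal sketch in Python. It is not Telerivet's actual implementation: the log format (one request per line, as "<status code> <response time in ms>"), the function name summarize, and the choice to count only 5xx responses as errors are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    total: int              # number of parsed requests
    errors: int             # requests counted as errors (assumed: HTTP 5xx)
    avg_response_ms: float  # mean response time in milliseconds
    error_rate: float       # errors / total, between 0.0 and 1.0


def summarize(log_lines):
    """Compute average response time and error rate from log lines.

    Assumes a hypothetical format of "<status> <response_ms>" per line;
    lines that do not match are skipped.
    """
    total = 0
    errors = 0
    total_ms = 0.0
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not in the assumed two-field format
        try:
            status = int(parts[0])
            response_ms = float(parts[1])
        except ValueError:
            continue  # fields present but not numeric
        total += 1
        total_ms += response_ms
        if status >= 500:  # assumption: only server errors count
            errors += 1
    if total == 0:
        return RequestStats(0, 0, 0.0, 0.0)
    return RequestStats(total, errors, total_ms / total, errors / total)


if __name__ == "__main__":
    sample = ["200 100", "500 300", "200 200"]
    print(summarize(sample))
```

A real pipeline would parse the web server's own log format and aggregate per time window, but the arithmetic is the same: sum the response times and count the errors, then divide each by the number of requests.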

[Image: status metrics]

With these public metrics, it’s now easier for Telerivet customers to see how well our service has performed – both now and in the past – and hold us accountable.

As the first entry on our new status page, we have completed a detailed incident report on the service interruptions of January 31. There you can read about the root causes and contributing factors of the downtime, as well as the actions Telerivet has taken over the past few weeks to fix the underlying issues and improve our reliability in the future. Read the incident report here.

We hope you find our new status page helpful. If you want to receive notifications of any future incidents, just visit status.telerivet.com and click the “subscribe to updates” button.