This is the second article in the building mOps (modern ops process) series. If you like this article, please subscribe to hear about our upcoming blog posts.
For anyone who works in operations and uses PagerDuty, this story will be all too familiar. It is 2 or 3 a.m. and you receive the dreaded text or call letting you know that one or more services are experiencing an outage or not working as expected. You immediately wake up and get on your computer to check what’s wrong.
In the meantime, if you set everything up properly, you can go into PagerDuty, see which service alert was triggered, and check the error message. By this point PagerDuty has also alerted the rest of your team to the problem via Slack, so you let your team know: “hey, no need to fear, for I am looking into the issue”. If you have PagerDuty connected to StatusPage, your customers are also informed that there is a problem. This gives you a little time to investigate the issue before providing a more detailed update to your clients.
As you go in to check what’s wrong you become a little nervous, but you quickly calm down knowing that everyone is being updated on what is happening. You look at the error and re-run the failing API test in Runscope, and it starts passing again. For good measure, you run it manually once more and everything passes. You breathe a sigh of relief and let everyone know it was intermittent and that you will keep an eye on it. Since you’ve set everything up properly, the incident in PagerDuty resolves itself and your status page goes all green again, indicating everything is working as expected.
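The trigger-then-resolve behavior described above can be sketched in code. This is only a minimal illustration, not how the Runscope integration is actually implemented. It assumes PagerDuty's Events API v2, where an event with `event_action` set to `trigger` opens an incident and a later `resolve` event carrying the same `dedup_key` closes it. The function name `build_event`, the `api-test:` key prefix, and the `api-monitor` source are hypothetical names chosen for the example.

```python
# Assumption: PagerDuty Events API v2 endpoint (not mentioned in the article).
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key, test_name, passed):
    """Build an Events API v2 request body for one API test result.

    A failing test produces a "trigger" event; a passing test produces a
    "resolve" event. Both share a dedup_key so the resolve closes the
    incident that the trigger opened.
    """
    event = {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "resolve" if passed else "trigger",
        "dedup_key": f"api-test:{test_name}",  # hypothetical naming scheme
    }
    if not passed:
        # Trigger events carry a payload describing the failure.
        event["payload"] = {
            "summary": f"API test failing: {test_name}",
            "source": "api-monitor",  # hypothetical source name
            "severity": "critical",
        }
    return event
```

Each body would then be sent as JSON in a POST request to `EVENTS_URL`. Because the failing and passing results share one `dedup_key`, a test that recovers on its own closes the incident without anyone touching PagerDuty.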
On the other hand, if you look at the error and things really are wrong, you break into a cold sweat, but you calm down knowing that your team will persevere and resolve the problem in the end. So you let your development and operations teams know, and they work together to resolve the issue. In the meantime, you let your customers know there is a problem by updating your status page with more detailed information. You continue to keep everyone updated at least every 30 minutes. Once everything is fixed, your Runscope tests start passing, the incident resolves itself, and you let your clients know you will publish a post-mortem explaining what happened, why it happened, and how you will prevent it in the future.
You now breathe a big sigh of relief and maybe catch up on the ZZZs you missed. You are also at peace: even though PagerDuty will wake you from your sweet slumber, you sleep better knowing that if there is a problem, PagerDuty will always be there to call or text you. It’s like the friend you can always count on when things are not going the way you had planned.