In SaaS, every outage is a learning opportunity, and each one needs to be communicated to the dev team so they can debug it, fix it, and learn from it. Currently, our app sometimes experiences a few minutes of downtime, and while the dev team handles it well, there is no record-keeping. I would like feedback from other teams: do you maintain a log of all outages with root cause analysis, and what tech stack do you use for collaborative record-keeping?
Here is the tech stack we currently use for this problem:
CloudWatch
Alarms
Uptime Robot
Thanks for reading!