Dear TitanFile clients, collaborators and community,
I am deeply relieved that we have been able to fully restore the service and it is back online.
I cannot express how grateful the TitanFile team and I are for your patience and your loyalty to our service, team and community. Today, we experienced a critical service disruption that affected a portion of our Canadian clients for almost 23 hours. I know that for many of our clients, TitanFile is a mission-critical application that is entrenched in many business processes, and an outage of this magnitude can be very disruptive. For that I sincerely apologize. I promise we will do better.
Below is a detailed account of the events that took place. I want to provide you with full transparency into the issue we experienced, the steps we took to resolve it, and what we will be doing to ensure this doesn’t happen again.
At 11:42pm on Sunday, January 27th, TitanFile engineers were running a routine maintenance check on the backup system and immediately discovered an issue. After an in-depth internal investigation overnight, we determined that there was a critical issue with the Microsoft Azure backup service. The Azure backup service failed while it was backing up the database, and the failure propagated to the database server itself. As a result, the database server went down and we had no backups to recover from.
At around 4:30am, we filed a severity “A – Critical” support request with Microsoft to notify them of the issue (a software bug in the Microsoft Azure backup system). Because all of the data on TitanFile is double encrypted and Microsoft support engineers don’t have access to TitanFile encryption keys, they couldn’t examine the data stores on their end to see what went wrong.
At around 7:30am, we started working alongside the Microsoft team, trying various troubleshooting methods. By 1:00pm we were no closer to figuring out how to recover the data. At this point, Microsoft brought in additional engineers with varying backgrounds to help figure this out.
At 3pm, Microsoft engineers concluded that there were two independent issues at play. The first was the backups failing and causing the database server to fail, and the second was the challenge of recovering the encrypted data from the redundant storage and restoring the service. The complexity of the issue puzzled the senior engineers on both sides, at TitanFile and at Microsoft.
By 5pm we had tried everything we could think of. We pressed on. We re-did every step, every combination, every experiment. Maybe we had missed a step here or there?
We worked very closely with Microsoft engineers, non-stop, throughout the evening. At around 8:30pm, together we were able to safely recover the TitanFile database and affected data without any data loss. Once the database was restored, TitanFile engineers conducted extensive checks to confirm that the data was intact, and brought the service back online around 10:45pm while I drafted this postmortem.
It was our #1 priority to restore the service as quickly as possible. We are doing a few things to ensure this doesn’t happen again.
First, we will continue to work with Microsoft engineers to fix the discovered bug and resolve this issue for the long term. Second, TitanFile will introduce a secondary backup system in addition to the primary backup service to provide greater backup redundancy.
I want to thank everyone from Microsoft and TitanFile for working around the clock to restore the service and, most importantly, our clients for their patience and endless support. Your trust and business mean everything to me, and we will do better.
Tony Abou-Assaleh, President and COO