Sunday, August 20, 2017

Twice in One Week Disaster Recovery beyond our Control


It has been a busy week for disaster recovery so far - we had one outage on Monday and another today.


Monday's Email Outage
On Monday, we lost our e-mail services. No, it had nothing to do with the Lotus Domino server. That was fine.

At first, it looked as if we'd forgotten to renew our domain name. We quickly went back and forth checking various DNS services on the Web, our domain registrar and our Internet service providers. All seemed OK with our domain name, but there was definitely something weird happening.

Eventually we discovered that our Internet service provider had their DNS running off a domain which they had forgotten to renew. Since they were providing our primary DNS, all of our inbound mail was getting confused when it went to resolve our domain.
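The kind of check that would have caught this sooner can be sketched roughly as follows: given the hostnames of the nameservers that serve a domain, confirm that each of those hostnames still resolves. This is only a minimal sketch. The hostnames in the example are made up, the function names are my own, and the list of nameservers is assumed to be known already (for instance, copied from the registrar's records).

```python
import socket
from typing import Callable, Dict, Iterable, List


def default_resolver(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 53)) > 0
    except socket.gaierror:
        # Resolution failed, e.g. because the parent domain has lapsed.
        return False


def check_nameservers(
    nameservers: Iterable[str],
    resolves: Callable[[str], bool] = default_resolver,
) -> Dict[str, bool]:
    """Map each nameserver hostname to whether it currently resolves."""
    return {ns: resolves(ns) for ns in nameservers}


def unresolvable(
    nameservers: Iterable[str],
    resolves: Callable[[str], bool] = default_resolver,
) -> List[str]:
    """List the nameserver hostnames that do not resolve."""
    results = check_nameservers(nameservers, resolves)
    return [ns for ns, ok in results.items() if not ok]


if __name__ == "__main__":
    # Hypothetical nameserver hostnames, for illustration only.
    for ns in unresolvable(["ns1.isp.example", "ns2.isp.example"]):
        print(f"WARNING: nameserver {ns} does not resolve")
```

The resolver is passed in as a parameter so that the check can be exercised without touching the network. Run on a schedule, something like this would flag a lapsed nameserver domain before the mail starts to go missing.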

After being told by us (yes, they were unaware of the situation, even though it had started during the night and was not noticed immediately in the morning), they quickly got to work renewing their domain. Of course, given the sorts of problems associated with domain name propagation, our problems persisted in one form or another for several hours.

Years ago, such a problem would've been effectively "over" within a matter of hours. Unfortunately, more and more companies are outsourcing their services overseas, and it takes more than fixing the local domain services to resolve the problem.


Today's Building Outage
Our Wednesday problem occurred while I was at lunch. I hurried back and arrived at a darkened building. Luckily, since we moved offices, we aren't as high up as we used to be and I only had to run up six storeys' worth of stairs (although immediately after lunch, it felt like more than that).

It turned out that the entire local grid had lost power. Our Domino server and our main file server were happily running off UPS, but unfortunately the UPS handling our communications gear was not up to scratch. It didn't matter, because there was no way that the UPS would be able to power our systems for more than 30 minutes. Even if that had been possible, the temperature inside the computer room was rapidly climbing now that the air-conditioning was off.

We had no choice but to start shutting down the servers. It took us a little while to make that decision because we knew we had a little time and we were hoping for the power to come back on. Of course, as soon as we got halfway through the shutdown procedure, the power came back on. This was after a 45-minute outage in the centre of Sydney's CBD.
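The decision we agonised over can be framed as a simple budget: start shutting down once the remaining UPS runtime no longer covers the shutdown procedure plus a margin, or once the room gets too hot. The following is only a sketch of that idea. The shutdown time, safety margin and temperature limit are assumed values for illustration, not figures from our site, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoomStatus:
    ups_minutes_left: float  # estimated UPS runtime remaining
    room_temp_c: float       # computer room temperature in Celsius


def should_start_shutdown(
    status: RoomStatus,
    shutdown_minutes: float = 10.0,       # assumed time to shut everything down
    safety_margin_minutes: float = 5.0,   # assumed buffer on top of that
    max_temp_c: float = 35.0,             # assumed temperature limit
) -> bool:
    """Return True once waiting any longer risks an unclean shutdown."""
    if status.room_temp_c >= max_temp_c:
        # Heat alone is enough reason to stop, whatever the UPS says.
        return True
    return status.ups_minutes_left <= shutdown_minutes + safety_margin_minutes


if __name__ == "__main__":
    # With 30 minutes of UPS left and a cool room, we can afford to wait.
    print(should_start_shutdown(RoomStatus(ups_minutes_left=30, room_temp_c=24)))
```

Writing the thresholds down in advance does not bring the power back any sooner, but it does take some of the guesswork out of when to begin.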

Once again, the problem was "environmental" and out of our control. We could have switched to our offsite systems but it is a hard call to make because although our systems are clustered, we have a few special requirements which make a partially manual cut-over desirable. When the cut-over is not automatic, it becomes very difficult to make a decision as to when to flip the switch.


Out of Control Problems
What these stories bring to mind is that I keep reading anti-cloud computing "horror" stories from various vendors. In particular, they talk about Google's Gmail outages. I don't personally understand how people can think that cloud computing is any more or less unsafe than normal computing. As I said before, the problems had nothing to do with the Domino server. In fact, I can't remember a time when we've had an outage due to the Domino server.

I can remember plenty of times when we had hardware failures, ISP failures, power failures, air-conditioning failures, gas leaks and DNS failures. We've had problems with Anti-Virus and Anti-Spam services running on the Domino server - and when we moved them off the server, they still caused us the occasional problem at the gateway. We've even had problems because of Windows itself and device driver updates. It's never Domino though. The server product is entirely stable.

In some respects, our Domino mail is in exactly the same position as Gmail. It's not the product that is at fault, it's the underlying infrastructure - and it's out of our hands.
