Steffen F. Pirsig

Anatomy of a disaster ( at Alaska Software )
on Tue, 10 May 2011 16:14:56 +0200
Hi,

here we go with a technical write-up of what really happened
at Alaska Software last week...

The disaster prelude, part #1: It all started to develop approximately 6
months ago, when some of our external servers moved from one data center to
another data center. As a side effect of that move, which by itself went
perfectly, our primary and secondary DNS servers were "replaced" by a single
physical machine. We know this is not a good idea, but for whatever reason
that little detail went unnoticed by the admin when the move was made. Of
course, everything worked fine, so nobody cared about it.

Intermezzo: To get a better understanding of how the previous is related to
the disaster which happened last week, a little look into the IT
infrastructure of Alaska Software is required. First of all, there are 3
external server systems at a data center. Then there are 4 application
servers hosted inside our intranet. These systems communicate using
MessageQueues, which are used to handle table replication, web services over
HTTP using REST and, of course, the Web Application Adaptors. This system
works very solidly, and any part of this infrastructure can normally go down
without hampering the other parts, because of many-to-many table replication
between DBF tables and SQL servers. But as always, this system has one single
point of failure, and that's the DNS. If the DNS fails to work, none of the
server systems or MessageQueue transport agents are able to connect to any
other party. Furthermore, none of our customers can use any of the public
Internet services of Alaska Software, such as www, email or whatever. So DNS
always needs to work. That's why there is a primary and a secondary DNS
server!
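
To make that DNS dependency a bit more concrete, here is a small Python
sketch of what a stub resolver effectively does (dnspython is assumed; the
hostname and the 192.0.2.x addresses are only placeholders): it asks the
first configured nameserver and falls back to the second one on a timeout.
If both entries point at the same physical box, the "failover" simply
retries the machine that just went down.

    import dns.exception
    import dns.resolver

    # Hypothetical setup: two nameserver entries, but both point at one host.
    NAMESERVERS = ["192.0.2.53", "192.0.2.53"]

    def resolve_with_failover(hostname):
        for ns in NAMESERVERS:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ns]     # ask only this server
            resolver.timeout = 2            # seconds per query
            resolver.lifetime = 2           # total seconds for this attempt
            try:
                answer = resolver.resolve(hostname, "A")
                return [rr.address for rr in answer]
            except dns.exception.Timeout:
                continue                    # "failover" to the next entry
        raise RuntimeError("all configured nameservers timed out")

    # Every MessageQueue transport agent and web service client stalls in a
    # lookup like this one when the single DNS machine is down.
    print(resolve_with_failover("www.example.com"))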

The real disaster, part #2: Having said that, the disaster started last week
with a distributed brute-force attack against the external server hosting
the primary and secondary DNS. Again, primary and secondary DNS were on the
same physical machine by accident. This brute-force attack aimed at spying
out valid account names and passwords on this machine. But this attack also
brought down the DNS server - which of course is not a big issue, as the
secondary DNS normally takes over if the primary DNS times out.
Unfortunately, primary and secondary were located on the same machine,
leading to the nice effect that the primary DNS timed out, and the secondary
did, too! In addition, the massive brute-force rate led to a complete
malfunction of the Windows authentication system after some time. This
brought on the even nicer effect that the machine was no longer remotely
administrable. Only a hard reboot brought it back for some minutes, until
the brute-force attack brought it down again.

Lesson learned: The primary and secondary DNS are now on two physical
machines, located at two different sites. And a tertiary DNS is planned.
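
If you want to check that you are not sitting in the same trap, a quick
Python sketch along these lines (again dnspython assumed, and the zone name
is just a placeholder) lists the NS records of a zone together with the
addresses they resolve to. Two NS names mapping to the same address means
redundancy on paper only.

    import dns.resolver

    def nameserver_addresses(zone):
        # Map each NS hostname of the zone to the A records it resolves to.
        addresses = {}
        for ns in dns.resolver.resolve(zone, "NS"):
            host = str(ns.target).rstrip(".")
            addresses[host] = sorted(rr.address
                                     for rr in dns.resolver.resolve(host, "A"))
        return addresses

    servers = nameserver_addresses("example.com")   # placeholder zone
    for host, addrs in servers.items():
        print(host, addrs)

    # All nameservers sharing one address is exactly the accident described above.
    if len({a for addrs in servers.values() for a in addrs}) < 2:
        print("WARNING: primary and secondary DNS resolve to the same machine")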

Hope you can draw something useful from it...

regards
Steffen F. Pirsig
Alaska Software Inc.