Author: Steffen F. Pirsig
Topic: Anatomy of a disaster (at Alaska Software)
Posted: Tue, 10 May 2011 16:14:56 +0200
Hi,

here we go with some kind of technical write-up of what really happened at Alaska Software last week...

The disaster prelude, part #1:

It all started to develop approximately 6 months ago, when some of our external servers moved from one data center to another. As a side effect of that move, which by itself went perfectly, our primary and secondary DNS servers were "replaced" by a single physical machine. We know this is not a good idea, but for whatever reason that little detail went unnoticed by the admin when the move was made. Of course, everything worked fine, so nobody cared about it.

Intermezzo:

To understand how the above relates to the disaster that happened last week, a little look into the IT infrastructure of Alaska Software is required. First of all, there are 3 external server systems at a data center. Then there are 4 application servers hosted inside our intranet. These systems communicate using message queues, which handle table replication, web services over HTTP using REST and, of course, the Web Application Adaptors. This system works very reliably, and any part of this infrastructure can normally go down without hampering the other parts, thanks to many-to-many table replication between DBF tables and SQL servers. But as always, the system has one single point of failure, and that is DNS. If DNS fails to work, none of the server systems or message queue transport agents are able to connect to any other party. Furthermore, none of our customers can use any of the public Internet services of Alaska Software, such as www, email or whatever. So DNS needs to work at all times. That's why there is a primary and a secondary DNS server!

The real disaster, part #2:

Having said that, the disaster started last week with a distributed brute-force attack against the external server hosting the primary and secondary DNS. Again, primary and secondary DNS were on the same physical machine by accident.
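The trap described above is easy to check for mechanically. A minimal sketch (the IP addresses are illustrative placeholders, not Alaska Software's, and machine identity is judged only by IP address):

```python
# Sketch: flag a zone whose "redundant" nameservers collapse onto one machine.
# The IP addresses below are made-up documentation addresses (192.0.2.x).

def distinct_dns_machines(nameserver_ips):
    """Return the set of distinct hosts behind the configured nameservers.

    Redundancy only exists if this set has more than one member; two NS
    records pointing at the same address are a single point of failure.
    """
    return set(nameserver_ips)

# Healthy setup: two NS records, two machines.
assert len(distinct_dns_machines(["192.0.2.1", "192.0.2.2"])) == 2

# The accident described above: two NS records, one physical machine.
assert len(distinct_dns_machines(["192.0.2.1", "192.0.2.1"])) == 1
```

Note that checking addresses alone would still miss two IPs bound to one box or one site, which is why the fix below puts the servers at two different sites.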
This brute-force attack aimed at spying out valid account names and passwords on this machine. But the attack also brought down the DNS server - which by itself is not a big issue, as the secondary DNS normally takes over once the primary DNS times out. Unfortunately, primary and secondary were located on the same machine, leading to the nice effect that the primary DNS timed out, and the secondary did, too! In addition, the massive brute-force rate led to a complete malfunction of the Windows authentication system after some time. This brought on the even nicer effect that the machine could no longer be administered remotely. Only a hard reboot brought it back for a few minutes, until the brute-force attack brought it down again.

Lesson learned: The primary and secondary DNS are now on two physical machines, located at two different sites. And a tertiary DNS is planned.

Hope you can draw something useful from it...

regards
Steffen F. Pirsig
Alaska Software Inc.
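The failover behaviour described above can be sketched like this (a hedged sketch: the resolver callables are hypothetical stand-ins for real DNS queries, not the actual Windows DNS setup). The point is that when both resolvers die for the same reason, falling back buys nothing:

```python
# Sketch of a stub resolver's primary/secondary failover. Each resolver
# callable stands in for a real DNS query: it either returns an address
# string or raises TimeoutError, just like a query to a dead server.

def resolve_with_failover(name, resolvers):
    """Try each configured DNS server in order; return the first answer."""
    for query in resolvers:
        try:
            return query(name)
        except TimeoutError:
            continue  # this server is down, fall through to the next one
    raise TimeoutError(f"all DNS servers failed for {name!r}")

def dead_server(name):          # a DNS instance on the attacked machine
    raise TimeoutError

def healthy_server(name):       # a nameserver on a separate machine
    return "192.0.2.10"         # illustrative answer, not a real record

# Primary down, independent secondary answers: failover works as intended.
assert resolve_with_failover("www.example.com",
                             [dead_server, healthy_server]) == "192.0.2.10"

# Primary and secondary on the same dead machine: total outage.
try:
    resolve_with_failover("www.example.com", [dead_server, dead_server])
    outage = False
except TimeoutError:
    outage = True
assert outage
```

This is exactly why the fix was to separate the machines (and sites): the failover mechanism itself was working; it just had nothing independent to fail over to.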