User blog:Crucially/Iowa datacenter outage

I figured I would start writing a bit about how the internal systems at Wikia work, what we do to keep them working, and what happens, and why, every time they fail.

Wikia currently has 4 datacenters: San Jose (SJC), Iowa (IO), New Jersey (NJC) and London (LON). Of these, San Jose, Iowa and London are actively serving traffic. San Jose is our primary datacenter and Iowa is our backup datacenter, with London serving as a cache node only. Iowa also does double duty as a cache node when it is not actively serving traffic. We also use a CDN (CDNetworks) to host images.

Traffic is set up so that you get the closest datacenter most of the time, so if you are in the central or eastern US you end up going to Iowa.
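
As a minimal sketch of that idea (an illustration only, not the actual routing configuration), "closest" can be read as "lowest latency among the datacenters that are serving traffic". The `pick_datacenter` helper and all latency numbers below are hypothetical.

```python
# Datacenters that actively serve traffic, per the description above.
# NJC is deliberately left out because it is not serving traffic.
SERVING = ("SJC", "IO", "LON")


def pick_datacenter(latency_ms, serving=SERVING):
    """Return the serving datacenter with the lowest measured latency."""
    candidates = {dc: ms for dc, ms in latency_ms.items() if dc in serving}
    if not candidates:
        raise ValueError("no serving datacenter has a latency measurement")
    return min(candidates, key=candidates.get)


# Invented round-trip times (ms) for a visitor on the US east coast.
# NJC is physically closest but does not serve traffic, so Iowa is chosen.
east_coast = {"SJC": 75, "IO": 30, "LON": 85, "NJC": 10}
print(pick_datacenter(east_coast))
```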

Yesterday our datacenter in Iowa (backup) lost contact with the internet for 27 minutes. This caused a failover from Iowa to San Jose and London, which takes a few minutes because of DNS caches. Sadly, we served around 60,000 blank pages in Europe because of this error, and image service was degraded. We identified two problems that we are now working on correcting.


 * For users who are not logged in, London usually talks to Iowa instead of San Jose, since it is around 60 milliseconds closer. The setting to fail over to San Jose when Iowa is impaired was not configured correctly and had not been tested. This will be fixed and tested at the beginning of next week.
 * Our CDN uses Iowa to pull in images, so when Iowa went down we stopped serving images to people who were being served by the CDN. We are working on a solution that will let the CDN fail over to another datacenter when the current one is down. We hope to have one by the end of next week.
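
Both problems come down to the same rule: try the preferred upstream first and fall back to the next one when it is unhealthy. The sketch below shows that rule under stated assumptions. `choose_upstream` and `simulate_outage` are hypothetical names, and the health check is a stand-in for whatever the cache layer or CDN actually uses.

```python
# Preferred upstream order for London's logged-out traffic:
# Iowa first (closer), San Jose as the fallback.
UPSTREAMS = ["IO", "SJC"]


def choose_upstream(is_healthy, upstreams=UPSTREAMS):
    """Return the first healthy upstream, or None if all are down."""
    for dc in upstreams:
        if is_healthy(dc):
            return dc
    return None


def simulate_outage(down):
    """Build a fake health check that reports the given datacenters as down."""
    return lambda dc: dc not in down


# Normal operation: London talks to Iowa.
print(choose_upstream(simulate_outage(set())))
# Iowa unreachable: traffic should move to San Jose instead of failing.
print(choose_upstream(simulate_outage({"IO"})))
```

Running the outage case on purpose, as the second call does, is the kind of test that would show whether the fallback really works before a real outage does.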

If you have any questions, post them in the comments and I will try to answer them.