User blog:Crucially/Explanation of the Sunday outage

This past Sunday at around 6:00 pm PST (GMT -8), Wikia suffered the worst kind of outage. Our network in San Jose completely died and we stopped being able to serve pages -- kind of. First, some background:

Wikia has four data centers. Three of them are currently involved in serving traffic: our primary in San Jose (SJC), our backup in Iowa, and a cache node in London (LON). We use a DNS service from Dynect for our Global Load Balancing (GLB). The GLB returns the IP of the closest data center when you look up www.wikia.com. This is used both for failover if a data center goes down and for performance, by routing users to the closest data center. London makes anonymous page access in Europe about three to six times faster than fetching the page from San Jose.
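As a rough illustration of the "closest healthy data center" idea, here is a minimal sketch. It is not how Dynect actually implements GLB (a real service would more likely use network measurements than map distance); the IP addresses are placeholders from the 192.0.2.0/24 documentation range, the coordinates are approximate, and all names are made up for this example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class DataCenter:
    name: str
    ip: str            # placeholder address, not a real Wikia IP
    lat: float         # approximate coordinates, for illustration only
    lon: float
    healthy: bool = True


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


def resolve(client_lat, client_lon, data_centers):
    """Return the IP of the closest data center that is still healthy."""
    candidates = [dc for dc in data_centers if dc.healthy]
    if not candidates:
        raise RuntimeError("no healthy data center left to answer with")
    closest = min(
        candidates,
        key=lambda dc: distance_km(client_lat, client_lon, dc.lat, dc.lon),
    )
    return closest.ip


DATA_CENTERS = [
    DataCenter("SJC", "192.0.2.10", 37.34, -121.89),
    DataCenter("Iowa", "192.0.2.20", 41.60, -93.60),
    DataCenter("LON", "192.0.2.30", 51.51, -0.13),
]

if __name__ == "__main__":
    print(resolve(48.86, 2.35, DATA_CENTERS))      # a client in Paris gets LON
    print(resolve(37.77, -122.42, DATA_CENTERS))   # a client in San Francisco gets SJC
```

Marking SJC as unhealthy in this model makes the San Francisco client resolve to Iowa instead, which is the failover behavior described above.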

We are also heavy users of Varnish, HTTP caching software that sits between users and our Apaches. If Varnish has a copy of the requested page in RAM or on disk (we use Solid State Disks to decrease response times), it will return it directly. If not, the request will be sent to the Apaches -- which are in San Jose -- and they return a freshly rendered page. Depending on the rules of the page, Varnish will save it to the cache or discard it. Pages are rendered by the Apaches for every request by logged-in users, so the page itself is never served from the cache for them. However, there are other assets on the page, such as CSS, images, and JavaScript, that are cached, making the site faster for logged-in users. Anonymous users typically get the page from the cache 85-95% of the time.
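The request flow above can be sketched in a few lines. This is a toy model of the decision logic only, not Varnish configuration (which is written in VCL); the class name, the `cacheable` rule hook, and the in-memory dictionary are assumptions made for the example.

```python
from typing import Callable, Dict


class PageCache:
    """Toy model of a cache sitting in front of the page-rendering backend."""

    def __init__(self, render: Callable[[str], str],
                 cacheable: Callable[[str], bool] = lambda url: True):
        self._render = render        # stands in for a request to the Apaches
        self._cacheable = cacheable  # stands in for the per-page caching rules
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, url: str, logged_in: bool) -> str:
        if logged_in:
            # Logged-in users always get a freshly rendered page.
            return self._render(url)
        if url in self._store:
            self.hits += 1
            return self._store[url]
        self.misses += 1
        body = self._render(url)
        if self._cacheable(url):
            self._store[url] = body  # keep it for the next anonymous request
        return body


if __name__ == "__main__":
    cache = PageCache(render=lambda url: f"<html>rendered {url}</html>")
    cache.fetch("/wiki/Main_Page", logged_in=False)  # miss, goes to the backend
    cache.fetch("/wiki/Main_Page", logged_in=False)  # hit, served from the cache
    cache.fetch("/wiki/Main_Page", logged_in=True)   # bypasses the cache
    print(cache.hits, cache.misses)
```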

The result is that if the Apache machines go down or are unable to serve traffic for some reason, anonymous users will still get pages served from the caches, but logged-in users will get the dreaded blank page. Some of you have asked why there is an urchin tag on the blank page. This is so we can track how many bad pages are served. Yesterday we served 800,000 blank pages over three hours to 210,000 people. At the same time we delivered around three million valid pages to around 450,000 people.
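For a sense of scale, here is a back-of-the-envelope calculation using only the rounded figures quoted above. The people counts are left out because the two groups may overlap.

```python
# Rounded figures from the paragraph above.
blank_pages = 800_000
valid_pages = 3_000_000
outage_minutes = 3 * 60

# Share of all page views during the outage that came back blank.
blank_share = blank_pages / (blank_pages + valid_pages)

# Average rate of blank pages over the three hours.
blank_per_minute = blank_pages / outage_minutes

print(f"{blank_share:.1%} of pages served were blank")           # 21.1%
print(f"about {blank_per_minute:,.0f} blank pages per minute")   # about 4,444
```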

Unfortunately, we had a problem with the failover between data centers. When SJC fails, traffic is supposed to be redirected to Iowa. This worked correctly for wowwiki.com but not for wikia.com, where users just got a timeout trying to connect. This was pure human error, as we hadn't gotten around to testing the failover yet. The issue was resolved around 7:00 pm PST and all traffic started going to Iowa or London. This part of the outage mainly affected the West Coast of the United States and Eastern Asia.

Our network in San Jose is simple, yet quite complex. Parts of it are set up to be redundant between the different floors in the data center. This creates a loop in the network and loops are bad. Traffic gets sent around and around in the loop, amplifying it until the entire network goes boom. So we have a situation where we build in redundancy, but that redundancy creates a problem. While there is a protocol called Spanning Tree (STP) that is designed to detect loops and block them, it does not work well.
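To see why a loop is so destructive, here is a toy simulation of broadcast flooding: every switch forwards a broadcast frame out of every link except the one it arrived on. The four-switch topology (two cores, two access switches) is hypothetical and much simpler than the real SJC network, and the model ignores link capacity and everything else a real switch does.

```python
def flood(links, source, steps):
    """Return the number of broadcast frames in flight after each step."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each frame in flight is a (sender, receiver) pair.
    in_flight = [(source, n) for n in sorted(neighbors.get(source, ()))]
    counts = [len(in_flight)]
    for _ in range(steps - 1):
        next_frames = []
        for sender, receiver in in_flight:
            for n in sorted(neighbors[receiver]):
                if n != sender:  # flood out of every link except the ingress one
                    next_frames.append((receiver, n))
        in_flight = next_frames
        counts.append(len(in_flight))
    return counts


# Redundant (looped) topology: each access switch is wired to both cores.
LOOPED = [
    ("core1", "core2"),
    ("core1", "acc1"), ("core1", "acc2"),
    ("core2", "acc1"), ("core2", "acc2"),
]

# Same switches with the redundant links blocked, as a working STP would do.
TREE = [("core1", "core2"), ("core1", "acc1"), ("core1", "acc2")]

if __name__ == "__main__":
    print("looped:", flood(LOOPED, "acc1", 10))  # keeps growing, never drains
    print("tree:  ", flood(TREE, "acc1", 10))    # drops to zero after two hops
```

In the tree, a single broadcast dies out once it has reached every switch. In the looped topology the same single frame keeps multiplying, which is the amplification described above.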

On Sunday, a switch in our network failed and suddenly disabled STP. This created one of these dreaded loops in the network, overloading it in a few seconds. The reason it took us three hours to fix was that all our capability to troubleshoot went away. The network was unusable for us too! Jason drove down to the data center and I tried to access as much as possible remotely. He got there at 7:00 pm PST and it then took us two hours to find the fault and temporarily fix it.

There are two projects already underway to fix this problem. The first one is a redesign of our network. It is a rather clean design that completely eliminates Ethernet-level loops and thus the need for STP, making this kind of fault impossible. We implemented this network design in Iowa in early October as a test and it has been working beautifully. ( https://monitor1.sjc.wikia-inc.com/weathermap/iowa_new_router.html ) The important part is that the link between core-i5 and core-i6 is a point-to-point routed link and not part of the Ethernet domain. Each rack is in its own Ethernet domain and subnet and there is no direct connection between racks.
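The "one subnet per rack" idea can be sketched with Python's standard `ipaddress` module. The address block, prefix length, and rack names below are invented for the example and are not our real addressing plan.

```python
import ipaddress
from typing import Dict, List


def plan_rack_subnets(site_block: str, racks: List[str],
                      prefix_len: int = 24) -> Dict[str, ipaddress.IPv4Network]:
    """Carve one non-overlapping subnet per rack out of a site-wide block."""
    subnets = list(ipaddress.ip_network(site_block).subnets(new_prefix=prefix_len))
    if len(racks) > len(subnets):
        raise ValueError("site block is too small for this many racks")
    return dict(zip(racks, subnets))


def needs_routing(ip_a: str, ip_b: str, plan: Dict[str, ipaddress.IPv4Network]) -> bool:
    """True if the two hosts sit in different rack subnets.

    Traffic between them then has to cross a router instead of being
    switched inside one shared Ethernet domain.
    """
    def rack_of(ip: str) -> str:
        addr = ipaddress.ip_address(ip)
        for rack, net in plan.items():
            if addr in net:
                return rack
        raise ValueError(f"{ip} is not in any rack subnet")

    return rack_of(ip_a) != rack_of(ip_b)


if __name__ == "__main__":
    # Hypothetical site block and rack names.
    plan = plan_rack_subnets("10.8.0.0/16", ["rack-01", "rack-02", "rack-03"])
    for rack, net in plan.items():
        print(rack, net)
    print(needs_routing("10.8.0.15", "10.8.0.99", plan))  # False: same rack
    print(needs_routing("10.8.0.15", "10.8.2.7", plan))   # True: different racks
```

Because no Ethernet domain spans more than one rack in this layout, a forwarding loop cannot spread across the whole data center the way it did on Sunday.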

Of course, this failure happened before we managed to implement this network design in San Jose. In Iowa we had the luxury of closing down the data center for 24 hours while we put it in place. In SJC, we have to implement the fix in stages. The plan was to start the transition in a week and it will probably take around two months. This will completely eliminate this particular failure mode.

The second thing we are doing is making Iowa an active backup instead of a passive backup. Today, if something completely catastrophic happened, we would manually start Iowa up and switch all traffic there. What we are working on is an automatic failover to Iowa if San Jose is down. The site would be set to read-only in Iowa, so you couldn't make any changes, but the pages would render correctly. With this in place, the site would have been viewable for all users during the outage. The plan was to finish this feature by the end of this week, but disaster struck before we were done! It is no coincidence that we decided to do this before changing the network in San Jose :) We are now accelerating that process and it should be done by Wednesday. So expect a test of that failover towards the end of this week.
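The intended behavior can be summarized as a small decision function. This is only a sketch of the policy described above, not the actual failover mechanism; the function names, the boolean health flags, and the request model are assumptions for the example.

```python
from enum import Enum
from typing import NamedTuple


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class Route(NamedTuple):
    site: str
    mode: Mode


def choose_route(sjc_up: bool, iowa_up: bool) -> Route:
    """Pick where traffic goes and whether edits are allowed there."""
    if sjc_up:
        return Route("SJC", Mode.READ_WRITE)   # normal operation
    if iowa_up:
        return Route("Iowa", Mode.READ_ONLY)   # automatic failover, viewing only
    raise RuntimeError("no data center available to render pages")


def handle_request(is_edit: bool, route: Route) -> str:
    """Pages always render; edits are refused while the site is read-only."""
    if is_edit and route.mode is Mode.READ_ONLY:
        return f"{route.site}: edit rejected, site is read-only"
    return f"{route.site}: ok"


if __name__ == "__main__":
    route = choose_route(sjc_up=False, iowa_up=True)
    print(handle_request(is_edit=False, route=route))  # Iowa: ok
    print(handle_request(is_edit=True, route=route))   # Iowa: edit rejected, site is read-only
```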

I am terribly sorry for this outage. Please bear with us as we make improvements to make sure it doesn't repeat itself.