Back in April, Giffgaff suffered a partial network outage for a few hours. We liveblogged the event throughout the evening and later, Mike Fairman, the “gaffer” and CEO of Giffgaff issued a formal apology on the Giffgaff blog.
However, we wanted to dig a little deeper and find out the exact explanation for the series of events that led to the outage. After all, many things must have gone wrong to produce the particular situation that left untold thousands without any mobile signal for several hours. And since there was another widespread outage at Giffgaff only last Friday, what better time than now to look back at the mistakes that have been made? [edit: There’s been another widespread outage tonight since we published this post. Seriously!]
This is especially true as there had only recently been an O2 issue in October the previous year, after which Derek McManus, the Chief Operating Officer, promised a £10 million investment in infrastructure to prevent the same thing happening again. There had also been a major 24-hour outage in July, which prompted Telefónica UK CEO Ronan Dunne to apologise and offer compensation to O2 customers, as well as a small outage on Giffgaff in March last year.
Given these previous issues and the promises of substantial investments in infrastructure to avoid further outages, we feel it’s only fair for customers to know the root cause and the decisions that led to them being without their phones for such a long time. Previously we investigated the true details of the big O2 outage, so now we’re aiming to do the same for this one.
Giffgaff usually have excellent communication and kept users up to date through the forums while the issue was ongoing. They also posted several blog posts detailing the causes of the outage and what preventative measures would be taken in the days and weeks after service was restored.
Charl Tintinger, Head of Operations at Giffgaff, made a blog post explaining in detail what happened. Effectively, his explanation was that a third party supplier “experienced a mains power failure, which resulted in a service wide outage”.
Now this may seem like an explanation, but it’s actually ignoring some very important issues. First of all, anyone who works in telecoms knows that it should take more than a mains power failure to bring down critical systems and affect uptime and necessary services. The standard configuration for any data centre is to use normal mains power from the National Grid but, in the event of an outage, to automatically transfer to battery-powered Uninterruptible Power Supplies (UPS). These keep the systems alive until the power is restored or, if the outage is ongoing, allow for a smooth transfer to backup diesel generators. These generators should automatically kick in and start pumping out electricity before the batteries die.
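To make the failover sequence above concrete, here’s a minimal sketch of the decision logic in Python. This is purely illustrative: the function name, inputs and ordering are our own assumptions, not anything from Giffgaff or their supplier.

```python
# Illustrative sketch only: a simplified model of the mains -> UPS -> generator
# failover sequence described above. All names and values are hypothetical.

def select_power_source(mains_ok: bool, ups_charge_pct: float,
                        generator_ok: bool) -> str:
    """Pick the active power source for the data centre."""
    if mains_ok:
        return "mains"        # normal operation: power from the National Grid
    if generator_ok:
        return "generator"    # generators should take over before batteries die
    if ups_charge_pct > 0:
        return "ups"          # batteries bridge the gap while generators start
    return "none"             # total outage: the situation Giffgaff ended up in

# During a mains failure the UPS carries the load until the generators are up:
print(select_power_source(mains_ok=False, ups_charge_pct=80.0, generator_ok=False))  # ups
print(select_power_source(mains_ok=False, ups_charge_pct=60.0, generator_ok=True))   # generator
```

The point of the sketch is the last branch: you only reach “none” if the generators never come online *and* the batteries run flat, which is exactly the combination that shouldn’t happen in a properly run facility.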
Data centres rarely rely on just battery UPS devices for emergency power – in fact, many even have secondary backup N+1 generators to provide additional redundancy. Critical services can take this even further. After all, the people who run mission-critical systems like this have learned over the years that, WTSHTF, outages can last longer than a few hours. It’s better to be prepared for the worst.
However, Giffgaff’s statement said that the reason for this outage was that the “backup power system” was “depleted before the issue was completely resolved”. Now backup generators typically have about a week’s supply of diesel – there’s no way this would have been going on for so long without the administrators being notified or the issue being resolved. So it sounds like there are two possibilities – either the data centre Giffgaff had outsourced to didn’t have backup generators or there was some sort of switchover failure when transferring from the UPS power to the generators.
Both of these possibilities call into question the competence of the third party Giffgaff were using, as well as why they were chosen. It’s a real shame that they were not explicitly named, nor was it mentioned whether they’d still be supplying services for the mobile network. After all, it is standard practice to have a Service Level Agreement that guarantees a certain level of uptime. For example, some service providers promise availability where the downtime is less than 5.26 minutes per year – the equivalent of 99.999% availability (commonly referred to as “five nines”). Such agreements are routine in telecommunications – it is not uncommon for network service providers to explicitly state theirs on their websites.
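If you want to check the “five nines” arithmetic above for yourself, it’s a one-liner: allowed downtime is simply the fraction of the year that falls outside the availability target.

```python
# Back-of-the-envelope check of the availability figures quoted above.
minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes in an average year

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = minutes_per_year * (1 - availability)
    print(f"{label}: at most {downtime:.2f} minutes of downtime per year")
```

Running this confirms that five nines allows roughly 5.26 minutes of downtime per year – a far cry from an outage lasting several hours.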
Giffgaff have said that “The data centre have already updated some of their processes and implemented some physical monitoring on the boxes should the direct power be lost and battery power start to be used”. They have also claimed that “the monitoring on the network nodes is being reviewed so that they alert at the right time, giving engineers enough time to get to site before the back up power is at risk”.
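The “alert at the right time” requirement Giffgaff describe boils down to a simple rule: raise the alarm while there is still enough battery runtime left for an engineer to travel to site, plus a safety margin. Here’s a hypothetical sketch of that rule – the thresholds and names are our own assumptions, not Giffgaff’s actual monitoring configuration.

```python
# Hypothetical sketch of battery-runtime alerting of the kind Giffgaff describe.
# The 30-minute safety margin is an illustrative assumption, not a real figure.

def should_alert(runtime_remaining_min: float, travel_time_min: float,
                 safety_margin_min: float = 30.0) -> bool:
    """Alert while there is still time to reach the site before the UPS dies."""
    return runtime_remaining_min <= travel_time_min + safety_margin_min

# With 60 minutes of battery left and a 45-minute journey, the alarm fires:
print(should_alert(runtime_remaining_min=60, travel_time_min=45))   # True
# With two hours of battery left, there is no need to alert yet:
print(should_alert(runtime_remaining_min=120, travel_time_min=45))  # False
```

The failure mode implied by Giffgaff’s statement is that an alert of this kind either fired too late or didn’t fire at all.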
You do expect to get the quality of service you pay for and this sounds suspiciously like the data centre was not meeting modern standards or that someone messed up pretty severely. There certainly shouldn’t be issues with monitoring the power supply to critical systems and we sincerely hope that Giffgaff are reviewing their contract with this third party.
For those of you who want a bit more detail about the particular system that was affected, Giffgaff have said that it was a “piece of hardware that manages calls and texts on the network”. This might not seem like the most helpful explanation but let’s think about how a mobile network is structured.
There are three main parts of the core Network Switching Subsystem – the Mobile Switching Centre (MSC), which sets up and releases all connections; the Visitor Location Register (VLR), which stores information about the physical location of each subscriber; and the Home Location Register (HLR), which holds information about each subscriber’s SIM card and phone number. Although we don’t have much information, we can be reasonably sure that this particular outage was due to some part of the Network Switching Subsystem.
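A rough way to picture how those three registers cooperate on a call is as a pair of lookup tables. This is a toy illustration of the general GSM architecture, not Giffgaff’s actual systems – every identifier below is made up.

```python
# Toy model of the Network Switching Subsystem registers described above.
# All IMSIs, numbers and cell IDs are fabricated for illustration.

# HLR: permanent record of each SIM (IMSI), its number (MSISDN) and current area
hlr = {"234990000000001": {"msisdn": "+447700900123",
                           "current_vlr": "VLR-London"}}

# VLR: which subscribers are currently attached in this switching area
vlr_london = {"234990000000001": {"cell": "LON-0042"}}

def route_call(msisdn: str) -> str:
    """MSC logic in miniature: the HLR finds the area, the VLR finds the cell."""
    for imsi, record in hlr.items():
        if record["msisdn"] == msisdn:
            area = record["current_vlr"]      # HLR says which VLR serves them
            cell = vlr_london[imsi]["cell"]   # VLR pins down the serving cell
            return f"route via {area} to cell {cell}"
    return "subscriber unknown"

print(route_call("+447700900123"))  # route via VLR-London to cell LON-0042
```

It should be clear from even this toy version why losing any one of the three pieces is catastrophic: with no MSC nothing is switched, and with no HLR or VLR the network can’t find anyone to switch to.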
Anyway, we hope that explains a bit more of the details behind this outage. Hopefully now you have a better idea how a mobile operator works and what can be done to avoid these sorts of issues. Thankfully, this outage is now well in the past. Although we are becoming increasingly reliant on our mobiles, a few hours without signal is usually not an emergency and hopefully can be filed away under first world problems.