
02 October 2013 – 10 Comments – by Jon M

What caused the Giffgaff outage?


Back in April, Giffgaff suffered a partial network outage for a few hours. We liveblogged the event throughout the evening and, later, Mike Fairman, the “gaffer” and CEO of Giffgaff, issued a formal apology on the Giffgaff blog.

However, we wanted to dig a little deeper and find out the exact explanation for the series of events that led to the outage. After all, many things must have gone wrong to lead to the particular situation that left untold thousands without any mobile signal for several hours. And since there was another widespread outage at Giffgaff only last Friday, what better time than now to look back at the mistakes that have been made in the past? [edit: There’s been another widespread outage tonight since we published this post. Seriously!]

This is especially true as there had only recently been an O2 issue in October the previous year, after which Derek McManus, the Chief Operating Officer, promised a £10 million investment in infrastructure to prevent the same thing happening again. There had also been a major 24-hour outage in July, which prompted Telefónica UK CEO Ronan Dunne to apologise and offer compensation to O2 customers, as well as a small outage on Giffgaff in March last year.

Given these previous issues and the promises of substantial investment in infrastructure to avoid further outages, we feel it’s only fair for customers to know the root cause and the decisions that led to them being without their phone for such a long time. Previously we investigated the true details of the big O2 outage, so now we’re aiming to do the same for this one.

Giffgaff usually communicate well, and they kept users up to date through the forums while the issue was ongoing. They also posted several blog posts detailing the causes of the outage and what preventative measures would be taken in the days and weeks after service was restored.

Charl Tintinger, Head of Operations at Giffgaff, made a blog post explaining in detail what happened. Effectively, his explanation was that a third party supplier “experienced a mains power failure, which resulted in a service wide outage”.

Now this may seem like an explanation, but it actually ignores some very important issues. First of all, anyone who works in telecoms knows that it should take more than a power outage to interrupt critical systems that affect uptime and essential services. The standard configuration for any data centre is to use normal mains power from the National Grid but, in the event of an outage, to automatically transfer to battery-powered Uninterruptible Power Supplies (UPS). These keep the systems alive until the power is restored or, if the outage is ongoing, allow for a smooth transfer to backup diesel generators. These generators should automatically kick in and start pumping out electricity before the batteries die.
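
To make that failover chain concrete, here is a minimal sketch of the mains → UPS → generator sequence a data centre controller is expected to follow. The timings, names and messages are illustrative assumptions on our part, not details of Giffgaff’s supplier.

```python
# A minimal sketch of the mains -> UPS -> generator failover chain
# described above. Timings and names are illustrative assumptions,
# not details of Giffgaff's supplier.
import time

UPS_RUNTIME_MINUTES = 15       # assumed battery bridge time
GENERATOR_START_SECONDS = 30   # diesel sets usually start well within the battery window


def start_generators(timeout_seconds):
    """Placeholder for the automatic transfer switch / generator start logic."""
    time.sleep(0)   # a real controller would poll the transfer switch here
    return True     # flip to False to model the switchover failure discussed below


def on_mains_failure(ups_runtime_minutes=UPS_RUNTIME_MINUTES):
    """Walk through the expected sequence when grid power drops."""
    print("Mains power lost: load transferred to UPS batteries")

    # The UPS only needs to bridge the gap until the generators are up.
    if start_generators(timeout_seconds=GENERATOR_START_SECONDS):
        print("Generators online: load transferred off the batteries")
        return "running_on_generator"

    # This is the failure mode the post goes on to suspect: the batteries
    # drain because the generators never pick up the load.
    print(f"Generator start failed: batteries deplete in ~{ups_runtime_minutes} minutes")
    return "running_on_ups_until_depleted"


print(on_mains_failure())
```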

Data centres rarely rely on just battery UPS devices for emergency power – in fact, many even have secondary backup N+1 generators to provide additional redundancy. Critical services can take this even further. After all, the people who run mission-critical systems like this have learned over the years that, WTSHTF, outages can last longer than a few hours. It’s better to be prepared for the worst.

However, Giffgaff’s statement said that the reason for this outage was that the “backup power system” was “depleted before the issue was completely resolved”. Now backup generators typically have about a week’s supply of diesel – there’s no way this would have been going on for so long without the administrators being notified or the issue being resolved. So it sounds like there are two possibilities – either the data centre Giffgaff had outsourced to didn’t have backup generators or there was some sort of switchover failure when transferring from the UPS power to the generators.
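
As a rough sanity check on that “week’s supply of diesel” figure, the back-of-the-envelope arithmetic looks like this. The tank size and burn rate below are assumed values, not figures from Giffgaff or its data centre supplier.

```python
# Back-of-the-envelope check on the "week's supply of diesel" claim.
# Tank size and burn rate are assumed values, not figures from Giffgaff
# or its data centre supplier.
tank_litres = 10_000           # assumed bulk fuel tank
burn_litres_per_hour = 60      # assumed consumption for a mid-sized generator at load

runtime_hours = tank_litres / burn_litres_per_hour
print(f"Estimated generator runtime: {runtime_hours:.0f} hours (~{runtime_hours / 24:.1f} days)")
# ~167 hours, i.e. roughly a week - which is why a multi-hour outage should
# never come down to the backup power simply running out.
```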

Both of these possibilities seriously call into question the competence of the third party Giffgaff were using, as well as why they were chosen. It’s a real shame that they were not explicitly named, nor was it mentioned whether they’d still be supplying services for the mobile network. After all, it is standard practice to have a Service Level Agreement that guarantees a certain level of uptime. For example, some service providers might promise high availability where the downtime is less than 5.26 minutes per year – the equivalent of 99.999% availability (commonly referred to as “five nines”). Service Level Agreements are standard practice in telecommunications – it is not uncommon for network service providers to explicitly state their own service level agreement on their websites.
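
As a quick check on those figures, here is a short, purely illustrative snippet converting an availability target into the downtime it allows per year:

```python
# Convert an availability target into the downtime it allows per year.
# Purely illustrative; the "five nines" figure quoted above falls out directly.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% availability -> {allowed_downtime_minutes(target):.2f} minutes of downtime per year")
# 99.999% works out to about 5.26 minutes a year; a multi-hour outage blows
# through that budget many times over.
```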

Giffgaff have said that “The data centre have already updated some of their processes and implemented some physical monitoring on the boxes should the direct power be lost and battery power start to be used”. They have also claimed that “the monitoring on the network nodes is being reviewed so that they alert at the right time, giving engineers enough time to get to site before the back up power is at risk”.
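
For illustration, the kind of monitoring they describe boils down to something like the sketch below: alert as soon as a site drops onto battery, and escalate while an engineer still has time to reach it. The threshold and field names are our assumptions, not Giffgaff’s or the data centre’s actual tooling.

```python
# A sketch of the kind of power monitoring Giffgaff describe: alert as soon
# as a site drops onto battery and escalate while an engineer still has time
# to reach it. Threshold and field names are assumptions, not their tooling.
ENGINEER_TRAVEL_MINUTES = 45   # assumed worst-case time to get someone on site


def check_power_status(status):
    """status is a dict like {'on_mains': bool, 'battery_minutes_left': float}."""
    if status["on_mains"]:
        return "ok"

    remaining = status["battery_minutes_left"]
    if remaining <= ENGINEER_TRAVEL_MINUTES:
        # Too late: the batteries may be flat before anyone arrives.
        return "critical: battery may deplete before an engineer reaches the site"

    # Alert early, while there is still time to act - the point the old
    # monitoring apparently missed.
    return f"warning: on battery power, ~{remaining:.0f} minutes remaining"


print(check_power_status({"on_mains": False, "battery_minutes_left": 120}))
print(check_power_status({"on_mains": False, "battery_minutes_left": 20}))
```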

You do expect to get the quality of service you pay for, and this sounds suspiciously like the data centre was not meeting modern standards or that someone messed up pretty severely. There certainly shouldn’t be issues with monitoring the power supply to critical systems, and we sincerely hope that Giffgaff are reviewing their contract with this third party.

For those of you who want a bit more detail about the particular system that was affected, Giffgaff have said that it was a “piece of hardware that manages calls and texts on the network”. This might not seem like the most helpful explanation but let’s think about how a mobile network is structured.

There are three main parts of the core Network Switching Subsystem – the Mobile Switching Centre (MSC), which sets up and releases all connections; the Visitor Location Register (VLR), which stores information about the physical location of each subscriber; and the Home Location Register (HLR), which holds information about each subscriber’s SIM card and phone number. Although we don’t have much information, we can be reasonably sure that this particular outage was due to some part of the Network Switching Subsystem.
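
To picture how those three pieces fit together, here is a toy sketch of the lookups an MSC performs when routing a call. The records and the flow are heavily simplified illustrations with made-up numbers, not real GSM signalling.

```python
# Toy model of the core Network Switching Subsystem described above. The
# records and the lookup flow are heavily simplified illustrations with
# made-up numbers, not real GSM signalling.
HLR = {  # Home Location Register: one permanent record per SIM/subscriber
    "+447700900123": {"imsi": "234100000000001", "current_vlr": "vlr-london"},
}

VLR = {  # Visitor Location Register: where each visiting subscriber currently is
    "vlr-london": {"234100000000001": {"cell": "LDN-0042"}},
}


def route_call(msisdn):
    """Sketch of how an MSC sets up a call: HLR lookup, then VLR lookup."""
    subscriber = HLR.get(msisdn)
    if subscriber is None:
        return "unknown number"

    location = VLR[subscriber["current_vlr"]].get(subscriber["imsi"])
    if location is None:
        return "subscriber not currently registered"

    # The MSC would now set up the connection towards this cell. If the
    # hardware doing this step goes down, calls and texts stop - which
    # matches Giffgaff's description of the failed component.
    return f"route call via {subscriber['current_vlr']}, cell {location['cell']}"


print(route_call("+447700900123"))
```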

Anyway, we hope that explains a bit more of the detail behind this outage. Hopefully you now have a better idea of how a mobile operator works and what can be done to avoid these sorts of issues. Thankfully, this outage is now well in the past. Although we are becoming increasingly reliant on our mobiles, a few hours without signal is usually not an emergency and hopefully can be filed away under first world problems.


10 Responses to “What caused the Giffgaff outage?”

  1. Si 3 October 2013 at 19:05 Permalink

    I understand there were issues yesterday (2nd October) and I certainly had a few messages delayed by an hour or two without really thinking about it. But then I was at home with WiFi so didn’t notice any data outage.

    Today (3rd October) I have been traveling through Essex and Kent and I’ve had no data on GiffGaff since mid-morning. It’s now 7pm and I’ve still got nothing.

    People must be losing their hair at GiffGaff – how does it go so badly wrong?

    • Mobile Network Comparison 3 October 2013 at 19:27 Permalink

      Good question! We thought that data should be fine though and it’s only calls and (sometimes) texts that are being affected.

      • Si 3 October 2013 at 20:11 Permalink

        Hard to say since I hardly use SMS due to having iMessage on my iPhone (which of course relies on a data connection). I’ve had a normal signal most of the day (e.g. three bars) and the carrier shows up (O2-UK) but no “GPRS” or “3G” whatsoever. I can call voicemail, but no-one can call me.

  2. Timple 3 October 2013 at 11:06 Permalink

    Any critical piece of kit in a mobile network should also have geographic redundancy – i.e. when the NSS failed, the system should have automatically directed calls to an alternative NSS in another location. This is obviously expensive because it implies redundancy, testing etc. As far as I know the major networks all employ this. Perhaps because GiffGaff is just an MVNO, all its calls are directed to a single NSS.

    • Mobile Network Comparison 3 October 2013 at 17:29 Permalink

      We have to agree. It does seem strange that Giffgaff are independently responsible for so much of their own equipment compared to O2, especially as they don’t seem to have the expertise/funds to make it truly redundant to modern standards.

      The real question is how come they are being affected so much whereas other, smaller, MVNOs rarely have issues on this scale? Could it be a competence issue somewhere on the technical side?

      • Timple 4 October 2013 at 13:34 Permalink

        I suspect it comes down to billing.

        Most network billing systems have the option to provide MVNO services – so all the same infrastructure is used. They probably just have to pay a bit more to the vendor who sells the billing platform for the option (plus maintenance, training etc).

        But giffgaff has fairly unusual billing schemes. So perhaps what they want to do is beyond the standard system O2 is providing.

        I used to be a giffgaff user and left because they seemed incapable of, and unwilling to, provide an account of how you used your credit. This is their most requested feature – and is fairly standard. I therefore suspect the system is highly proprietary and possibly started as a small project self-developed by an in-house team. It has now gone the classic route of getting out of hand and becoming a nightmare to keep stable due to its size and complexity (as they add on more features).

        • Mobile Network Comparison 4 October 2013 at 14:15 Permalink

          That does make a lot of sense. But they are probably still in a bit of denial about needing to rebuild that infrastructure from scratch.

          • Timple 14 October 2013 at 12:44 Permalink

            They are hardly going to announce that their systems are held together by sellotape and they desperately need a migration to a proper 5 9s platform….!
