British Airways computer failure


Comments

  • Axe_meister Frets: 4633
    BA is the high class escort who has had to start walking the streets but still is able to pull clientele from Chelsea and charge escort prices.
  • Jalapeno Frets: 6390
    I'd assume nothing - other than they aren't using anyone's cloud, they could well still be using mainframes.

    As @monquixote said - TCS could have asked for a DR test and BA said no.  DR might not even be part of the outsourcer's scope. My experience of outsourcing is that all sorts of things you'd expect to be covered get deliberately excluded or accidentally omitted.

    The most common UPS problem I've seen (data centre jobs, not the little battery things) is standby generators that start, cough and conk out because nobody's been maintaining them properly.

    Blame game will start shortly no doubt.
    Imagine something sharp and witty here ......

  • Axe_meister Frets: 4633
    They obviously had no business continuity plan either. Saying that, not many companies do. The business thinks it's an IT problem, but it sits firmly with the business (given that at that point in time there is no IT).
  • CarpeDiem Frets: 291
    Jalapeno said:
    I'd assume nothing - other than they aren't using anyone's cloud, they could well still be using mainframes.

    As @monquixote said - TCS could have asked for a DR test and BA said no.  DR might not even be part of the outsourcer's scope. My experience of outsourcing is that all sorts of things you'd expect to be covered get deliberately excluded or accidentally omitted.

    The most common UPS problem I've seen (data centre jobs, not the little battery things) is standby generators that start, cough and conk out because nobody's been maintaining them properly.

    Blame game will start shortly no doubt.
    When this happens, as it does, I see it as poor management and people not asking the right questions or following process. The same applies to defining outsourcing contracts and their subsequent oversight - but how many companies budget for the typical 40% oversight cost? With the reliance on IT, it has to be seen as a fundamental part of the business. I agree with you that a blame game is likely to start shortly as a result of BA's issue.
  • prowla Frets: 4923
    BA have been offshoring and outsourcing, to the point that they have no control over their systems.
    The fact that it's lasted days means that they had no Business Continuity plan either.
  • monquixote Frets: 17609
    tFB Trader
    Jalapeno said:
    I'd assume nothing - other than they aren't using anyone's cloud, they could well still be using mainframes.

    As @monquixote said - TCS could have asked for a DR test and BA said no.  DR might not even be part of the outsourcer's scope. My experience of outsourcing is that all sorts of things you'd expect to be covered get deliberately excluded or accidentally omitted.

    The most common UPS problem I've seen (data centre jobs, not the little battery things) is standby generators that start, cough and conk out because nobody's been maintaining them properly.

    Blame game will start shortly no doubt.

    I have actually never seen a backup genny work. 

    The chances of it being on some modern cloud system seem remote. I'd be thinking COBOL on mainframes.
  • webrthomson Frets: 1031
    TBH there is plenty of technology that can keep you up and running when your primary data centre goes down, but it's all about cost and then testing that it works.

    You have to spend the money to make disaster recovery solid, and you do have to test it. Having worked on these kinds of systems for the NHS in Scotland and some of the bigger banks, we tested the DR switchover every year without fail - we learned something every time we did it, and it takes time to get it to be seamless.

    At the NHS it required an 8-hour window for the testing, and every year they would ask us if we could omit the DR testing. We said we could if they waived any DR failures - funnily enough, we then did the tests.

    If you don't test your response to primary system failure, hell mend you - even the best-laid plans generally don't work the first time!

    I once worked for a company that did not test their DR solution. The building power failed one weekend, the UPSs kicked in and instantly caught fire because they had been wired in wrong - that was a fun week, not :)
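A minimal sketch of the kind of pre-flight check an annual DR switchover test might start with, in Python. The hostnames, ports and service tiers below are illustrative assumptions only - they are not any real NHS, bank or BA configuration.

```python
# Hypothetical pre-flight check before a DR switchover test: confirm the
# secondary site's services are at least reachable before the window opens.
# Hostnames, ports and the service list are made up for illustration.
import socket
import sys

SECONDARY_SITE = [
    ("dr-db.example.internal", 5432),   # standby database
    ("dr-app.example.internal", 443),   # application tier
    ("dr-lb.example.internal", 443),    # load-balancer VIP
]

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main() -> int:
    failures = [(h, p) for h, p in SECONDARY_SITE if not reachable(h, p)]
    for host, port in failures:
        print(f"DR pre-flight FAILED: {host}:{port} unreachable")
    if failures:
        return 1
    print("DR pre-flight passed: all secondary-site services reachable")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A check like this only tells you the standby kit is alive; the point of the annual switchover is to prove the failover actually carries the load, which no script can do for you.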
  • UPSs in a DC typically have ~5 mins of battery backup.

    Currently one of the largest cloud providers is running with 1-minute batteries, but the genset should be up within 20 secs. If it doesn't start first or second time it's game over anyway; DC operations staff are not there to get critical power equipment working if it fails, so any more autonomy than that simply provides time to stand around and wait until either the mains comes back or the inevitable happens.

    In the past people would have 15-30 mins of battery backup, and DC managers would have IT running around shutting down servers, or automated programs doing the same, as the batteries ran down. This isn't an option today in most cloud and colocation facilities.

    Previously critical DCs ran 2N power systems, but this is pretty much a thing of the past. Distributed redundant critical power facilities are the norm now, which still offer resilience, but less of it at less cost.

    I have yet to hear whose DC went down, but there are always white knuckles in the industry when it happens, waiting to find out and praying it isn't a facility with your equipment.

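For a rough sense of why a DC UPS only buys minutes, the back-of-envelope arithmetic behind the figures above looks like this. The load, battery capacity and efficiency numbers are illustrative assumptions, not measurements from any real facility.

```python
# Back-of-envelope UPS autonomy estimate. All figures are assumed for
# illustration only.
it_load_kw = 1000.0          # assumed critical IT load in the hall (kW)
battery_energy_kwh = 85.0    # assumed usable energy in the battery string (kWh)
inverter_efficiency = 0.95   # assumed UPS inverter efficiency

runtime_minutes = (battery_energy_kwh * inverter_efficiency / it_load_kw) * 60
print(f"Estimated autonomy: {runtime_minutes:.1f} minutes")   # ~4.8 minutes

# A genset that starts and takes load within ~20 s needs only a small slice
# of that window; extra autonomy beyond a couple of start attempts is mostly
# time to stand around and wait.
genset_start_s = 20
print(f"Margin over genset start: {runtime_minutes * 60 - genset_start_s:.0f} s")
```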
  • ToneControl Frets: 11900
    Myranda said:
    If the whole operation is based on cloud Software-as-a-service based somewhere abroad then potentially due to some strange routing issues a power cut in an intervening hop country's incoming fibre station *might* cause problems... the internet is supposed to re-route stuff...

    If the cloud solution is so poor that it's all in a single data-centre and THAT had a power cut long enough for its back-up generators to die ... that could do it - but who operates a system like that?

    If it's a private Software-as-a-service company that's not Amazon or Google cloud sized, then perhaps there are backups in their data centre but it's routed through a set of switches that had a separate power supply without backup generators...

    But realistically given the size of the BA Operation and modern plans to avoid a DDoS attack, in an appropriate out-sourced solution there would be multiple distributed servers, in probably multiple geographic locations, each with decent international fibre access... short of a power cut that somehow affects multiple geographic locations covering every single distributed server I can't see ANY proper out-sourced solution that BA should be using...

    A cheap out-sourced solution ... maybe one server with a half hour backup power solution... once the server is down it might require hands-on intervention to fix...

    Anyone with a budget of tens of thousands of moneys should be able to make a distributed system that's HARD to take down... with the likely millions that BA would throw at it... it should be possible to make it near impossible to kill completely for this amount of time
    Most stuff like this was implemented before cloud came along.
    I haven't looked into BA in detail, but even if they did outsource to India in 2016, I'd assume their existing systems were handed over as a support contract.

    I'd guess at a 2 data centre setup, with kit going back 10-15 years

    Many data centres are in quite vulnerable locations, and are subject to physical attack with an EMP weapon too
    Sorry @Emp_Fab ;

    http://www.datacenterdynamics.com/content-tracks/security-risk/emp-the-suitcase-that-can-close-down-your-site/94262.fullarticle

    There is speculation that ransoms have already been paid by companies in the past

    I've seen major organisations with DCs backing onto public pavements
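The arithmetic behind the "multiple geographic locations" argument in the quoted post is simple: if sites fail independently, the chance of losing all of them at once shrinks geometrically with each extra site. The 99% per-site availability figure below is an illustrative assumption.

```python
# Combined availability of N independently failing sites, assuming an
# illustrative 99% availability per site.
per_site_availability = 0.99
for sites in (1, 2, 3):
    p_all_down = (1 - per_site_availability) ** sites
    print(f"{sites} site(s): combined availability {1 - p_all_down:.6f}")
# 1 site(s): 0.990000, 2 site(s): 0.999900, 3 site(s): 0.999999
```

The catch, as the rest of the thread suggests, is that the sites have to be genuinely independent - shared power, shared networks or shared processes quietly break that assumption.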
  • Emp_Fab Frets: 24310
    "emp the suitcase" ?

    I prefer basket-case.
    Lack of planning on your part does not constitute an emergency on mine.
    Chips are "Plant-based" no matter how you cook them
    Donald Trump needs kicking out of a helicopter
  • monquixote Frets: 17609
    tFB Trader
    I can't see that suitcase being very useful as a weapon.

    The drop-off with distance is going to be massive (inverse square), and that's just in air, let alone with a buttload of metal and concrete in the way.
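To put rough numbers on the inverse-square point: relative intensity at a few distances in free space, normalised to the value at 1 m. Purely illustrative, and it ignores shielding, which only makes things worse for the attacker.

```python
# Relative free-space intensity under an inverse-square law, normalised to 1 m.
for distance_m in (1, 2, 5, 10, 20, 50):
    relative = 1.0 / distance_m ** 2
    print(f"{distance_m:>3} m : {relative:.4f} of the 1 m intensity")
```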
  • Axe_meister Frets: 4633
    These are old green-screen systems: monolithic applications interfacing with probably hundreds of other apps, with no HA/DR testing across those interfaces and each with its own HA/DR configuration.
    It is not treated as a single system; there are multiple teams working against each other and competing for budgets.
    A complete system overhaul is deemed far too expensive. Hopefully the loss of revenue from this failure will be a wake-up call.

  • PolarityMan Frets: 7287
    It's actually pretty hard to make some things properly fault-tolerant, and given the pressure on budgets I'm not surprised a company like BA doesn't have a perfect DR facility.

    That's before you even consider that it was probably not a greenfield development and has existing legacy parts. In a company the size of BA the full architecture is often not fully understood, and it is not uncommon to find critical components running on old servers under someone's desk, etc.

    SaaS isn't a silver bullet either unless your software is essentially stateless (and using a DB for state just shifts the problem). The current model of shooting nodes and scaling horizontally by bringing up new instances has some obvious issues (distributed transaction integrity being one).
    ဈǝᴉʇsɐoʇǝsǝǝɥɔဪቌ
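A toy contrast for the stateless-versus-stateful point above: if a node holds state in memory, "shooting" it and spinning up a fresh instance loses that state, whereas a stateless handler doesn't care which instance serves the request. The classes and figures are purely illustrative, not a sketch of any real BA component.

```python
# Toy contrast between a stateful and a stateless request handler.
# "Shooting" a node is simulated by simply discarding the instance.

class StatefulNode:
    """Keeps a running basket total in local memory (lost if the node dies)."""
    def __init__(self) -> None:
        self.basket_total = 0

    def add_item(self, price: int) -> int:
        self.basket_total += price
        return self.basket_total


class StatelessNode:
    """Every request carries the state it needs; any instance can serve it."""
    @staticmethod
    def add_item(basket_total: int, price: int) -> int:
        return basket_total + price


# Stateful: replace the node mid-session and the basket is gone.
node = StatefulNode()
node.add_item(100)
node = StatefulNode()        # node shot, fresh instance spun up
print(node.add_item(50))     # 50 -- the earlier 100 has vanished

# Stateless: the caller (or a shared store) holds the total, so any fresh
# instance gives the same answer. Pushing the state into a DB just moves
# the resilience problem onto the DB, as the post says.
total = StatelessNode.add_item(0, 100)
total = StatelessNode.add_item(total, 50)
print(total)                 # 150
```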
  • olafgarten Frets: 1648
    This reminds me of the time my university's IT system went down: they had no proper backup and everything was shut down for around a week.

    No timetables, the Moodle installation was down, the lecturers couldn't access the network drive where they stored their slides - it was a proper disaster.
  • Jalapeno Frets: 6390
    These are old green-screen systems: monolithic applications interfacing with probably hundreds of other apps, with no HA/DR testing across those interfaces and each with its own HA/DR configuration.
    It is not treated as a single system; there are multiple teams working against each other and competing for budgets.
    A complete system overhaul is deemed far too expensive. Hopefully the loss of revenue from this failure will be a wake-up call.

    See, now I'd disagree - it's far more likely to be a power failure; the tech is irrelevant (that, and mainframes are the most reliable boxes on the planet - plus you can pull bits out and replace them with no service break).

    I know an institution (definitely no names, and NOT BA) that lost its DC dual power due to a small fire - 3 yrs on, still not fixed - it's like running the whole enterprise on one 13-amp extension lead.  Muppets.
    Imagine something sharp and witty here ......

  • Axe_meister Frets: 4633
    The mainframe itself is reliable, but if the DC is down that does not help.
    Every other app that makes up the system will have a number of separately installed components/services. Have they documented all of these SPOFs (including all those little scripts created as workarounds)?
  • boogieman Frets: 12365
    edited June 2017
    UPSs in a DC typically have ~5 mins of battery backup.

    Currently one of the largest cloud providers is running with 1-minute batteries, but the genset should be up within 20 secs. If it doesn't start first or second time it's game over anyway; DC operations staff are not there to get critical power equipment working if it fails, so any more autonomy than that simply provides time to stand around and wait until either the mains comes back or the inevitable happens.

    In the past people would have 15-30 mins of battery backup, and DC managers would have IT running around shutting down servers, or automated programs doing the same, as the batteries ran down. This isn't an option today in most cloud and colocation facilities.

    Previously critical DCs ran 2N power systems, but this is pretty much a thing of the past. Distributed redundant critical power facilities are the norm now, which still offer resilience, but less of it at less cost.

    I have yet to hear whose DC went down, but there are always white knuckles in the industry when it happens, waiting to find out and praying it isn't a facility with your equipment.

    This ^ . When I first started work for BT they had huge open wet-cell batteries that powered the whole exchange and gave an hour's backup. Over time everything got moved over to what amounts to car batteries at the end of each rack of equipment. They were optimistically rated at 15 minutes' backup but realistically give a lot less.

    I was part of a team that maintained and repaired standby diesel generator sets. When I first started in the job we'd service the engines every three years and then run them on full building load for 8 hours. Last I heard they don't get touched any more unless they break. The maintenance team is down from 25 guys to 4. The engines now get run up (off load) automatically for 10 minutes once a month... pretty much the worst thing you can do with a diesel engine, as it doesn't get hot enough and fills up with crud and unburnt fuel, eventually diluting the lubricant and totally knackering the bearings. This was all down to an executive decision by an American manager: in the US they traditionally operate a "don't fix it unless it's broke" policy.
  • chillidoggy Frets: 17136
    When I were at sea, we used to run the emergency generator set every Sunday. It was supposed to cut in if the main generators failed, but the leckys never used to put the things on load because it was such a pain in the arse, and might mean them actually having to do something. As a result in many years, I don't think I ever saw one under load.


  • boogieman Frets: 12365
    When I were at sea, we used to run the emergency generator set every Sunday. It was supposed to cut in if the main generators failed, but the leckys never used to put the things on load because it was such a pain in the arse, and might mean them actually having to do something. As a result in many years, I don't think I ever saw one under load.
    A PITA if it goes tits up, I'm guessing? It's actually not difficult to do (although I know nothing about ships' electrics or how they work). In the exchange we'd just pull the mains breaker and let the engine set kick in. The biggest PITA for us was telling the building staff why their PC/kettle/toaster/printer had stopped working, even though there were notices all round the building telling them when we were going to do it.