Comments
As @monquixote said - TCS could have asked for a DR test and BA said no. DR might not even be part of the outsourcer's scope. My experience of outsourcing is that all sorts of things you'd expect to be covered get deliberately excluded or accidentally omitted.
The most common UPS problem I've seen (data centre jobs, not the little battery things) is standby generators that start, cough and conk out because nobody's been maintaining them properly.
Blame game will start shortly no doubt.
The fact that it's lasted days means that they had no Business Continuity plan either.
I have actually never seen a backup genny work.
The chances of it being on some modern cloud system seems remote. I'd be thinking COBOL on mainframes.
You have to spend the money to make disaster recovery solid, and you do have to test it. Having worked on these kinds of systems for the NHS in Scotland and some of the bigger banks, we tested the DR switchover every year without fail - we learned something every time we did it, and it takes time to get it to be seamless.
At the NHS it required an 8-hour window for the testing, and every year they would ask us if we could omit the DR testing, which we said we could if they waived any DR failures - funnily enough, we then did the tests.
If you don't test your response to a primary system failure, hell mend you - even the best laid plans generally don't work the first time!
I once worked for a company that did not test their DR solution. The building power failed one weekend, the UPSs kicked in and instantly caught fire as they had been wired in wrong - that was a fun week, not.
Currently one of the largest cloud providers is running with 1-minute batteries, but the genset should be up within 20 secs. If the generators don't start first or second time it's game over anyway - DC operations staff are not there to get critical power equipment working if it fails. Any more autonomy than that therefore simply provides time to stand around and wait until either the mains comes back or the inevitable happens.
In the past people would have 15-30mins battery backup and DC managers would have IT running around shutting down servers or have automated programs to do the same as batteries reached end of life. This isn't an option today in most Cloud and colocation facilities.
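The automated shutdown programs mentioned above boil down to a simple decision rule. Here's a minimal sketch of that logic - the threshold values and fake UPS readings are illustrative assumptions, not any real product's behaviour:

```python
# Sketch of automated-shutdown logic: once on battery, begin a graceful
# shutdown when the remaining runtime drops below the time needed to
# quiesce the servers. All numbers below are assumed for illustration.

GRACEFUL_SHUTDOWN_SECS = 300   # assumed time to quiesce a rack cleanly
SAFETY_MARGIN_SECS = 60        # assumed buffer on top of that

def should_shut_down(on_battery: bool, runtime_left_secs: int) -> bool:
    """Decide whether to begin a graceful shutdown now."""
    if not on_battery:
        return False  # mains (or genset) power is present, do nothing
    return runtime_left_secs <= GRACEFUL_SHUTDOWN_SECS + SAFETY_MARGIN_SECS

# With 15-30 min of battery there is room for this to work;
# with 1 min of battery the decision is moot - there is no time to react.
print(should_shut_down(True, 20 * 60))  # 20 min of runtime left
print(should_shut_down(True, 60))       # 1 min of runtime left
```

With a 1-minute battery, the graceful-shutdown window is longer than the total autonomy, which is exactly why the post says this isn't an option in most cloud facilities today.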
Previously critical DCs ran 2N power systems, but this is pretty much a thing of the past. Distributed redundant critical power facilities are the norm now, which still offer resilience, but less of it at less cost.
I have yet to hear whose DC went down, but there are always white knuckles in the industry when it happens waiting to find out, and praying it isn't a facility with your equipment.
I haven't looked into BA in detail, but even if they did outsource to India in 2016, I'd assume their existing systems were passed over as a support contract.
I'd guess at a 2 data centre setup, with kit going back 10-15 years
Many data centres are in quite vulnerable locations, and are subject to physical attack with an EMP weapon too
Sorry @Emp_Fab
http://www.datacenterdynamics.com/content-tracks/security-risk/emp-the-suitcase-that-can-close-down-your-site/94262.fullarticle
There is speculation that ransoms have already been paid by companies in the past
I've seen major organisations with DCs backing onto public pavements
I prefer basket-case.
The drop-off with distance is going to be massive (inverse square), and that's just in air, let alone with a buttload of metal and concrete in the way.
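The inverse-square point is easy to put numbers on. A back-of-envelope sketch - the distances are arbitrary and this ignores shielding entirely, so real attenuation would be far worse:

```python
# Inverse-square falloff: relative intensity at distance d compared
# with a reference distance d0, i.e. I(d)/I(d0) = (d0/d)^2.
# Free-space only; metal and concrete would attenuate much further.

def relative_intensity(d: float, d0: float = 1.0) -> float:
    return (d0 / d) ** 2

for d in (1, 5, 10, 50):
    print(f"{d:>3} m: {relative_intensity(d):.4f} of reference intensity")
```

At 10x the reference distance you're already down to 1% of the intensity, before any building fabric gets in the way.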
It is not considered a single system, with multiple teams working against each other and competing for budgets.
A complete system overhaul is deemed far too expensive. Hopefully the loss of revenue from this failure will be a wake-up call.
That's before you even consider that it was probably not a greenfield development, with existing legacy parts. In a company the size of BA the full architecture is often not fully understood, and it is not uncommon to find critical components running on old servers under someone's desk, etc.
SaaS isn't a silver bullet either unless your software is essentially stateless (and using a DB for state just shifts the problem). The current model of shooting nodes and scaling horizontally by bringing up new instances has some obvious issues (distributed transaction integrity being an obvious one).
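One common mitigation for the shoot-the-node problem is idempotency keys: if a request is retried after a node dies mid-flight, the key stops it being applied twice. A minimal sketch - the in-memory dict stands in for the shared DB the post mentions, and all names are made up:

```python
# Toy idempotency-key pattern: a retried request (e.g. after a node
# was killed and replaced) replays the stored result instead of
# applying the operation a second time.

processed: dict = {}  # idempotency_key -> stored result (stand-in for a DB)

def book_seat(idempotency_key: str, passenger: str) -> str:
    if idempotency_key in processed:
        # Retry after a node died: replay the result, no double-booking.
        return processed[idempotency_key]
    result = f"seat booked for {passenger}"
    processed[idempotency_key] = result
    return result

first = book_seat("req-123", "A. Passenger")
retry = book_seat("req-123", "A. Passenger")  # same key, applied only once
print(first == retry, len(processed))
```

This doesn't solve distributed transaction integrity in general, but it makes the "bring up a new instance and retry" model safe for individual operations.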
No timetables, the Moodle installation was down, the lecturers couldn't access the network drive where they stored the slides - it was a proper disaster.
I know an institution (definitely no names, and NOT BA) that lost its DC's dual power due to a small fire - 3 yrs on, still not fixed - it's like running the whole enterprise on one 13-amp extension lead. Muppets.
Each of the apps that make up the system will have a number of separately installed components/services. Have they documented all of the SPOFs (including all those little scripts created as workarounds)?
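The SPOF inventory being asked about can be as simple as a list of components with their instance counts, flagging anything with no redundancy. A rough sketch - the inventory entries are made up for illustration:

```python
# Trivial SPOF check: any component with fewer than two instances is a
# single point of failure. The inventory below is purely illustrative.

inventory = {
    "booking-api": 4,
    "check-in-web": 2,
    "legacy-fare-calc (script under someone's desk)": 1,
    "message-broker": 3,
}

spofs = [name for name, count in inventory.items() if count < 2]
print("SPOFs:", spofs)
```

The hard part, of course, isn't the check - it's getting that inventory complete and honest in the first place, including the workaround scripts nobody admits to.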
http://www.bbc.co.uk/news/business-40186929
I was part of a team that maintained and repaired standby diesel generator sets. When I first started in the job we'd service the engines every three years and then run them on full building load for 8 hours. Last I heard they don't get touched anymore unless they break, and the maintenance team is down from 25 guys to 4. The engines now get run up (off-load) automatically for 10 minutes once a month... pretty much the worst thing you can do with a diesel engine, as it doesn't get hot enough and fills up with crud and unburnt fuel, eventually diluting the lubricant and totally knackering the bearings. This was all down to an executive decision by an American manager: in the US they traditionally operate a "don't fix it unless it's broke" policy.